Korp and Python. Access corpora from your Python code!

If you have done language technology in a Nordic country, you have probably heard about Korp. And by now, you have probably developed some sort of love-hate relationship with it. My initial thought was: Korp is nice, but so what 🤷🏼‍♂️, I need to access it programmatically for it to be of any use. The fact that the API description is somewhat hidden online, and that not every Korp service openly advertises the URL of its API, doesn't help at all. 😩

Luckily, once again, yours truly has been typing in some code to make your life easier. 🤓 Behold, my very own Python library for querying Korp. 😊

Install my Korp API Python library

The installation couldn't be any easier, for my library is distributed through PyPI.

All you have to run is: pip install korp (put sudo in front only if you are installing system-wide and your setup requires it)
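If you want to confirm that the install landed in the Python environment you are actually running, a quick check like the one below can help. This is only a minimal sketch: is_installed is a hypothetical helper name, not something the korp library provides, and it relies solely on the standard library.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found by the current interpreter.

    Hypothetical helper for illustration; not part of the korp library.
    """
    return importlib.util.find_spec(module_name) is not None


# Reports whether the korp package is visible to this interpreter.
print("korp installed:", is_installed("korp"))
```

If this prints False right after a successful install, pip most likely installed the package for a different Python interpreter than the one you are running.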

An example

In the following example, we are going to use the Korp instance at CSC's Kielipankki by setting service_name to "kielipankki". Other possible values are "GT" for Giellatekno and "språkbanken" for the Swedish Språkbanken.

We will list the corpora in Kielipankki whose names start with FTB2, which limits our query to the Finnish TreeBank v2. In the query, we specify that we want concordances for the lemma koira. As a result, we get the total number of hits in the corpus and all the concordances.

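from korp.korp import Korp

# Connect to the Korp instance hosted by Kielipankki
korppi = Korp(service_name="kielipankki")
# Keep only the corpora whose names start with FTB2 (Finnish TreeBank v2)
corpora = korppi.list_corpora(limit_by_prefix="FTB2")

# Match every token whose lemma is "koira"
query = '[lemma="koira"]'

# Returns the total hit count and the concordances themselves
total_number, concordances = korppi.all_concordances(query, corpora)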
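The query string above is easy to write by hand, but if you build queries from variables, the quoting gets fiddly. Here is a minimal sketch of a helper that assembles a single-token query. Note the assumptions: cqp_token is a hypothetical function and not part of the korp library, and the attribute names you can use (lemma, pos and so on) depend on how each corpus is annotated.

```python
def cqp_token(**attributes):
    """Build a single-token query expression such as [lemma="koira"].

    Hypothetical helper for illustration; not part of the korp library.
    """
    if not attributes:
        # An empty pair of brackets matches any single token.
        return "[]"
    parts = []
    for name, value in attributes.items():
        # Escape double quotes so the value cannot end its own string early.
        # Everything else is passed through untouched, because the query
        # language treats attribute values as regular expressions.
        safe = str(value).replace('"', '\\"')
        parts.append(f'{name}="{safe}"')
    # Several conditions on the same token are joined with &.
    return "[" + " & ".join(parts) + "]"


query = cqp_token(lemma="koira")
print(query)  # prints [lemma="koira"]
```

The resulting string can be passed to all_concordances exactly like the handwritten query in the example.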


More information

My Korp library can do more than just query concordances. Take a look at the wiki page to see a listing of everything it can do. 😁 In case of questions or comments, don't hesitate to contact me. ☺️

A fuller description of what the Korp API can do is available on Kielipankki's website. For more advanced usage, it wouldn't hurt to take a look at it.