Poetry is a complex genre of literature, and it is often something that we wouldn't expect computers to be able to understand. Poems play with language by intentionally breaking its structure and by expressing meaning through figurative language such as metaphors. A lot of the expressive power of poetry lies in its ability to evoke certain imagery and certain emotions in the reader. 😌 As it turns out, it is possible to analyze Finnish poetry automatically in Python. 😃 (more…)
Linguistics is a category in a broad sense of the word. It could very well be Philology. Posts varying from syntax to sociolinguistics fall into this category.
When we Finns write online or message friends and family, we hardly ever resort to standard written Finnish (kirjakieli). Instead, we write as we would speak (in puhekieli), as using the standardized spelling would make our communication sound a bit too official. Now, for computers, this is quite a challenge, as most of the datasets used for NLP tools represent well-curated normative text. 🤯 Luckily, as always, there's a solution for processing spoken Finnish text in Python. ☺️ (more…)
The Turku dependency parsers, both the statistical and the neural one, are no doubt among the most important recent NLP tools developed for Finnish. Without them, doing NLP for Finnish would be extremely difficult. This post explains how to use them easily to parse Finnish from your Python code. 🐍 (more…)
VISL CG3 is a neat tool for running constraint grammars (CGs) for things such as morphological disambiguation or syntactic parsing. Grammars in this formalism have been developed for a great many endangered Uralic languages, which has boosted NLP for them. And these CGs are actually easily available in UralicNLP for Python programmers.
Even for UralicNLP, a tool called vislcg3 needs to be installed on your machine, and it might be a tricky task if you cannot find the correct binaries for your operating system. Therefore I tailored this guide.(more…)
Morphology is the study of morphemes, the smallest meaning-bearing units of human language. Words that are inflected can be divided into morphemes, e.g. -ed in talked is a morpheme that adds the meaning of the past tense to the verb talk; -s in dogs pluralizes the noun, and so on. These morphemes that are added to words are known as affixes. There are different kinds of affixes, and in this post we are going to look at them more closely. 🤓 (more…)
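To make the talked and dogs examples concrete, here is a minimal Python sketch that strips a suffix off a word and reports the meaning it adds. The suffix table and the function name `split_suffix` are my own assumptions for illustration; this is a toy and not a real morphological analyzer, so it will happily mis-segment words like "bed" or "bus".

```python
# Toy suffix table (an assumption for illustration): suffix -> meaning it adds.
SUFFIXES = {
    "ed": "past tense",
    "s": "plural",
}


def split_suffix(word):
    """Split a word into (stem, suffix, meaning) using the toy table.

    Returns (word, None, None) when no listed suffix matches.
    """
    # Try longer suffixes first so that "ed" is checked before "s".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        # Require a non-empty stem in front of the suffix.
        if word.endswith(suffix) and len(word) > len(suffix):
            stem = word[: -len(suffix)]
            return stem, "-" + suffix, SUFFIXES[suffix]
    return word, None, None


for example in ("talked", "dogs", "talk"):
    print(example, "->", split_suffix(example))
```

Real affixation is much messier than string matching, which is why hand-written tables like this only serve to show the idea of a stem plus an affix.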
If you have done language technology in a Nordic country, you have probably heard about Korp. And by now, you have probably developed some sort of a love-hate relationship with it. My initial thought was: Korp is nice, but so what 🤷🏼‍♂️, I need to access it programmatically for it to be of any use. The fact that the API description is somewhat hidden online and that not all Korp services are open about the URL of their API doesn't really help at all. 😩
Oh, sarcasm, sarcasm. The thing that puzzles us so much. It takes some knowledge of a person to know whether they are being sarcastic or not, regardless of how sarcastic we are ourselves. But is there any science behind it? As it turns out, there is, and I wrote my MA thesis about it in Spanish. But if you don't have time to read it, just read this post instead. 😅 (more…)
If you are interested in generating Finnish with a computer (NLG), you have probably already run into the problem of the complex morphology and syntax of Finnish. In addition to knowing how to inflect words, you have to take agreement into account. For example, the verb agrees with the subject in number and person: minä syön, sinä syöt and so on. And there's more: case government has to be solved too. A verb takes its direct object in a certain case; for example, you would say uneksin autosta but näen auton. Such is the problem of natural language generation. 🤷🏼‍♂️
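The two problems mentioned here, agreement and the case a verb demands from its object, can be sketched with plain lookup tables in Python. Everything below is a hypothetical toy built only from the examples in this excerpt: the table names, the `realize` function and the case labels are my assumptions, and a real system would use a morphological generator instead of hard-coded forms.

```python
# Verb forms indexed by (lemma, subject pronoun): this models agreement.
VERB_FORMS = {
    ("syödä", "minä"): "syön",
    ("syödä", "sinä"): "syöt",
    ("uneksia", "minä"): "uneksin",
    ("nähdä", "minä"): "näen",
}

# The case each verb requires of its object (labels are simplified).
GOVERNED_CASE = {
    "uneksia": "elative",   # uneksin autosta
    "nähdä": "genitive",    # näen auton (genitive-form total object)
}

# Noun forms indexed by (lemma, case).
NOUN_FORMS = {
    ("auto", "elative"): "autosta",
    ("auto", "genitive"): "auton",
}


def realize(subject, verb, obj=None, drop_subject=False):
    """Build a tiny clause with subject-verb agreement and object case."""
    words = [] if drop_subject else [subject]
    # Agreement: pick the verb form that matches the subject.
    words.append(VERB_FORMS[(verb, subject)])
    if obj is not None:
        # Case government: the verb decides the case of its object.
        words.append(NOUN_FORMS[(obj, GOVERNED_CASE[verb])])
    return " ".join(words)


print(realize("minä", "syödä"))
print(realize("sinä", "syödä"))
print(realize("minä", "uneksia", "auto", drop_subject=True))
print(realize("minä", "nähdä", "auto", drop_subject=True))
```

The point of the sketch is only that the verb form depends on the subject while the object form depends on the verb, so both lookups are needed before a sentence can be put together.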
Languages can be grouped together in different ways. One can put languages together based on their family relation (e.g. Uralic languages, Indo-European languages) or the area where they are spoken. But maybe the most interesting and eye-opening way of grouping them is by their morphology. As it turns out, there are only four morphological groups for languages and all spoken languages fall into one of them. (more…)