The Turku dependency parsers, both the statistical and the neural one, are no doubt among the most important recent NLP tools developed for Finnish. Without them, doing NLP for Finnish would be extremely difficult. This post explains how to use them easily to parse Finnish from your Python code. 🐍 (more…)
Linguistics is a category in the broad sense of the word. It could just as well be called Philology. Posts ranging from syntax to sociolinguistics fall into this category.
VISL CG3 is a neat tool for running constraint grammars (CGs) for things such as morphological disambiguation or syntactic parsing. Grammars in this formalism have been developed for a great many endangered Uralic languages, boosting NLP for them. And these CGs are actually easily available in UralicNLP for Python programmers.
Even with UralicNLP, a tool called vislcg3 needs to be installed on your machine, and that can be tricky if you cannot find the correct binaries for your operating system. Therefore, I put together this guide. (more…)
Morphology is the study of morphemes, the smallest meaning-bearing units of human language. Inflected words can be divided into morphemes, e.g. -ed in talked is a morpheme that adds the meaning of past tense to the verb talk; -s in dogs pluralizes the noun and so on. Morphemes that are attached to words in this way are known as affixes. There are different kinds of affixes, and in this post we are going to look at them more closely. 🤓 (more…)
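To make the talked and dogs examples concrete, here is a minimal Python sketch that peels a suffix off a word. The `SUFFIXES` table and the `split_suffix` function are made up for this illustration; it is a deliberately naive toy, not a real morphological analyser.

```python
# Toy illustration only: a hypothetical two-entry suffix table for English.
SUFFIXES = {"ed": "PAST", "s": "PLURAL"}


def split_suffix(word):
    """Return (stem, suffix, label); suffix and label are empty if nothing matches."""
    for suffix, label in SUFFIXES.items():
        stem = word[: -len(suffix)]
        # Require a non-empty stem so that "s" alone is not split.
        if word.endswith(suffix) and stem:
            return stem, suffix, label
    return word, "", ""


print(split_suffix("talked"))
print(split_suffix("dogs"))
```

A lookup this simple will happily mis-split words such as bus, which is exactly why real analysers rely on a lexicon and rules rather than string matching.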
If you have done language technology in a Nordic country, you have probably heard about Korp. And by now, you have probably developed some sort of a love-hate relationship with it. My initial thought was: Korp is nice, but so what 🤷🏼‍♂️, I need to access it programmatically for it to be of any use. The fact that the API description is somewhat hidden online and that not all Korp services are open about the URL of their API doesn't really help at all. 😩
Oh, sarcasm, sarcasm. The thing that puzzles us so much. It takes some knowledge of a person to know whether they are being sarcastic or not, regardless of how sarcastic we are ourselves. But is there any science behind it? As it turns out, there is, and I wrote my MA thesis about it in Spanish. But if you don't have time to read it, just read this post instead. 😅 (more…)
If you are interested in generating Finnish with a computer (NLG), you have probably already run into the problem of the complex morphology and syntax of Finnish. In addition to knowing how to inflect words, you have to take agreement into account. For example, the verb agrees with the subject in number and person: minä syön, sinä syöt and so on. And there's more: case government has to be solved too. A verb takes its direct object in a certain case; for example, you would say uneksin autosta but näen auton. Such is the problem of natural language generation. 🤷🏼‍♂️
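The two problems can be sketched with a few hand-written lookup tables in Python. Everything here (the table names, `clause`, `verb_phrase`, the case labels) is hypothetical and covers only the example words from this teaser; a real system would generate the forms with a morphological generator instead of listing them.

```python
# Toy illustration only: hand-written forms for the examples in the text.
PRONOUNS = {"minä": ("1", "sg"), "sinä": ("2", "sg")}

# Present-tense verb forms keyed by (person, number).
VERB_FORMS = {
    "syödä": {("1", "sg"): "syön", ("2", "sg"): "syöt"},
    "uneksia": {("1", "sg"): "uneksin"},
    "nähdä": {("1", "sg"): "näen"},
}

# Case government: which case each verb requires on its object.
# "accusative" here means the genitive-looking total object form.
GOVERNMENT = {"uneksia": "elative", "nähdä": "accusative"}

NOUN_FORMS = {"auto": {"elative": "autosta", "accusative": "auton"}}


def clause(subject, verb):
    """Agreement: pick the verb form that matches the subject's person and number."""
    return f"{subject} {VERB_FORMS[verb][PRONOUNS[subject]]}"


def verb_phrase(verb, noun, features=("1", "sg")):
    """Government: inflect the object in the case the verb requires."""
    case = GOVERNMENT[verb]
    return f"{VERB_FORMS[verb][features]} {NOUN_FORMS[noun][case]}"


print(clause("sinä", "syödä"))
print(verb_phrase("uneksia", "auto"))
print(verb_phrase("nähdä", "auto"))
```

The point of the sketch is the separation of concerns: agreement depends on the subject, government depends on the verb, and both have to be resolved before any word can be inflected.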
Languages can be grouped together in different ways. One can put languages together based on their family relation (e.g. Uralic languages, Indo-European languages) or the area where they are spoken. But maybe the most interesting and eye-opening way of grouping them is by their morphology. As it turns out, there are only four morphological groups for languages and all spoken languages fall into one of them. (more…)
HFST (Helsinki Finite-State Transducer Technology) is a neat tool for modelling the morphology of languages computationally. The problem is that currently, the Python API is under-documented. But fear not, in this post you will learn how to load optimised lookup files in Python and use them to analyse and generate word forms. 😃
When you are targeting an international audience and you have enough money to back your project up, the thing to do is to localize your application. Thinking that everyone knows English is just naive. This is a general guide that shows how the process of localization works. (more…)