Process dialectal Finnish in Python

When we Finns write online or message friends and family, we hardly ever use standard written Finnish (kirjakieli). Instead, we write the way we speak (in puhekieli), because standardized spelling would make our messages sound a bit too official. For computers, this is quite a challenge, as most of the datasets used to build NLP tools consist of well-curated normative text. 🤯 Luckily, as always, there's a solution for processing spoken Finnish text in Python. ☺️

Normalize dialectal Finnish in Python

Normalization is the process of automatically converting dialectal text into standard written Finnish. This makes higher-level NLP tasks, such as dependency parsing, easier, because tools built on normative text can then be applied to the normalized output. Start off by installing the library.

pip3 install murre
python3 -m murre.download

After the Murre library has been installed, you can use it from your Python code to process dialectal Finnish. 🐕

from murre import normalize_sentence

print(normalize_sentence("mä syön paljo karkkii".split(" ")))
>> minä syön paljon karkkia

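The function normalize_sentence takes a tokenized sentence, that is, a list of words rather than a single string. Splitting on spaces, as above, is the simplest way to get such a list, but it leaves punctuation attached to the neighboring words. For more robust tokenization, you might want to take a look at the Moses tokenizer.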
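To see why plain splitting can fall short, here is a minimal sketch of a regex-based tokenizer built only on the standard library. It is a stand-in for illustration, not the Moses tokenizer and not part of Murre; the names TOKEN_PATTERN and tokenize are made up for this example.

```python
import re

# Hypothetical helper, not part of murre. A token is either a word
# (Unicode letters such as ä and ö are covered by \w, and internal
# hyphens or apostrophes are kept) or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:[-'’]\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into a list of word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


sentence = "mä syön paljo karkkii!"

# Plain splitting keeps the exclamation mark glued to the last word.
print(sentence.split(" "))
# ['mä', 'syön', 'paljo', 'karkkii!']

# The regex tokenizer separates it into its own token.
print(tokenize(sentence))
# ['mä', 'syön', 'paljo', 'karkkii', '!']
```

A list like the second one could then be passed to normalize_sentence instead of the result of split(" "). How Murre treats the punctuation tokens themselves is not covered here, so it is worth checking the output on your own data.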

Cite

As always, don't forget to cite the paper if you use the library in an academic publication. ☺️

Niko Partanen, Mika Hämäläinen, and Khalid Alnajjar. 2019. Dialect Text Normalization to Normative Standard Finnish. In the Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT).