When we Finns write online or message with friends and family, we hardly ever resort to using standard written Finnish (kirjakieli). Instead, we write as we would speak (in puhekieli), as using the standardized spelling would make our communication sound a bit too official. Now, for computers, this is quite a challenge, as most of the datasets used for NLP tools consist of well-curated normative text. 🤯 Luckily, as always, there's a solution for processing spoken Finnish text in Python. ☺️
Normalize dialectal Finnish in Python
Normalization is a process where dialectal text is turned into standardized written Finnish automatically. This will make higher-level NLP tasks, such as dependency parsing, easier. Start off by installing the library.
pip3 install murre
python3 -m murre.download
After the Murre library has been installed, you can use it from your Python code to process dialectal Finnish. 🐕
from murre import normalize_sentence
print(normalize_sentence("mä syön paljo karkkii".split(" ")))
>> minä syön paljon karkkia
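The function normalize_sentence takes in a tokenized sentence, that is, a string that has already been split into a list of words. Splitting on spaces works for simple input, but it leaves punctuation attached to the words, so you might want to take a look at the Moses tokenizer for more robust tokenization.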
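To see why a plain split can fall short, here is a minimal sketch comparing it with a punctuation-aware tokenizer. Note that this is a simple regex-based stand-in, not the Moses tokenizer, and the helper name simple_tokenize is made up for this illustration (it is not part of Murre). The list it returns has the same shape as the input normalize_sentence expects.

```python
import re

# Hypothetical helper, not part of murre: words (optionally joined by an
# apostrophe or hyphen) become one token, every other non-space symbol
# becomes its own token.
_TOKEN_RE = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")


def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return _TOKEN_RE.findall(text)


sentence = "mä syön paljo karkkii, mut en tänää!"

# Naive split: punctuation stays glued to the neighbouring word.
naive_tokens = sentence.split(" ")
print(naive_tokens)

# Regex-based split: punctuation is separated into its own tokens.
tokens = simple_tokenize(sentence)
print(tokens)

# With murre installed, the token list could then be passed on:
# from murre import normalize_sentence
# print(normalize_sentence(tokens))
```

With the naive split, "karkkii," and "tänää!" reach the normalizer with punctuation attached, while the regex version hands over clean word tokens. For real-world text, a dedicated tokenizer is still the safer choice.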
As always, don't forget to cite the paper if you use the library in an academic publication. ☺️
Niko Partanen, Mika Hämäläinen, and Khalid Alnajjar. 2019. Dialect Text Normalization to Normative Standard Finnish. In the Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT).