Fantastic corpora and where to find them

I have compiled a list of places where one can look for corpora. The list is not limited to a single language; the resources listed here are multilingual.

When one is dealing with computational linguistics or just plain linguistics/philology, the question that arises is where to get a corpus. That's the question BA and MA thesis supervisors will continue to ask their students until they get an answer. And the question is a justified one. There are a lot of corpora out there, but how on earth are you supposed to find one?! 🤔

First stop: National academies

In many countries, there's a tradition of having a state-supported institute, an academy, that is dedicated to studying the official language(s) of the country. Here are some links to such institutes and their resources:

Kotus (Finnish)
Real Academia Española (Spanish)
Svenska akademien (Swedish)
Giellatekno (Small Uralic languages) (Not a national academy, though)

WordNet and its translations

Frankly, I am not a big fan of WordNet myself, but surprisingly some people do find it useful. WordNet contains semantic relations between words, such as synonymy and hypernymy. This hand-crafted network has also been translated into many other languages. Some of the versions in other languages are direct translations of the English one and can thus be used directly in multilingual tools, while other WordNets are built separately and use their own synsets.
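
To make the idea of synsets and hypernyms a bit more concrete, here is a minimal sketch in Python. The tiny network below is hand-made for illustration and is not real WordNet data, and the names `SYNSETS`, `HYPERNYMS`, `synonyms` and `hypernym_chain` are made up for this example.

```python
# A toy, hand-made network in the spirit of WordNet (not real WordNet data).
# Each synset id maps to the words (lemmas) that share that meaning.
SYNSETS = {
    "dog.n": ["dog", "domestic dog"],
    "canine.n": ["canine", "canid"],
    "carnivore.n": ["carnivore"],
    "animal.n": ["animal", "creature", "beast"],
}

# Hypernym relation: each synset points to its more general synset.
HYPERNYMS = {
    "dog.n": "canine.n",
    "canine.n": "carnivore.n",
    "carnivore.n": "animal.n",
}


def synonyms(word):
    """Return the other words that share a synset with `word`."""
    found = []
    for lemmas in SYNSETS.values():
        if word in lemmas:
            found.extend(lemma for lemma in lemmas if lemma != word)
    return found


def hypernym_chain(synset_id):
    """Follow hypernym links upwards until there is no more general synset."""
    chain = []
    while synset_id in HYPERNYMS:
        synset_id = HYPERNYMS[synset_id]
        chain.append(synset_id)
    return chain


print(synonyms("animal"))
print(hypernym_chain("dog.n"))
```

A real WordNet works on the same principle, only with far more synsets and several other relation types.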

Universal dependencies with syntactic markup

Sometimes you need a corpus that has syntactic annotations. This is especially the case for machine learning approaches to syntactic parsing. Luckily, the Universal Dependencies project provides such corpora for multiple languages with unified annotations. In addition, they may contain lemmas and morphological annotations.
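
Universal Dependencies treebanks are distributed in the CoNLL-U format: one token per line with ten tab-separated columns, comment lines starting with `#`, and a blank line between sentences. As a rough sketch of how such a file could be read in Python, assuming a hand-written and simplified example sentence (not taken from any treebank) and a helper called `parse_conllu` that I made up for this post:

```python
# The ten columns of the CoNLL-U format used by Universal Dependencies.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# A hand-written, simplified example sentence (not from a real treebank).
SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)


def parse_conllu(text):
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            # Comment lines carry sentence-level metadata; skipped here.
            continue
        else:
            columns = line.split("\t")
            # Keep only well-formed token lines with all ten columns.
            if len(columns) == len(FIELDS):
                current.append(dict(zip(FIELDS, columns)))
    if current:
        sentences.append(current)
    return sentences


sentences = parse_conllu(SAMPLE)
for token in sentences[0]:
    print(token["form"], token["lemma"], token["upos"],
          token["head"], token["deprel"])
```

The `head` column holds the id of the token's syntactic head (0 for the root), which is all you need to rebuild the dependency tree. The sketch keeps every column as a string and does nothing special with multiword tokens, so treat it as a starting point only.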

Parallel corpora for machine translation

A machine translation system requires parallel corpora, meaning that the same text appears in multiple languages. This kind of resource can then be used to build a statistical machine translation system with a tool such as Moses. Thanks to the EU, we have Europarl in the EU languages, but there's another resource as well, namely OPUS.
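
Parallel corpora for tools like Moses typically come as two plain-text files with one sentence per line, where line i in one file is the translation of line i in the other. Here is a minimal sketch of pairing them up in Python. The two in-memory "files", their contents and the function name `read_parallel` are assumptions of mine for the sake of the example.

```python
import io

# Two tiny hand-written "files"; in practice these would be opened from disk,
# for example open("corpus.en", encoding="utf-8") (hypothetical file name).
english = io.StringIO("Good morning.\nThank you.\n\n")
finnish = io.StringIO("Hyvää huomenta.\nKiitos.\n\n")


def read_parallel(source, target):
    """Pair line i of `source` with line i of `target`, skipping empty pairs."""
    pairs = []
    for src_line, tgt_line in zip(source, target):
        src, tgt = src_line.strip(), tgt_line.strip()
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


pairs = read_parallel(english, finnish)
for src, tgt in pairs:
    print(src, "|||", tgt)
```

Note that `zip` stops at the shorter file, so if the two files have a different number of lines, something has gone wrong with the alignment and it is worth checking before training anything.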

Novels, poems and such

Sometimes you don't need annotations, but rather corpora that represent a certain style of literature such as novels or poems. The go-to resource for copyright-free books is Project Gutenberg, but Wikisource also has a nice collection of free-to-use literature.

Metashare

Metashare is an effort made by many universities to open their corpora for others to use. Metashare is a bit all over the place, because many universities have their own "Metashares" and, in addition, the licensing may vary a lot. Some corpora are released into the public domain while some require permission.

More collections of corpora

There are other websites apart from Metashare that list corpora made by a plethora of people and research groups. Such sites include LRE and Datahub.

Conclusion

Hopefully this list helps you get started. Just remember that corpora are everywhere! I myself have hand-annotated recited poetry and TV shows such as the Simpsons, South Park and Archer for my research purposes. If you can't find a ready-made corpus, build one yourself! 😊
