Runet - Corpus

Our corpus consists of Russian tweets and blog posts from the Russian section of the HC Corpus. The HC Corpus is a multilingual corpus compiled by Hans Christensen; his methodology, including language identification and subject classification, can be found on the page linked above, and he can be contacted at hc.corpus@gmail.com. After downloading the Russian section from this site, we refined it further in three steps.

First, we collected a list of standard Russian words by identifying the distinct words in a corpus of Russian as used at the United Nations. We obtained this corpus from OPUS, an open-source collection of parallel corpora that provides texts in many languages. Specifically, we used the Russian part of the MultiUN corpus, which can be found here.
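As a rough illustration of this first step, the sketch below collects the distinct words from a few lines of text. It is not the script we used: the tokenization rule (lowercased runs of Cyrillic letters), the function name `distinct_words`, and the two sample sentences are all assumptions made for the example.

```python
import re

# Assumed tokenization rule: a word is a run of lowercase Cyrillic letters (including "ё").
CYRILLIC_WORD = re.compile(r"[а-яё]+")


def distinct_words(lines):
    """Return the set of distinct lowercased Cyrillic words found in an iterable of lines."""
    vocab = set()
    for line in lines:
        vocab.update(CYRILLIC_WORD.findall(line.lower()))
    return vocab


# Hypothetical stand-in for lines read from the Russian MultiUN text.
sample = ["Генеральная Ассамблея приняла резолюцию.", "Резолюция была принята."]
standard_words = distinct_words(sample)
```

Note that this treats inflected forms such as "резолюция" and "резолюцию" as separate words, since no lemmatization is applied.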

Next, we compared this list to the list of words from the Russian tweets and blog posts in order to compile a list of "non-standard" words, that is, words that appear in the tweets and blog posts but not in the UN texts. This list contained many slang and swear words, but it also included a variety of technology and niche hobby-related words. We therefore refined the list by hand, removing words that are standard Russian but simply do not come up in UN discussion.
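The comparison itself amounts to a set difference followed by a manual filter. The following is a minimal sketch under that assumption; the function name, the toy word sets, and the `reviewed_standard` set (standing in for the hand review) are hypothetical.

```python
def non_standard_candidates(social_words, standard_words):
    """Words seen in tweets/blogs but absent from the UN word list, sorted for review."""
    return sorted(set(social_words) - set(standard_words))


# Toy data standing in for the two word lists.
standard = {"резолюция", "была", "принята"}
social = {"была", "лол", "айфон", "принята"}

candidates = non_standard_candidates(social, standard)

# Hypothetical result of the hand review: words judged standard despite
# being absent from UN texts (for example, technology vocabulary).
reviewed_standard = {"айфон"}
non_standard = [word for word in candidates if word not in reviewed_standard]
```

Sorting the candidates is only a convenience for the person reviewing the list; it does not affect which words are kept.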

Finally, we extracted those tweets containing the non-standard words we identified, and marked them up in XML. This XML file contains information about the date of the tweet, as well as tags for slang, internet slang, swear words, hashtags, and retweets. This file can be found here. We created a similar file for Russian blogs, extracting the blog posts with non-standard language. That file can be found here, but we ultimately decided not to analyze these blogs because we were not convinced that they were representative of the Russian blogosphere. All the matching blog posts were categorized as being about food, which was certainly false. Some of the posts were in neither Russian nor English. And all of them were from either WordPress or Blogspot, which are not nearly as popular as LiveJournal in Russia.
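To make the extraction and markup step concrete, here is a small sketch using Python's standard `xml.etree.ElementTree`. It does not reproduce the schema of our actual file: the element names (`tweets`, `tweet`, `internet_slang`, `hashtag`), the `date` attribute format, the tokenization, and the sample tweets are all assumptions, and retweet tagging is left out for brevity.

```python
import re
import xml.etree.ElementTree as ET

# Assumed tokenization: hashtags, words, whitespace runs, or single punctuation marks.
TOKEN = re.compile(r"#\w+|\w+|\s+|[^\w\s]")


def mark_up_tweet(text, date, tagged_words):
    """Build a <tweet> element, wrapping hashtags and listed words in tags.

    tagged_words maps a lowercased word to a (hypothetical) tag name.
    """
    tweet = ET.Element("tweet", {"date": date})
    last = None  # most recently added child element, if any
    for token in TOKEN.findall(text):
        tag = "hashtag" if token.startswith("#") else tagged_words.get(token.lower())
        if tag is not None:
            last = ET.SubElement(tweet, tag)
            last.text = token
        elif last is None:
            # Plain text before the first child goes in the parent's text.
            tweet.text = (tweet.text or "") + token
        else:
            # Plain text after a child goes in that child's tail.
            last.tail = (last.tail or "") + token
    return tweet


def build_corpus(tweets, tagged_words):
    """Keep only tweets containing at least one listed word and mark them up."""
    root = ET.Element("tweets")
    for date, text in tweets:
        words = {w.lower() for w in re.findall(r"\w+", text)}
        if not words.isdisjoint(tagged_words):
            root.append(mark_up_tweet(text, date, tagged_words))
    return root


# Hypothetical input: (date, text) pairs and a one-entry non-standard word list.
tweets = [("2012-01-01", "Это лол #тест"), ("2012-01-02", "Обычный текст")]
tagged = {"лол": "internet_slang"}
root = build_corpus(tweets, tagged)
xml_text = ET.tostring(root, encoding="unicode")
```

In this example only the first tweet is kept, because the second contains no word from the non-standard list.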