A guest article by Sophie Jaeger, PhD researcher at Central European University
In text-as-data projects, researchers often turn to machine translation because NLP models are pre-trained primarily on English data and require large corpora of text. The sheer size of such a corpus can make it practically impossible for researchers to translate their text data by hand. In my research on anti-authoritarian opposition politics in exile, I use Telegram channel data to analyse coordination between exiled Russian opposition groups. After scraping a data set of over 46,000 individual texts, I used machine translation to enable further data transformation with tools such as named entity recognition (NER) models.
Researchers who work with the Google Cloud Translation API and the R package “googleLanguageR” can follow six best practices to translate their text data successfully and within the constraints of their research budget.
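To illustrate the basic workflow, here is a minimal sketch of a translation call with googleLanguageR. It assumes the Cloud Translation API is enabled in your Google Cloud project and that you have a service-account key file; the key-file path and the sample texts are placeholders for illustration, not part of my actual pipeline.

```r
library(googleLanguageR)

# Authenticate with a service-account key file (placeholder path)
gl_auth("my-service-key.json")

# Two toy Russian sentences standing in for scraped Telegram posts
texts <- c(
  "Оппозиция объявила о новой акции.",  # "The opposition announced a new protest."
  "Канал опубликовал заявление."        # "The channel published a statement."
)

# gl_translate() returns a tibble containing the translated text,
# the detected source language, and the original input
translated <- gl_translate(texts, target = "en")
translated$translatedText
```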

By following these six simple steps, researchers can translate large corpora into English and several other languages. Among other things, this allows them to use sophisticated NLP models for data analysis that would otherwise not have been available for their source language.