Using Machine Translation to Prepare Non-English Text Data for Analysis in the Social Sciences

13-12-2024

Machine translation solutions can be powerful tools for social scientists across different disciplines. Researchers frequently have to translate their text corpora before performing further analysis because many natural language processing (NLP) models are optimised for English. Machine translation tools can facilitate this process and are easily integrated into R-based workflows.


A guest article by Sophie Jaeger, PhD researcher at Central European University

In text-as-data projects, researchers often turn to machine translation for two reasons: most NLP models are pre-trained primarily on English data, and the analyses typically require large corpora of text. The sheer size of such a corpus can make it practically impossible for researchers to translate their text data by hand. In my research on anti-authoritarian opposition politics in exile, I use Telegram channel data to analyse coordination between exiled Russian opposition groups. After scraping a data set of over 46,000 individual texts, I used machine translation so that the data could be processed further with tools such as named entity recognition (NER) models.

Researchers who work with the Google Cloud Translation API and the R package “googleLanguageR” can follow six best practices to translate their text data successfully and within the constraints of their research budget.

By following these six best practices, researchers can translate large corpora into English and several other languages. Among other things, this allows them to use sophisticated NLP models for data analysis that would otherwise not have been available for their source language.
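
As a rough illustration of the workflow described above, the following R sketch counts the characters in a small corpus before sending it to the Google Cloud Translation API through “googleLanguageR”. It is a minimal example under stated assumptions, not the exact code used in my project: the example texts, the helper `count_chars()` and the key file name `my-service-account.json` are hypothetical placeholders. Counting characters first is useful because the API is billed per character, so the total gives a basis for estimating cost against a research budget.

```r
# Hypothetical example texts standing in for a scraped corpus.
texts <- c("Привет, мир!", "Добрый день")

# The Cloud Translation API is billed per character, so count the
# characters in the corpus before sending anything to the service.
count_chars <- function(x) sum(nchar(x, type = "chars"))
n_chars <- count_chars(texts)
message("Characters to translate: ", n_chars)

# Hypothetical path to a Google Cloud service-account key file.
key_file <- "my-service-account.json"

# Only call the API if the package is installed and the key file exists,
# so the sketch can be run safely without credentials.
if (requireNamespace("googleLanguageR", quietly = TRUE) && file.exists(key_file)) {
  googleLanguageR::gl_auth(key_file)

  # With the default (empty) `source`, the API detects the source language.
  translated <- googleLanguageR::gl_translate(texts, target = "en")

  # The result is a data frame; the translations are in `translatedText`.
  print(translated$translatedText)
}
```

For a corpus of tens of thousands of texts, the same call can be applied to smaller batches of the text vector, which makes it easier to save intermediate results and to stop before a budget limit is reached.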