Machine Translation as an Underrated Ingredient? Solving Classification Tasks with Large Language Models for Comparative Research

Mate, Akos; Sebok, Miklos; Wordliczek, Lukasz; Stolicki, Dariusz; Feldmann, Adam

doi:10.6084/m9.figshare.24025845.v1

journal contribution

posted on 2023-08-24, 15:55, authored by Akos Mate, Miklos Sebok, Lukasz Wordliczek, Dariusz Stolicki, Adam Feldmann

While large language models have revolutionised computational text analysis methods, the field is still tilted towards English-language resources. Although pre-trained models exist for some "smaller" languages, coverage is far from universal, and pre-training large language models is expensive and complicated. This uneven language coverage limits the geographical and linguistic scope of comparative social research. We propose a solution that sidesteps these issues by leveraging transfer learning and open-source machine translation. We use English as a bridge language between Hungarian and Polish bills and laws to solve a classification task based on the Comparative Agendas Project (CAP) coding scheme.

Using the Hungarian corpus as training data for model fine-tuning, we categorise the Polish laws into 20 CAP categories. In doing so, we compare the performance of Transformer-based deep learning models (monolingual, such as BERT, and multilingual, such as XLM-RoBERTa) and machine learning algorithms (e.g., SVM). Results show that the fine-tuned large language models outperform the traditional supervised learning benchmarks but are themselves surpassed by the machine translation approach. Overall, the proposed solution demonstrates a viable option for applying a transfer learning framework to low-resource languages and achieving state-of-the-art results without requiring expensive pre-training.
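
The bridge-language idea described above can be illustrated with a minimal, purely illustrative Python sketch. It is not the authors' pipeline. It assumes a toy word-for-word lexicon in place of a real open-source machine translation system, a small hand-written Naive Bayes classifier in place of the fine-tuned Transformer or SVM models, and two invented example documents per language. All names here (`translate_to_english`, `train_naive_bayes`, `predict`, `HU_EN`, `PL_EN`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def translate_to_english(text, lexicon):
    """Toy stand-in for machine translation: word-by-word lexicon lookup."""
    return " ".join(lexicon.get(tok, tok) for tok in text.lower().split())


def train_naive_bayes(docs, labels):
    """Count word frequencies per label (multinomial Naive Bayes statistics)."""
    word_counts = defaultdict(Counter)
    label_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab


def predict(model, doc):
    """Return the label with the highest Laplace-smoothed log-probability."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for tok in doc.split():
            if tok in vocab:  # ignore words never seen in training
                score += math.log(
                    (word_counts[label][tok] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Assumed toy lexicons; a real setup would call a machine translation model.
HU_EN = {"egészségügy": "health", "kórház": "hospital",
         "adó": "tax", "költségvetés": "budget"}
PL_EN = {"zdrowie": "health", "szpital": "hospital",
         "podatek": "tax", "budżet": "budget"}

# Invented Hungarian training examples with CAP-style topic labels.
hu_train = [("egészségügy kórház", "Health"),
            ("adó költségvetés", "Macroeconomics")]
train_docs = [translate_to_english(text, HU_EN) for text, _ in hu_train]
train_labels = [label for _, label in hu_train]
model = train_naive_bayes(train_docs, train_labels)

# Invented Polish documents, translated into the same bridge language.
pl_docs = ["szpital zdrowie", "podatek budżet"]
predictions = [predict(model, translate_to_english(d, PL_EN)) for d in pl_docs]
print(predictions)
```

Because both corpora are mapped into English before training and prediction, the classifier never needs to see Hungarian or Polish tokens directly. That is the property the abstract relies on when it uses English as a bridge language.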