A Novel Pipeline for Domain Detection and Selecting In-domain Sentences in Machine Translation Systems

Pourmostafa Roshan Sharami, Javad; Shterionov, Dimitar; Spronck, Pieter

doi:10.6084/m9.figshare.14829030.v2

J. Pourmostafa, D. Shterionov, P. Spronck – CLIN31 – 2021_compressed.pdf (999.08 kB)

Version 2 2021-07-09, 17:24

Version 1 2021-06-23, 14:25

poster

posted on 2021-07-09, 17:24 authored by Javad Pourmostafa Roshan Sharami, Dimitar Shterionov, Pieter Spronck

General-domain corpora are becoming increasingly available for Machine Translation (MT) systems. However, achieving high translation quality in domain-specific MT requires corpora that cover the same or a comparable domain. Such domain-specific corpora are often scarce and cannot be used in isolation to train domain-specific MT systems effectively. This work aims to improve in-domain MT through (i) a novel unsupervised pipeline for identifying the distributions of different domains within a corpus and (ii) a data selection technique that leverages in-domain monolingual or parallel data to select domain-specific sentences from general corpora according to the distribution defined in (i).
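
The abstract does not say how sentences are scored, so the following is only a minimal sketch of the general idea behind step (ii), not the authors' pipeline. It assumes a simple bag-of-words representation and cosine similarity to a centroid built from the in-domain data. The function names (`bow`, `cosine`, `select_in_domain`), the `top_k` parameter, and the toy sentences are all hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def bow(sentence):
    """Lowercased bag-of-words counts for one sentence (assumed representation)."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_in_domain(in_domain, general, top_k):
    """Rank general-corpus sentences by similarity to the in-domain centroid
    and return the top_k closest ones (hypothetical selection rule)."""
    centroid = Counter()
    for sentence in in_domain:
        centroid.update(bow(sentence))
    ranked = sorted(general, key=lambda s: cosine(bow(s), centroid), reverse=True)
    return ranked[:top_k]


# Toy data, invented for illustration only.
in_domain = [
    "the patient received a dose of the drug",
    "the drug reduced the patient fever",
]
general = [
    "the stock market fell sharply today",
    "the patient was given the drug twice daily",
    "football fans celebrated the win",
]

selected = select_in_domain(in_domain, general, top_k=1)
print(selected)
```

In practice a selection step of this kind would more plausibly rely on learned sentence embeddings than on raw word counts; the counts are used here only to keep the sketch self-contained.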