Towards a Multilingual Aligned Parallel Corpus

journal contribution
posted on 2016-11-22, 19:40, authored by Imad Zeroual
Nowadays, there are many satisfactory studies on monolingual corpora, and the amount of available monolingual data has grown significantly in recent years. Unfortunately, not all types of corpora have benefited equally from this growth. One example is the multilingual aligned parallel corpus, of which only a few exist in the cross-language research area. The goal of this work is therefore to produce a new multilingual aligned parallel corpus and to increase the amount of work being carried out on building such corpora. In this paper, we describe ongoing work on creating a multilingual aligned parallel corpus of subtitles from TEDx Talks events. The corpus currently contains roughly 6,000 aligned multilingual subtitle files covering 200 video talks in different languages (Arabic, English, French, Spanish, Italian, etc.) and a variety of topics such as Business, Education, and Environment. The corpus is divided into two sub-corpora. The first contains about 200 files for each of 15 languages, and the second is available in 30 languages with an average of roughly 100 files per language.
