Files (2):
en_pt_publish.db (507.48 MB)
Moses_Like_ENPT.tar.gz (154.71 MB)

A Parallel Corpus of Thesis and Dissertations Abstracts

Version 2 2019-01-21, 09:04
Version 1 2018-11-13, 20:30
dataset
posted on 2019-01-21, 09:04 authored by Felipe Soares
NOTE FOR WMT PARTICIPANTS:
An easier-to-use version for MT is available in Moses format (one sentence per line). The file names start with moses_like.
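
For readers unfamiliar with the Moses layout, the following is a minimal sketch of how two line-aligned files can be read as sentence pairs. It assumes, as is conventional for this format, one file per language with line N of one file translating line N of the other, and UTF-8 encoding. Neither the encoding nor the exact file names inside the archive are stated on this page, and the function name `read_moses_pairs` is hypothetical.

```python
import tempfile
from itertools import zip_longest
from pathlib import Path
from typing import Iterator, Tuple


def read_moses_pairs(src_path: str, tgt_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned files.

    Assumption: both files are UTF-8 text with one sentence per line.
    """
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        # zip_longest pads the shorter file with None, so a length
        # mismatch is detected instead of being silently truncated.
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            yield s.rstrip("\n"), t.rstrip("\n")


if __name__ == "__main__":
    # Tiny self-contained demo on temporary files (not the real corpus).
    with tempfile.TemporaryDirectory() as tmp:
        en = Path(tmp) / "sample.en"
        pt = Path(tmp) / "sample.pt"
        en.write_text("Good morning.\nThank you.\n", encoding="utf-8")
        pt.write_text("Bom dia.\nObrigado.\n", encoding="utf-8")
        for en_sent, pt_sent in read_moses_pairs(str(en), str(pt)):
            print(en_sent, "|||", pt_sent)
```

To use it on the corpus, extract Moses_Like_ENPT.tar.gz and pass the paths of the English and Portuguese files (whose names start with moses_like) in place of the temporary files above.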

If you use this dataset, please cite the following work:
@inproceedings{soares2018parallel,
  title={A Parallel Corpus of Theses and Dissertations Abstracts},
  author={Soares, Felipe and Yamashita, Gabrielli Harumi and Anzanello, Michel Jose},
  booktitle={International Conference on Computational Processing of the Portuguese Language},
  pages={345--352},
  year={2018},
  organization={Springer}
}


In Brazil, the governmental body responsible for overseeing and coordinating post-graduate programs, CAPES, keeps records of all theses and dissertations presented in the country. Information regarding such documents can be accessed online in the Thesis and Dissertations Catalog (TDC), which contains abstracts in Portuguese and English, and additional data regarding such documents. Thus, this database can be a potential source of parallel corpora for the Portuguese and English languages. In this article, we present the development of a parallel corpus from TDC, which is made available by CAPES under the open data initiative. Approximately 240,000 documents were collected and aligned using the Hunalign algorithm. We demonstrate the capability of our developed corpus by training Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) models for both language directions, followed by a comparison with Google Translator (GT). Both of our translation models achieved better BLEU scores than GT, with the NMT system being the most accurate. Sentence alignment was also manually evaluated, presenting an average of XX% correctly aligned sentences. Our parallel corpus is freely available in TMX format, with complementary information regarding document metadata.
