
A Large Parallel Corpus of Full-Text Scientific Articles

Version 2 2019-01-02, 20:27
Version 1 2018-11-13, 20:30
dataset
posted on 2019-01-02, 20:27 authored by Felipe Soares, Viviane Pereira Moreira, Karin Becker
NOTE FOR WMT PARTICIPANTS:
An easier-to-use version for MT is available in Moses format (one sentence per line). The names of these files start with moses_like.
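
As a minimal sketch of how one-sentence-per-line files can be consumed, the Python snippet below pairs up lines from two aligned files. It assumes UTF-8 encoding and that line N in one file corresponds to line N in the other. The function name read_parallel and the file names in the usage comment are hypothetical, since the actual names of the moses_like files are not listed here.

```python
from itertools import zip_longest
from pathlib import Path
from typing import Iterator, Tuple


def read_parallel(src_path: Path, tgt_path: Path) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned files.

    Assumes UTF-8 text with one sentence per line in each file.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        # zip_longest pads the shorter file with None, which lets us
        # detect files whose line counts do not match.
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            yield s.rstrip("\r\n"), t.rstrip("\r\n")


# Hypothetical usage; replace the paths with the files you downloaded:
# for en, pt in read_parallel(Path("moses_like.en"), Path("moses_like.pt")):
#     print(en, "|||", pt)
```

Raising an error on a length mismatch, rather than silently truncating, is a deliberate choice here: a missing line would otherwise shift every later sentence pair out of alignment.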

If you use this dataset, please cite the following work:
@InProceedings{L18-1546,
  author = 	"Soares, Felipe
		and Moreira, Viviane
		and Becker, Karin",
  title = 	"A Large Parallel Corpus of Full-Text Scientific Articles",
  booktitle = 	"Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018)",
  year = 	"2018",
  publisher = 	"European Language Resource Association",
  location = 	"Miyazaki, Japan",
  url = 	"http://aclweb.org/anthology/L18-1546"
}

We developed a parallel corpus of full-text scientific articles collected from the SciELO database in the following languages: English, Portuguese, and Spanish. The corpus is sentence-aligned for all language pairs, and trilingually aligned for a small subset of sentences.
