figshare
File(s) not publicly available

GCP

online resource
posted on 2021-07-26, 18:01 authored by Ariana Mohammadi
GCP is a two-million-word corpus of Persian containing a large body of naturally occurring utterances and written texts. GCP represents a wide variety of Persian speakers and documents diverse language situations and uses. The corpus is composed of a balanced representation of spoken and written language. The spoken sub-corpus includes a balanced representation of non-scripted spoken utterances and scripted spoken data. The written sub-corpus, on the other hand, represents both printed and electronic written materials. The corpus further contains a diverse representation of different genres and text types.
The conversational sub-corpus is distributed through the Linguistic Data Consortium and is available at https://catalog.ldc.upenn.edu/LDC2019T11.
The combination of three different genres also creates the written Corpus of Law, Academic, and News (CLAN), which is available at https://catalog.ldc.upenn.edu/LDC2020T23.
