File(s) not publicly available

GCP

online resource

posted on 2021-07-26, 18:01 authored by Ariana Mohammadi

GCP is a two-million-word corpus of Persian containing a large body of naturally occurring utterances and written texts. GCP represents a wide variety of Persian speakers and documents diverse language situations and uses. The corpus is balanced between spoken and written language. The spoken sub-corpus includes a balanced representation of non-scripted and scripted spoken data. The written sub-corpus, in turn, represents both printed and electronic written materials. The corpus further contains a diverse representation of genres and text types.

The conversational sub-corpus is distributed through the Linguistic Data Consortium and is available at https://catalog.ldc.upenn.edu/LDC2019T11.

The combination of three different genres also creates the written Corpus of Law, Academic, and News (CLAN), which is available at https://catalog.ldc.upenn.edu/LDC2020T23.