GCP is a two-million-word corpus of Persian containing a large body
of naturally-occurring utterances and written texts. GCP represents a
wide variety of Persian speakers and documents diverse language
situations and uses. The corpus is composed of a balanced representation
of spoken and written language. The spoken sub-corpus includes a
balanced representation of non-scripted spoken utterances and scripted
spoken data. The written sub-corpus, on the other hand, represents both
printed and electronic written materials. The corpus further contains a
diverse representation of different genres and text types.
The conversational sub-corpus is distributed through the Linguistic Data
Consortium and is available at https://catalog.ldc.upenn.edu/LDC2019T11.
The combination of three different genres also creates the written Corpus
of Law, Academic, and News (CLAN), which is available at
https://catalog.ldc.upenn.edu/LDC2020T23.