GCP is a two-million-word corpus of Persian containing a large body of naturally occurring utterances and written texts. GCP represents a wide variety of Persian speakers and documents diverse language situations and uses. The corpus balances spoken and written language. The spoken sub-corpus, in turn, balances non-scripted and scripted spoken data. The written sub-corpus represents both printed and electronic written materials. The corpus further contains a diverse range of genres and text types.
The conversational sub-corpus is distributed through the Linguistic Data Consortium and is available at https://catalog.ldc.upenn.edu/LDC2019T11.
Three of the written genres (law, academic, and news) are also combined to form the written Corpus of Law, Academic, and News (CLAN), which is available at https://catalog.ldc.upenn.edu/LDC2020T23.