General Zhihu Corpus

posted on 15.05.2019, 23:50 by Andrew Ireland

Chinese language corpus containing 3,434 questions and 231,939 answers posted to

Questions taken from 10 popular topics:

“Culture” (文化), “Education” (教育), “Art” (艺术), “University” (大学), “The Internet” (互联网), “Psychology” (心理), “Technology” (科技), “Health” (健康), “Career Development” (职业发展), “Lifestyle” (生活方式)

Includes R scripts used to extract data.

Data extracted in April 2019.

Files are questions (Q), answers (A) and question topics (T).

The naming convention is the URL of the webpage:

For questions: [question number]

For answers: [question number]/answer/[answer number]
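
As a rough illustration of the naming convention, the Python sketch below pulls the question number and, where present, the answer number out of a name. It is not part of the corpus's own R scripts. The names ZhihuRef and parse_ref are hypothetical, and the sketch assumes both numbers consist only of digits and that the name ends with the pattern shown above.

```python
import re
from typing import NamedTuple, Optional


class ZhihuRef(NamedTuple):
    """Hypothetical container for the identifiers encoded in a name."""
    question: str
    answer: Optional[str]  # None when the name refers to a question only


# Assumption: identifiers are runs of digits at the end of the name.
_PATTERN = re.compile(r"(\d+)(?:/answer/(\d+))?/?$")


def parse_ref(name: str) -> ZhihuRef:
    """Extract the question number and optional answer number from a name."""
    match = _PATTERN.search(name)
    if match is None:
        raise ValueError(f"name does not follow the convention: {name!r}")
    # Numbers are kept as strings so that no digits are lost or reformatted.
    return ZhihuRef(question=match.group(1), answer=match.group(2))
```

For example, parse_ref("12345/answer/678") gives question "12345" and answer "678", while parse_ref("12345") gives question "12345" and no answer. These numbers are made up for illustration.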

Answers are organised by author category ("male", "female", "undisclosed gender", "anonymous", "organisation"), based on information from the author's profile where publicly accessible.

Short Answers: ≤1,000 characters

Medium Answers: 1,001-4,999 characters

Long Answers: ≥5,000 characters
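
The length bands above can be expressed as a small Python function. This is a minimal sketch, not the code used to build the corpus. The function name length_category is hypothetical, and it assumes length is counted with Python's len(), which counts each Chinese character as one. Whether the original count included whitespace, punctuation or markup is not stated here.

```python
def length_category(answer_text: str) -> str:
    """Return 'short', 'medium' or 'long' according to the character count."""
    # Assumption: len() counts Unicode code points, so one Chinese
    # character contributes 1 to the total.
    n = len(answer_text)
    if n <= 1000:
        return "short"   # Short Answers: up to and including 1,000 characters
    if n < 5000:
        return "medium"  # Medium Answers: 1,001 to 4,999 characters
    return "long"        # Long Answers: 5,000 characters or more
```

An answer of exactly 1,000 characters falls in the short band, one of 1,001 characters in the medium band, and one of 5,000 characters in the long band.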


