General_Zhihu_Corpus.zip (218.32 MB)

General Zhihu Corpus

Dataset posted on 15.05.2019, 23:50 by Andrew Ireland

Chinese-language corpus containing 3,434 questions and 231,939 answers posted to Zhihu.com.


Questions taken from 10 popular topics:

“Culture” (文化), “Education” (教育), “Art” (艺术), “University” (大学), “The Internet” (互联网), “Psychology” (心理), “Technology” (科技), “Health” (健康), “Career Development” (职业发展), “Lifestyle” (生活方式)


Includes R scripts used to extract data.

Data extracted in April 2019.


Files are questions (Q), answers (A) and question topics (T).

Each file is named after the URL of the webpage it was extracted from:

For questions:

https://www.zhihu.com/question/[question number]

For answers:

https://www.zhihu.com/question/[question number]/answer/[answer number]
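
As an illustration only, the two URL patterns above can be split back into their question and answer numbers. This is a minimal Python sketch, not part of the dataset or its R scripts: the function name parse_zhihu_url is hypothetical, the numbers in the usage note are made up, and how the URL is encoded in the actual file names on disk is not stated here, so the input is assumed to be the full URL.

```python
import re

# Matches the two URL patterns given in the description.
# Assumption: question and answer numbers consist of digits only.
_ZHIHU_URL = re.compile(
    r"^https://www\.zhihu\.com/question/(\d+)(?:/answer/(\d+))?/?$"
)


def parse_zhihu_url(url):
    """Return the kind of page and its identifiers for a Zhihu URL.

    Hypothetical helper: raises ValueError if the URL follows
    neither the question pattern nor the answer pattern.
    """
    match = _ZHIHU_URL.match(url)
    if match is None:
        raise ValueError(f"not a Zhihu question or answer URL: {url!r}")
    question_number, answer_number = match.group(1), match.group(2)
    return {
        "kind": "answer" if answer_number else "question",
        "question": question_number,
        "answer": answer_number,  # None for question pages
    }
```

For example, parse_zhihu_url("https://www.zhihu.com/question/123/answer/456") reports an answer page belonging to question 123, which is one way to link each answer (A) back to its question (Q).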


Answers are organised by author category ("male", "female", "undisclosed gender", "anonymous", "organisation"), based on information from the user's profile where publicly accessible.


Short Answers: ≤1,000 characters

Medium Answers: 1,001-4,999 characters

Long Answers: ≥5,000 characters
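
The length thresholds above can be expressed as a small classifier. This is a hedged Python sketch rather than the method used to build the corpus: the function name length_category is hypothetical, and it assumes that length means the number of Unicode characters returned by len(), since the description does not say whether whitespace or punctuation was counted.

```python
def length_category(answer_text):
    """Classify an answer as short, medium or long by character count.

    Assumption: every character counts, including whitespace and
    punctuation; the dataset description does not specify this.
    """
    n_chars = len(answer_text)
    if n_chars <= 1000:
        return "short"   # up to and including 1,000 characters
    if n_chars <= 4999:
        return "medium"  # 1,001 to 4,999 characters
    return "long"        # 5,000 characters or more
```

Because the three ranges are contiguous, every answer falls into exactly one category.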

