Datasets of word network topic model

2017-11-05T03:01:38Z (GMT) by Jichang Zhao
<div>Abstract: This dataset holds the content of one day's micro-blogs sampled from Weibo (http://weibo.com) </div><div>in the form of bags-of-words.</div><div><br></div><div>-----------------------------------------------------</div><div><br></div><div>Data Set Characteristics: Text</div><div>Number of Micro-blogs: 189,223</div><div>Total Number of Words: 3,252,492</div><div>Size of the Vocabulary: 20,942</div><div>Associated Tasks: short-text topic modeling, etc.</div><div><br></div><div>-----------------------------------------------------</div><div><br></div><div>About Preprocessing</div><div><br></div><div>For tokenization, we used NLPIR. Stop words and words with a term frequency of less than 20 were removed. In addition,</div><div>words containing only one Chinese character were also removed.</div><div><br></div><div>-----------------------------------------------------</div><div><br></div><div>Data Format</div><div><br></div><div>The released data is formatted as follows:</div><div><br></div><div>[document_1]</div><div>[document_2]</div><div>...</div><div>[document_M]</div><div><br></div><div>in which each line is one document. [document_i] is the i-th document of the dataset and consists of a </div><div>list of Ni words/terms.</div><div><br></div><div>[document_i] = [word_i1] [word_i2] ... [word_iNi]</div><div><br></div><div>in which all [word_ij] (i=1..M, j=1..Ni) are text strings separated by blank characters.</div><div><br></div><div>-----------------------------------------------------</div><div><br></div><div>If you have any questions about the data set, please contact: jichang@buaa.edu.cn.</div>
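For reference, the one-document-per-line format above can be read with a short Python sketch. The filename `weibo_bow.txt` and the helper names are illustrative, not part of the release; the `min_count=20` threshold in the vocabulary helper simply mirrors the term-frequency cutoff described in the preprocessing notes, applied here to an already-released file rather than reproducing the original pipeline.

```python
from collections import Counter


def load_documents(path):
    """Return a list of documents, each a list of word tokens.

    Each line of the file is one document; words are blank-separated,
    as described in the Data Format section.
    """
    docs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()  # split on any run of blanks
            if tokens:             # skip empty lines
                docs.append(tokens)
    return docs


def vocabulary(docs, min_count=20):
    """Vocabulary after dropping words seen fewer than min_count times.

    This echoes the term-frequency filter from the preprocessing notes;
    on the released data every word should already pass the cutoff.
    """
    counts = Counter(word for doc in docs for word in doc)
    return {word for word, count in counts.items() if count >= min_count}


if __name__ == "__main__":
    docs = load_documents("weibo_bow.txt")  # hypothetical local filename
    print(len(docs), "documents,", len(vocabulary(docs)), "vocabulary words")
```

From `docs` it is straightforward to build whatever representation a short-text topic model expects, e.g. a word-to-id mapping or sparse count vectors.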