wntm.rar (4.82 MB)
Datasets of the word network topic model
Abstract: This dataset contains one day's micro-blogs sampled from Weibo (http://weibo.com),
represented as bags of words.
-----------------------------------------------------
Data Set Characteristics: Text
Number of Micro-blogs: 189,223
Total Number of Words: 3,252,492
Size of the Vocabulary: 20,942
Associated Tasks: short text topic modeling, etc.
-----------------------------------------------------
About Preprocessing
For tokenization, we used NLPIR. Stop words and words with a term frequency of less than 20 were removed.
In addition, words consisting of only one Chinese character were also removed.
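
As a rough illustration of the filtering step (this is only a sketch, not the exact preprocessing
pipeline; tokenization with NLPIR is assumed to have been done already, and the stop-word list is
not included here), the following Python snippet applies the same thresholds:

    from collections import Counter

    def filter_tokens(docs, stopwords, min_tf=20):
        # docs: list of token lists produced by a tokenizer such as NLPIR
        # (tokenization itself is not reproduced here).
        # Removes stop words, words whose corpus-wide term frequency is
        # below min_tf, and single-character words.
        tf = Counter(token for doc in docs for token in doc)
        keep = lambda t: t not in stopwords and tf[t] >= min_tf and len(t) > 1
        return [[t for t in doc if keep(t)] for doc in docs]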
-----------------------------------------------------
Data Format
The released data are formatted as follows:
[document_1]
[document_2]
...
[document_M]
where each line is one document. [document_i] is the i-th document of the dataset and consists of a
list of Ni words/terms:
[document_i] = [word_i1] [word_i2] ... [word_iNi]
where all [word_ij] (i = 1..M, j = 1..Ni) are text strings separated by a blank (space) character.
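
A minimal Python sketch for reading this format is shown below (the file name wntm.txt is only a
placeholder; substitute the name of the file extracted from wntm.rar):

    def load_corpus(path="wntm.txt"):  # placeholder file name
        # One document per line; words are separated by blank characters.
        with open(path, encoding="utf-8") as f:
            return [line.split() for line in f if line.strip()]

    docs = load_corpus()
    print(len(docs))       # M, the number of micro-blogs
    print(docs[0][:10])    # first ten words of document 1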
-----------------------------------------------------
If you have any questions about the data set, please contact: jichang@buaa.edu.cn.