Code of word network topic model
Dataset posted on 05.11.2017, 03:05 by Jichang Zhao

1. Introduction
---------------
Class PrepareInput converts the original documents into a word co-occurrence network, re-weights it, and saves it as pseudo documents. The output of PrepareInput can be used directly as the input of jGibbsLDA. Class InferenceTopicsForOrgDocs is used to infer the topics of the original documents after jGibbsLDA has been run.

* Note that jGibbsLDA is free software written by Xuan-Hieu Phan. More details can be found at http://jgibblda.sourceforge.net.
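To illustrate the idea behind PrepareInput, here is a hedged Java sketch of one common way to build a word co-occurrence network and expand a word's adjacency list into a pseudo document. The class name, method names, sliding-window scheme, and raw-count weighting are illustrative assumptions, not the actual implementation (WNTM additionally re-weights the network):

```java
import java.util.*;

// Illustrative sketch only: slide a window over each tokenized document,
// count co-occurrences, and turn each word's adjacency list into a pseudo
// document. Window size and count-based weighting are assumptions.
public class WordNetworkSketch {

    // word -> (neighbor -> co-occurrence count)
    static Map<String, Map<String, Integer>> buildNetwork(List<String[]> docs, int window) {
        Map<String, Map<String, Integer>> net = new HashMap<>();
        for (String[] doc : docs) {
            for (int i = 0; i < doc.length; i++) {
                for (int j = i + 1; j < doc.length && j <= i + window; j++) {
                    if (doc[i].equals(doc[j])) continue;   // no self-loops
                    addEdge(net, doc[i], doc[j]);
                    addEdge(net, doc[j], doc[i]);          // undirected network
                }
            }
        }
        return net;
    }

    static void addEdge(Map<String, Map<String, Integer>> net, String a, String b) {
        net.computeIfAbsent(a, k -> new HashMap<>()).merge(b, 1, Integer::sum);
    }

    // Expand a word's adjacency list into one pseudo document:
    // each neighbor is repeated according to its edge weight.
    static List<String> pseudoDocument(Map<String, Map<String, Integer>> net, String word) {
        List<String> pseudo = new ArrayList<>();
        net.getOrDefault(word, Map.of()).forEach((nbr, w) -> {
            for (int k = 0; k < w; k++) pseudo.add(nbr);
        });
        return pseudo;
    }
}
```

Each pseudo document produced this way plays the role of one "document" for jGibbsLDA, so the learned topics describe words' local contexts rather than whole original documents.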
2. Installation
---------------

Straightforward Java compilation can be done with the following commands:

> tar -xzf wntm.tar.gz
> cd wntm
> javac *.java
3. Usage
--------

> java PrepareInput

Example usage:

> java PrepareInput sample.txt ./ sample 10

* Note that constructing the word network might require a large amount of memory, especially when the original input file is large. For example, an original file of 250MB needed 7GB of memory in our experiment. In this case, one can use the following command to configure the maximum memory available to Java:

> java -Xmx10g PrepareInput sample.txt ./ sample 10
4. Output
---------

When PrepareInput completes, it outputs two files. The file with the suffix ".word" stores all the nodes in the word network. The file with the suffix ".adjacent" stores the pseudo documents, ready to be used as input to jGibbsLDA.
5. Inference
------------

Since WNTM models topics over each word's adjacent-node list, we need to infer the topic distribution of the original documents once jGibbsLDA has finished training on the pseudo documents. One can use the following command to do inference:

> java InferenceTopicsForOrgDocs <.words file> <.theta file output by jGibbsLDA>

to get the topics of the original documents, i.e. the .theta file for the original corpus.
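The inference step can be pictured as follows: jGibbsLDA's .theta file holds one topic distribution per pseudo document (i.e. per word in the .words file), and an original document's distribution can be obtained by aggregating the distributions of its words. This Java sketch shows one simple aggregation (a plain average); the actual file parsing and any weighting or smoothing in InferenceTopicsForOrgDocs may differ, and the names here are assumptions:

```java
import java.util.*;

// Hedged sketch: average the per-word topic distributions (rows of the
// .theta file, keyed by the .words file) over the words of an original
// document. The aggregation rule is an assumption for illustration.
public class InferDocTopicsSketch {

    static double[] docTopics(String[] doc, Map<String, double[]> wordTheta, int numTopics) {
        double[] theta = new double[numTopics];
        int seen = 0;
        for (String w : doc) {
            double[] wt = wordTheta.get(w);
            if (wt == null) continue;              // word absent from the network
            for (int k = 0; k < numTopics; k++) theta[k] += wt[k];
            seen++;
        }
        if (seen > 0)
            for (int k = 0; k < numTopics; k++) theta[k] /= seen;  // normalize
        return theta;
    }
}
```

Because each row of the .theta file already sums to one, the averaged vector is again a valid topic distribution for the original document.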
6. Contact
----------

If you have any problems, please feel free to contact Jichang Zhao (email@example.com).