wntm.tar.gz (64.21 kB)

Code of word network topic model

posted on 05.11.2017, 03:05 by Jichang Zhao

1. Introduction

The PrepareInput class converts the original documents into a word co-occurrence network, re-weights the network, and saves it as pseudo documents. The output of PrepareInput can be used directly as the input to jGibbsLDA.

The InferenceTopicsForOrgDocs class is used to infer the topics of the original documents after jGibbsLDA has been run.

* Note that jGibbsLDA is free software written by Xuan-Hieu Phan. More details can be found at http://jgibblda.sourceforge.net.
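
To illustrate the kind of conversion PrepareInput performs, here is a minimal sketch and not the actual PrepareInput source. It assumes that two words are linked when they fall inside the same sliding window of tokens, and that a word's pseudo document is its list of neighbours repeated by raw co-occurrence count. The class name WordNetworkSketch, the method names, and the window parameter are hypothetical, and the re-weighting step is omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, NOT the PrepareInput source: builds a word co-occurrence
// network with a sliding window and emits one pseudo document per word.
class WordNetworkSketch {

    // Returns word -> (neighbour -> co-occurrence count). Two words are linked
    // when they appear within `window` consecutive tokens of the same document.
    static Map<String, Map<String, Integer>> buildNetwork(List<String> docs, int window) {
        Map<String, Map<String, Integer>> net = new LinkedHashMap<>();
        for (String doc : docs) {
            String trimmed = doc.trim();
            if (trimmed.isEmpty()) continue;
            String[] tokens = trimmed.split("\\s+");
            for (int i = 0; i < tokens.length; i++) {
                int end = Math.min(tokens.length, i + window);
                for (int j = i + 1; j < end; j++) {
                    if (tokens[i].equals(tokens[j])) continue; // skip self-loops
                    addEdge(net, tokens[i], tokens[j]);
                    addEdge(net, tokens[j], tokens[i]);
                }
            }
        }
        return net;
    }

    // Increments the weight of the directed edge from -> to.
    static void addEdge(Map<String, Map<String, Integer>> net, String from, String to) {
        net.computeIfAbsent(from, k -> new LinkedHashMap<>()).merge(to, 1, Integer::sum);
    }

    // Pseudo document of a word: each neighbour repeated once per unit of edge weight.
    // Assumption: raw counts are used; any re-weighting is left out of this sketch.
    static String pseudoDocument(Map<String, Integer> neighbours) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : neighbours.entrySet()) {
            for (int k = 0; k < e.getValue(); k++) out.add(e.getKey());
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("apple banana cherry", "banana cherry date");
        Map<String, Map<String, Integer>> net = buildNetwork(docs, 2);
        for (Map.Entry<String, Map<String, Integer>> e : net.entrySet()) {
            System.out.println(e.getKey() + " -> " + pseudoDocument(e.getValue()));
        }
    }
}
```

With the two toy documents and a window of 2, the pseudo document for "banana" is "apple cherry cherry", because "banana" co-occurs once with "apple" and twice with "cherry".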

2. Installation

Straightforward Java compilation can be done with the following commands:

> tar -xzf wntm.tar.gz
> cd wntm
> javac *.java

3. Usage

> java PrepareInput    

Example usage:
> java PrepareInput sample.txt ./ sample 10

* Note that constructing the word network may require a lot of memory, especially when the original input file is large. For example, a 250MB input file needed 7GB of memory in our experiment. In this case, one can use the following command to set the maximum memory available to Java.

> java -Xmx10g PrepareInput sample.txt ./ sample 10
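
To check that the -Xmx setting has taken effect, one can ask the JVM for its maximum heap size. The small class below is a hypothetical helper, not part of the wntm package; it only uses the standard Runtime.maxMemory() call.

```java
// Hypothetical helper, not part of wntm: reports the maximum heap size
// the JVM will attempt to use, which reflects the -Xmx option.
class HeapCheck {

    // Maximum heap size in megabytes, as reported by the running JVM.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Running it with the same -Xmx value as PrepareInput (for example, java -Xmx10g HeapCheck) should report a limit close to the requested size.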

4. Output

When PrepareInput completes, it outputs two files. The file with the suffix ".word" stores all the nodes in the word network. The file with the suffix ".adjacent" stores the pseudo documents, ready to be used in jGibbsLDA.

5. Inference

WNTM models topics over each word's adjacent node list rather than over the original documents. Therefore, once jGibbsLDA has finished training on the pseudo documents, we need to infer the topic distribution of the original documents.

One can use the following command to do inference:
> java InferenceTopicsForOrgDocs <.words file> <.theta file output by jGibbsLDA>  
to get the topics of the original documents, i.e. the .theta file for the original corpus.
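
The idea behind this step can be sketched as follows. This is not the InferenceTopicsForOrgDocs source; it is a minimal illustration that assumes a document's topic distribution is the average of the per-word topic distributions learned from the pseudo documents, weighted by how often each word occurs in the document. The class name DocTopicSketch, the method name, and the in-memory map are hypothetical, and words missing from the network are simply skipped.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, NOT the InferenceTopicsForOrgDocs source.
class DocTopicSketch {

    // wordTheta maps a word to its topic distribution P(z | w), i.e. the theta
    // row of that word's pseudo document. tokens are the words of one original
    // document. Returns P(z | d) as the frequency-weighted average of P(z | w).
    static double[] inferDocument(String[] tokens, Map<String, double[]> wordTheta, int numTopics) {
        double[] docTheta = new double[numTopics];
        int known = 0;
        for (String token : tokens) {
            double[] theta = wordTheta.get(token);
            if (theta == null) continue; // assumption: ignore words not in the network
            known++;
            for (int z = 0; z < numTopics; z++) docTheta[z] += theta[z];
        }
        if (known > 0) {
            for (int z = 0; z < numTopics; z++) docTheta[z] /= known;
        }
        return docTheta;
    }

    public static void main(String[] args) {
        Map<String, double[]> wordTheta = new HashMap<>();
        wordTheta.put("a", new double[] {1.0, 0.0});
        wordTheta.put("b", new double[] {0.0, 1.0});
        double[] theta = inferDocument(new String[] {"a", "a", "a", "b"}, wordTheta, 2);
        System.out.println(theta[0] + " " + theta[1]);
    }
}
```

Repeated tokens contribute once per occurrence, so a word that appears more often in the document pulls the result further towards its own topic distribution.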

6. Contact

If you have any problems, please feel free to contact Jichang Zhao (jichang@buaa.edu.cn).