
LODCat Supplementary Material

modified on 2023-05-30, 15:46

This is the supplementary material for our submission "A Topic Model for the Data Web". It mainly comprises the results of the three experiments, organized in three directories, whose content we explain in the following.


The code of our approach LODCat can be found at https://anonymous.4open.science/r/lodcat-2CE3/


The 623927 real-world RDF datasets gathered from the LOD Laundromat project in January 2018 are available at https://figshare.com/s/af7f18a7f3307cc86bdd.


Input Data


- laundromat-datasets-per-namespace.csv: A histogram relating a number of datasets (1st column) to the count of namespaces that are used in exactly that number of datasets (2nd column).

- laundromat-triples-per-dataset.csv: This file contains the number of triples per input RDF dataset, sorted in descending order.

- wiki-corpus.tar.xz: The corpus comprising the Wikipedia subset used for our experiments. The files represent a corpus in the format used by the Gensim framework.
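
As an illustration of how the namespace histogram in the first file can be derived, here is a minimal sketch (the namespace-to-dataset mapping and all values are toy data invented for illustration, not taken from the released files):

```python
from collections import Counter

# Toy mapping: namespace -> set of datasets that use it (assumed structure).
namespace_datasets = {
    "http://xmlns.com/foaf/0.1/": {"ds1", "ds2", "ds3"},
    "http://purl.org/dc/terms/": {"ds1", "ds2"},
    "http://example.org/vocab#": {"ds3"},
    "http://www.w3.org/2004/02/skos/core#": {"ds2"},
}

# For each namespace, count in how many datasets it occurs; then count
# how many namespaces share each dataset count.
datasets_per_namespace = {ns: len(ds) for ns, ds in namespace_datasets.items()}
histogram = Counter(datasets_per_namespace.values())

# Rows analogous to laundromat-datasets-per-namespace.csv:
# (number of datasets, count of namespaces used in that many datasets)
for n_datasets, n_namespaces in sorted(histogram.items()):
    print(n_datasets, n_namespaces)
```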


Experiment 1


The first experiment comprises the application of topic model inference to the Wikipedia corpus. The main results of the evaluation are listed in the following:


- collected_C_P.tar.gz: This directory contains the C_P coherence values of the single topic models (one file per model).

- collected_C_V: This directory contains the C_V coherence values of the single topic models (one file per model).

- coherence_ranking_summary.csv: This file contains the ranking of the single topic models according to the two coherence measures. It shows that the best overall rank is achieved by the model 115_1_1643134253_f78DwtJ60i.csv, which holds the second best rank in both individual rankings.

- lodcat-bestmodel-CP.csv: This file contains the C_P coherence values of the chosen model sorted in descending order. The third and fourth columns contain the same values but split to separate good from bad topics (needed to generate the diagram in the PDF).

- lodcat-bestmodel-CV.csv: This file contains the C_V coherence values of the chosen model sorted in descending order. The third and fourth columns contain the same values but split to separate good from bad topics (needed to generate the diagram in the PDF).

- top_words.csv: This file contains the 10 top words per topic together with the probability of the word within the topic's distribution. The topic IDs of this file are used as identifiers in the other experiments.
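
One plausible way to aggregate the two coherence rankings into the overall rank reported in coherence_ranking_summary.csv is to sum the per-measure ranks (lower is better); the exact aggregation scheme, the model names, and the toy ranks below are assumptions for illustration:

```python
# Toy per-measure ranks for four hypothetical models (rank 1 = best);
# names and ranks are invented for illustration.
cp_rank = {"model_A": 1, "model_B": 2, "model_C": 3, "model_D": 4}
cv_rank = {"model_A": 4, "model_B": 2, "model_C": 3, "model_D": 1}

# Aggregate by summing the two ranks; the lowest sum wins overall.
overall = {m: cp_rank[m] + cv_rank[m] for m in cp_rank}
best = min(overall, key=overall.get)
print(best, overall[best])  # → model_B 4
```

Note how a model that is only second best in both individual rankings can still achieve the best overall rank, as reported above for the chosen model.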


Experiment 2


The second experiment focuses on the application of LODCat on the input RDF datasets based on the chosen topic model.


- lodcat-good-topic-dataset-count.csv: This file counts how often a good topic has been assigned as the most important topic of a dataset. The first column is the rank, the second the count, and the third the topic ID.

- lodcat-topic-dataset-count.csv: This file counts how often a topic (including bad topics) has been assigned as the most important topic of a dataset. The columns have the following meaning: rank, topic ID, quality (g=good, b=bad), the count of datasets; yg and yb repeat the count in case the topic is good or bad, respectively. These last two columns were used to generate a diagram and add no additional information.
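
The counting in the two files above can be sketched as follows (the topic distributions, topic IDs, and the set of "good" topics are toy values invented for illustration):

```python
from collections import Counter

# Toy topic distributions per dataset (topic ID -> probability).
dataset_topics = {
    "ds1": {7: 0.6, 12: 0.3, 40: 0.1},
    "ds2": {12: 0.5, 7: 0.4, 40: 0.1},
    "ds3": {7: 0.7, 40: 0.2, 12: 0.1},
}
good_topics = {7, 12}  # hypothetical set of topics judged "good"

# The most important topic of a dataset is the one with the highest probability.
top_topic = {ds: max(dist, key=dist.get) for ds, dist in dataset_topics.items()}
counts = Counter(top_topic.values())                              # all topics
good_counts = Counter(t for t in top_topic.values() if t in good_topics)

# Ranked rows analogous to lodcat-topic-dataset-count.csv: rank, topic ID, count
for rank, (topic, count) in enumerate(counts.most_common(), start=1):
    print(rank, topic, count)
```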

- topic-size-comparison.ods: A comparison of topic sizes based on the relative number of single tokens assigned to the topics within the two large corpora (Wikipedia vs. the RDF-dataset-based corpus). The comparison is based on the assumption that Wikipedia represents a "general" distribution of topics and that large differences between the amounts of tokens assigned to a topic point to an "over-" or "under-represented" topic. The token counts of the next two files have been used to create this comparison.

- topic-size-laundromat.csv: The number of tokens assigned to the single topics, based on sampling a topic for each token of the documents derived from the RDF datasets.

- topic-size-wikipedia.csv: The number of tokens assigned to the single topics, based on sampling a topic for each token of the single Wikipedia articles.
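
The comparison can be sketched by computing each topic's relative token share in both corpora and taking the ratio (the token counts below are toy values; the exact statistic used in the spreadsheet may differ):

```python
# Toy token counts per topic for the two corpora (invented values).
tokens_wikipedia = {0: 5000, 1: 3000, 2: 2000}
tokens_laundromat = {0: 1000, 1: 8000, 2: 1000}

def shares(counts):
    """Relative token share per topic within one corpus."""
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

wiki_share = shares(tokens_wikipedia)
lod_share = shares(tokens_laundromat)

# Ratio > 1: topic is over-represented in the RDF-dataset-based corpus
# relative to the "general" Wikipedia distribution; < 1: under-represented.
ratio = {t: lod_share[t] / wiki_share[t] for t in wiki_share}
for t, r in sorted(ratio.items()):
    print(t, round(r, 2))
```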


Experiment 3


- Survey_988459_LODCat_2022.pdf: The summary of the survey as PDF. The labels of the questions represent the ranking of the answers (1=1st topic, 2=2nd topic, 3=3rd topic, I=intruder topic; the labels were not visible to the survey participants).

- topic-log-odds.csv: This file comprises the data to calculate the topic log odds statistics. The columns have the following content: question ID; RDF dataset; count of answers for the 1st, 2nd, 3rd and intruder topic; sum of votes; probability of the 1st, 2nd, 3rd and intruder topic according to the topic model; difference of log probabilities between the intruder topic and the other 2 or 3 topics; weighted difference, i.e., the difference multiplied by the number of votes for this topic; the arithmetic average of the weighted differences, i.e., the topic log odds.
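
Based on the column description above, the per-question topic log odds can be computed roughly as follows (the vote counts and probabilities are toy values; the sign convention and the averaging over the total number of votes are assumptions):

```python
import math

# Toy data for one survey question (invented values).
votes = {"t1": 5, "t2": 2, "t3": 1, "intruder": 4}             # answer counts
prob = {"t1": 0.30, "t2": 0.20, "t3": 0.10, "intruder": 0.01}  # model probs

total_votes = sum(votes.values())

# Per topic: difference of log probabilities between the chosen topic and the
# intruder topic, weighted by the number of votes for that topic. A vote for
# the intruder itself contributes 0.
weighted = {
    t: votes[t] * (math.log(prob[t]) - math.log(prob["intruder"]))
    for t in votes
}

# Arithmetic average of the weighted differences over all votes
# = topic log odds for this question (higher = intruder easier to spot).
topic_log_odds = sum(weighted.values()) / total_votes
print(round(topic_log_odds, 3))
```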