Causality Model for Text Data with a Hierarchical Topic Structure

This study describes a method for constructing a causality model from text data, such as review data. Topic modeling is useful for finding evaluation factors in text data. A method based on hierarchical latent Dirichlet allocation is useful because it automatically constructs relationships among topics. However, the depth of each topic in the hierarchical structure is the same even if the contents differ for each topic. Accordingly, the method can generate less important topics that are not worth analyzing. To solve this problem, we construct a hierarchical topic structure with different depths and more important topics by using Bayesian rose trees. In the experiment, the values of the hyperparameters for constructing a hierarchical topic structure are estimated by using evaluation indexes for causal analysis. In addition, the experiment compares the proposed method with related approaches to demonstrate the usefulness of this model.


I. INTRODUCTION
In recent years, with the spread of smartphones, apps for various services (e.g., Twitter and navigation), hotel recommendation services, and internet shopping (e.g., Amazon) have been increasing rapidly. Many things can now be performed online. Evaluations of services and products can easily be posted, and the amount of evaluation information, such as user reviews and social media posts about products and services, has considerably increased. Many companies, hotels, and restaurants also post reviews and evaluations about themselves online. SNSs, such as blogs and microblogs, provide evaluations of services and products to other people. Such evaluation information is used not only by consumers but also by producers to improve their services and products. Therefore, analyzing the evaluations of services and products is important for improving them.
A user review, as evaluation information, includes text data containing user experience and perception. The evaluation structure of products and services can be understood by analyzing the text data of reviews. Then, the content that has a large effect on evaluation in the structure can be identified by causal analysis. Here, text mining is necessary to analyze text data. This mechanism can obtain valuable information from a vast amount of text data [1]. Some methods analyze text data on the basis of word co-occurrence [2]. Other methods analyze emotions [3] in text data through text mining. In addition, a topic model can extract the major themes from a group of text data.
Kunimoto et al. [4] proposed a model that predicts the purchase factors of games from text data by combining hierarchical latent Dirichlet allocation (hLDA) [5], a topic model, with structural equation modeling (SEM), which is used to conduct causal analysis. Their study succeeded in applying SEM to text data. Extracting topics by using hLDA can identify the evaluation factors for each analytical target. However, this previous study did not consider topic granularity. Topic granularity is the richness in content of a topic, that is, the frequency of the topic in documents; in other words, it reflects the importance of the topic. Topic granularity generally depends on the content of a topic and differs for each topic. However, the method that depends on hLDA does not consider topic granularity and constructs a structure with the same hierarchy regardless of the topic size. Consequently, small topics can be generated at the same level of the hierarchy as large topics. Fig. 1 shows the hierarchical topic structure of hLDA, in which the size of each circle represents the granularity of the topic. Smaller topics, such as "salmon", can be generated alongside topics at the same hierarchy level, such as "mammals", because hLDA constructs a structure with the same hierarchy. As a result, causal models can include unimportant and invaluable topics.
This research aims to solve this problem of existing studies that conduct causal analysis by using a hierarchical topic structure. To this end, LDA [6] and Bayesian rose trees (BRTs) [7] are used: LDA generates major topics with a high degree of granularity, and these topics are used to construct a hierarchical topic structure whose hierarchical relationships are generated by a bottom-up method. Therefore, the topic granularity of the bottom layers is higher than that of hLDA, which generates bottom topics by conditioning on the higher topics. The resulting structure corresponds to Fig. 1 with the topics "salmon" and "frog" deleted, so that the bottom topics are "fishes", "larva of amphibian", "birds", "mammals", and "reptiles".
Several important factors and useful evaluation structures could thereby be discovered to improve services and products. In this study, simBRT, a variant of BRTs, is used to construct a causal model, and a causal analysis is conducted by using SEM.
However, the hyperparameters for constructing a hierarchical topic structure were not discussed in the study of simBRT [8]. These hyperparameters are important factors because the structure of the topic hierarchy depends on them. We estimate the value of each hyperparameter by using indexes for causal analysis (SEM) instead of topic evaluation indexes because this study aims to conduct an accurate causal analysis in the experiment. The proposed approach is compared with causal analysis that uses hLDA and SEM to confirm that the simBRT-based method, which considers topic granularity, can construct better models than the hLDA-based method.
The remainder of this paper is organized as follows: Section 2 presents the existing related research. Section 3 explains the BRT method, which is the core technology. Section 4 describes the analytical experiments by using actual data. Section 5 concludes this work and discusses future studies.
The contributions of this study are as follows:
- This study constructs a causal model that considers topic granularity, with a different layer depth for each topic.
- This study estimates the values of the hyperparameters for constructing a hierarchical topic structure by using indexes for causal analysis.
- This study compares the proposed method with existing approaches in an experiment to confirm the feasibility of the proposed approach.

II. RELATED RESEARCH

A. Topic Models
Topic models are algorithms for discovering the major themes that pervade a large and otherwise unstructured collection of documents. Topic models can organize such a collection in accordance with the identified themes [9].
Topic models include methods such as latent semantic analysis (LSA) [10], LDA, and hLDA. LDA assumes a multitopic model in which each document is a mixture of topics; that is, LDA has a 1:n relationship between documents and topics rather than the 1:1 relationship of LSA. LDA is thus a natural model for documents, such as review texts, that discuss various aspects within one document. hLDA is an extension of LDA that can automatically construct hierarchical relationships among topics.
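As a toy illustration of the 1:n document-topic relationship described above, the sketch below runs a minimal collapsed Gibbs sampler for LDA on a few short documents. The documents and hyperparameter values here are hypothetical; the experiment itself uses the gensim library, and this pure-Python version only illustrates the idea that each document receives a mixture over topics.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Minimal collapsed Gibbs sampler for LDA (didactic sketch)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    # z[d][i]: topic assignment of the i-th word in document d
    z = [[rng.randrange(n_topics) for _ in d] for d in docs]
    ndk = [[0] * n_topics for _ in docs]                # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]   # topic-word counts
    nk = [0] * n_topics                                 # topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]; ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    # per-document topic mixtures: the 1:n document-topic relationship
    return [[(ndk[d][t] + alpha) / (len(doc) + n_topics * alpha)
             for t in range(n_topics)] for d, doc in enumerate(docs)]

docs = [["food", "taste", "menu"], ["staff", "service", "staff"],
        ["food", "menu", "service"]]
theta = lda_gibbs(docs, n_topics=2)
```

Each row of `theta` is a distribution over the two topics for one document, so a review discussing several aspects is not forced into a single topic.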

B. SEM
SEM [11] is a technology characterized by the combined use of factor analysis and regression analysis. Factor analysis assumes that the observed variables are driven by some hidden factors, whose influence is determined from correlations (variances/covariances). Regression analysis is a technique for finding the relationship between a variable to be predicted (the target variable) and the variables that explain it (explanatory or independent variables).
SEM can visually and quantitatively express causal relationships between variables by using a path model (Fig. 2). A path model consists of three elements: latent variables, observed variables, and paths. Latent variables are factors that cannot be observed directly. Observed variables can actually be observed and are essential for estimating a latent variable. In the path model, the latent variables are represented by ellipses, and the observed variables are represented by rectangles. The causal relationship between such items is represented by the path of an arrow, and the degree of influence is denoted by the path coefficient. Therefore, in causal analysis with SEM, the manner of constructing the path model must be determined.

C. Related Work
Several methods can be applied to construct a path model for SEM. SERVQUAL [12] is used to construct a path model. Al-Mhasnah et al. proposed a method that uses SERVQUAL and SEM to examine the effects of the former [13]. Ali et al. improved the SERVQUAL index and analyzed it through SEM [14]. Bivina et al. used the main aspects of pedestrian level of service (PLOS) [15] to provide a comfortable and safe walking environment. PLOS is a measurement tool for evaluating the degree of pedestrian accommodation on roadways. SEM is used to provide the essential information for interpreting the aspects of the walking environment that influence PLOS [16]. Many indicators are available; however, various services and products are difficult to measure with one standard because the types of services and products are many and their characteristics largely differ.
Meanwhile, another line of work uses a topic model to construct a path model. Saga et al. attempted to analyze the factor relationships of the game software market by using a topic model [17]. They proposed a path model generation process for SEM by using LSA and combined the text data of user reviews with the model. This method requires each document to belong to only one topic; consequently, the model cannot express natural variables and relationships. Saga et al. extended this method to LDA and generated a path model from the topics extracted by using LDA [18]. However, LSA and LDA cannot define the relationships among topics in the learned model. To solve this problem, Kunimoto et al. [4] proposed a model that predicts the purchase factors of games from text data by combining hLDA and SEM, as mentioned in Section 1. Ogawa et al. [19] also proposed a method based on hLDA. However, all topics have the same depth because these methods depend on hLDA. Therefore, the preceding studies do not consider topic granularity.
The pachinko allocation model (PAM) [20] and hierarchical PAM (hPAM) [21] are also available, in addition to hLDA. In PAM, documents are a mixture of distributions over a single set of topics, and a directed acyclic graph represents topic co-occurrences; each node in the graph is a Dirichlet distribution. hPAM is an extension of PAM in which every node is associated with a distribution over the vocabulary. However, the hierarchical topic structures of these methods also have the same hierarchy regardless of topic granularity. A method for extracting a hierarchical topic structure that combines the biterm topic model (BTM) [22] and BRTs is also available [8]. That method uses simBRT to consider topic similarity and conducts a time-series analysis on the basis of the constructed hierarchical structure. Several studies have analyzed time series on the basis of a hierarchical topic structure by using BRTs. However, no study has conducted causal analysis on the basis of a hierarchical topic structure using BRTs.
In this study, a causal analysis based on a model constructed using simBRT is conducted to consider topic granularity. For SEM, each document should contain many words so that it can be characterized statistically; therefore, LDA is used instead of BTM because the documents contain many words.

III. CONSTRUCTION MODEL USING BRTS
In this study, analysis is performed in accordance with the process shown in Fig. 3. First, a topic is extracted using the topic model. Then, the topic is represented in the hierarchical topic structure by using BRTs. Lastly, causal analysis is conducted via SEM on the basis of the model constructed using BRTs.

A. BRTs
A BRT is a probabilistic approach for hierarchical clustering and an extension of Bayesian hierarchical clustering [23]. A BRT greedily constructs a tree structure T on the basis of the probability f(D | T), which represents the likelihood of the data D given the tree T. In this study, the topics are used as the data D. All topics D = {x_1, x_2, ..., x_N} extracted using LDA are the leaves.

First, each topic x_i is regarded as an individual tree T_i = {x_i}. The BRT procedure is repeated to combine two selected trees T_i and T_j into a new tree T_m on the basis of three basic operations (join, absorb, and collapse). Fig. 4 shows the three basic operations. At each step, the pair and operation that maximize the likelihood ratio

  L(T_m) = f(D_m | T_m) / ( f(D_i | T_i) f(D_j | T_j) ),    (1)

are chosen, where D_m = D_i ∪ D_j are the topics under the tree structure T_m.

f(D_m | T_m) is the likelihood of the topics D_m under T_m and can be calculated using a dynamic programming paradigm as follows:

  f(D_m | T_m) = π_{T_m} f(D_m) + (1 − π_{T_m}) ∏_{T_c ∈ ch(T_m)} f(D_c | T_c),    (2)

where π_{T_m} is the prior probability that all the topics in D_m are maintained in the same partition, and π_{T_m} is defined as follows:

  π_{T_m} = 1 − (1 − γ)^{n_{T_m} − 1},    (3)

where n_{T_m} is the number of children of T_m, and γ (0 < γ < 1) is a hyperparameter of the model that controls the relative proportion of partitions of the data. The f(D_m) in (2) is the marginal probability of D_m, which can be modeled by the Dirichlet compound multinomial (DCM) distribution [24] and is defined as follows:

  f(D_m) = [ Γ(Σ_{v=1}^{V} α^{(v)}) / Γ(Σ_{v=1}^{V} (n^{(v)} + α^{(v)})) ] ∏_{v=1}^{V} [ Γ(n^{(v)} + α^{(v)}) / Γ(α^{(v)}) ],    (4)

where n^{(v)} is the frequency of keyword v included in the topics of D_m, V is the total size of the vocabulary, and α = (α^{(1)}, α^{(2)}, ..., α^{(V)}) is the hyperparameter that specifies the distribution over the probability simplex. Γ is the gamma function, Γ(z) = ∫_0^∞ t^{z−1} e^{−t} dt.

In addition, simBRT is used to consider topic similarity. Here, the topic distribution θ_i of each topic x_i is used as the data of a tree; therefore, topic similarity should be considered. simBRT measures the dissimilarity of two topics by the KL divergence KL(θ_i || θ_j) between the topic distributions θ_i and θ_j, and it defines a weighted topic distribution θ_w for each operation to reflect topic similarity in tree construction. Let θ(T_m) be the average of the topic distributions of the topics included in D_m; that is, θ(T_m) is the final merged topic distribution under T_m, calculated simply as the mean. The similarity between the weighted topic θ_w and the final merged topic θ(T_m) is then added into the primitive objective function in (1), which can be rewritten as follows:

  log L(T_m) = log f(D_m | T_m) − log f(D_i | T_i) − log f(D_j | T_j) − λ_sim · KL(θ_w || θ(T_m)),    (5)

where λ_sim is a weighting coefficient for the similarity term. The operation that maximizes (5) is conducted at each step to construct a hierarchical topic structure.
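The quantities described above, namely the partition prior, the DCM marginal probability, and the KL divergence used by simBRT, can be sketched in a few lines of Python. This is a didactic sketch with illustrative inputs, not the evaluation code used in the experiment.

```python
import math

def pi_T(n_children, gamma):
    """Prior probability that all topics under a node stay in one partition:
    pi = 1 - (1 - gamma)^(n_children - 1)."""
    return 1.0 - (1.0 - gamma) ** (n_children - 1)

def dcm_log_marginal(counts, alpha):
    """Log of the DCM marginal f(D_m); counts[v] is the frequency of
    keyword v in the merged topics, alpha[v] the Dirichlet hyperparameter."""
    a_sum = sum(alpha)
    n_sum = sum(counts)
    res = math.lgamma(a_sum) - math.lgamma(a_sum + n_sum)
    for n_v, a_v in zip(counts, alpha):
        res += math.lgamma(n_v + a_v) - math.lgamma(a_v)
    return res

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two topic distributions, as used by simBRT."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
```

For example, a node with two children has partition prior exactly γ, identical distributions have zero KL divergence, and the DCM log marginal is a (negative) log probability of the observed keyword counts.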

B. Construction of Path Model
The topics, which cannot be directly observed, are considered latent variables; this establishes the correspondence between SEM and a topic model. The keywords that comprise a topic are the observed variables because they are words that actually appear in reviews. A topic model is characterized by the generation of words by topics. Each topic is regarded as a factor, and paths are drawn from the topics to the keywords to which the topics are related.
Subsequently, the representation of a hierarchical topic structure is described. When two trees are merged, the resulting node is regarded as a large topic that includes the two merged topics; this node is treated as a topic and used as a latent variable. The paths between topics are drawn from the upper topics to the lower ones on the basis of the idea that large topics generate small topics. A path is also drawn from the top topic to the rating, i.e., the numerical evaluation of the review data, to capture the relation between the topic structure and the actual numeric data. Therefore, a dataset must have text data and a rating evaluation expressed by a numerical value to apply this method. In addition, the number of review documents and the length of the text data must not be particularly small so that LDA can be applied.
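The mapping just described can be sketched as code that turns a merged-topic tree into SEM model syntax: bottom topics load on their keywords, upper topics point to lower topics, and the top topic points to the rating. The tree, the topic names, and the lavaan-style syntax flavor are illustrative assumptions (the experiment itself uses R's sem package).

```python
# Hypothetical tree: internal nodes are merged topics (latent variables);
# a "keywords" list holds the observed variables of a bottom topic.
tree = {"name": "airport", "children": [
    {"name": "service", "children": [
        {"name": "flight", "keywords": ["delay", "gate", "boarding"]}]},
    {"name": "access", "keywords": ["car", "taxi", "train"]}]}

def tree_to_sem(node, lines=None):
    """Emit measurement lines (topic =~ keywords) and structural lines
    (lower topic ~ upper topic, since large topics generate small ones)."""
    if lines is None:
        lines = []
    if "keywords" in node:
        lines.append(node["name"] + " =~ " + " + ".join(node["keywords"]))
    for child in node.get("children", []):
        lines.append(child["name"] + " ~ " + node["name"])
        tree_to_sem(child, lines)
    return lines

model = tree_to_sem(tree)
model.append("rating ~ airport")  # path from the top topic to the numerical rating
```

The generated lines can then be handed to an SEM package for estimation of the path coefficients.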

IV. EXPERIMENT
This experiment aims to confirm the feasibility of the proposed method by constructing the model described in Section 3 and to compare the proposed approach with existing studies that used hLDA. In addition, this section considers the experimental results and discusses the hyperparameters.

A. Dataset, Indexes for Evaluation, and Hyperparameters
In this experiment, the data should ideally contain as many reviews as possible to apply the topic model. The text of each review must include many words to characterize the statistical data on the basis of the bag-of-words concept. The user reviews in datasets published online on Kaggle, GitHub, and Amazon are used. The reviews of airports, hotels, shop apps, electronic services for purchasing clothes (e-commerce), and musical instruments are selected. Each review has a review text with a rating between 1 and 5 or between 1 and 10, and a review text is regarded as a document. Only documents with more than 30 words are used to ensure that the topics and the appearance frequency of the feature words that describe them are included in each document. The app reviews are randomly sampled from the available data. The number of reviews after this preprocessing is provided in Table 1.
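The preprocessing described above can be expressed as a short filter; the review records here are hypothetical placeholders for the dataset rows.

```python
import random

reviews = [
    {"text": "word " * 40, "rating": 5},        # 40 words: kept
    {"text": "too short to use", "rating": 2},  # 4 words: dropped
]
# keep only documents expressed with more than 30 words
docs = [r for r in reviews if len(r["text"].split()) > 30]
# randomly sample the retained reviews for analysis, with a fixed seed
sample = random.Random(0).sample(docs, k=min(len(docs), 1000))
```

The word-count threshold ensures each document can contribute meaningful word frequencies to the topic model.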
The goodness-of-fit index (GFI), adjusted GFI (AGFI), and root mean square error of approximation (RMSEA) are adopted as indexes to evaluate the result. GFI indicates how well the total variance in the saturated model can be explained by the estimation model. Its value lies between 0 and 1, and a value close to 1 denotes a good model; a value of 0.9 or higher is desirable. However, GFI unconditionally improves as a model's degrees of freedom decrease. AGFI corrects this shortcoming of GFI by penalizing models with many parameters and high complexity. AGFI takes values in the same range as GFI, and a value close to 1 indicates a good resultant model; if the model is not complex, the values of GFI and AGFI are close to each other. RMSEA is an index that expresses the difference between the model distribution and the actual distribution: fit is good with a value of 0.05 or less and poor with a value of 0.1 or higher. The hyperparameters are γ, α, the number of topics, and the number of words that comprise a topic. In this experiment, the number of bottom topics is 10, and the number of words that comprise a topic is 5. We change hyperparameters γ and α to observe the influence of these parameters on the performance.
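For reference, the three fit indexes can be computed from a sample covariance matrix S and a model-implied covariance matrix Σ with the standard maximum-likelihood formulas. The sketch below hard-codes the 2x2 case for readability, and the matrices and degrees of freedom are illustrative, not values from this experiment.

```python
import math

def gfi_agfi_rmsea(S, Sigma, n_obs, df):
    """GFI, AGFI, and RMSEA for 2x2 covariance matrices (ML formulas)."""
    p = 2
    det = lambda M: M[0][0] * M[1][1] - M[0][1] * M[1][0]
    inv = lambda M: [[ M[1][1] / det(M), -M[0][1] / det(M)],
                     [-M[1][0] / det(M),  M[0][0] / det(M)]]
    matmul = lambda A, B: [[sum(A[i][k] * B[k][j] for k in range(2))
                            for j in range(2)] for i in range(2)]
    trace = lambda M: M[0][0] + M[1][1]
    W = matmul(inv(Sigma), S)                       # Sigma^{-1} S
    D = [[W[i][j] - (1.0 if i == j else 0.0) for j in range(2)] for i in range(2)]
    gfi = 1.0 - trace(matmul(D, D)) / trace(matmul(W, W))
    agfi = 1.0 - (p * (p + 1)) / (2.0 * df) * (1.0 - gfi)
    # ML discrepancy, chi-square statistic, and RMSEA
    f_ml = math.log(det(Sigma)) - math.log(det(S)) + trace(W) - p
    chi2 = (n_obs - 1) * f_ml
    rmsea = math.sqrt(max(chi2 / df - 1.0, 0.0) / (n_obs - 1))
    return gfi, agfi, rmsea

S = [[1.0, 0.3], [0.3, 1.0]]
gfi, agfi, rmsea = gfi_agfi_rmsea(S, S, n_obs=100, df=1)  # perfect fit case
```

When the implied covariance equals the sample covariance, GFI and AGFI are 1 and RMSEA is 0, matching the interpretation of the indexes given above.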
Several packages and libraries, namely, Python's gensim for LDA [25], the Mallet package for hLDA [26], and the sem package of R for SEM analysis [27], are used in this experiment.

B. Results
Fig. 5 shows the result of the evaluation indexes for each hyperparameter for the airport data and the analyzed models. We fix one of the parameters to 0.1 to observe the influence of the other and find that the evaluation indexes are best when γ = 0.2, 0.3, 0.4 and α = 0.5, 0.6. Table 2 shows the result of the evaluation indexes for these parameter combinations, and the best performance is obtained with γ = 0.2 and α = 0.6. Table 1 presents the calculation results of the evaluation indexes for each dataset and the analyzed models when γ is 0.2 and α is 0.6. In the table, all the models have a GFI and AGFI of over 0.9. Moreover, the models have RMSEA values of less than 0.05. The BRT results of all the models, except for the RMSEA of e-commerce and the GFI of airport, are better than the hLDA results. In the hLDA model, 10 topics are obtained from the bottom layer, and the keywords that comprise a topic at the upper layers are deleted to achieve the same situation as that in the BRT model.
For example, Fig. 6 shows the resulting model for the airport dataset. The causal relation between the keywords that comprise a topic is presented in a form similar to that in Fig. 7.
The words at the bottom of the model are those that make up the topics identified from the review text data by topic extraction with LDA. Here, the contents of the topics (latent variables) are estimated by the authors from the words that make up each topic. For example, "access" is estimated from different access features, namely, "car", "taxi", and "train". The causal relationships can be analyzed by paying attention to the arrows and the values calculated by SEM between topics or between topics and the words at the bottom of the model. This study focuses on the topics "airport service" and "country". Accordingly, we can see that these topics have topics in a deeper hierarchy, such as "flight" and "Asian country". Topics "flight" and "empathy" have topics in a still deeper hierarchy, such as "procedure". In this way, this method can construct a hierarchical structure with different hierarchies for each topic. Next, we focus on the paths from "airport" to identify important factors. "Airport condition" and "airport service" are important factors because they have large path coefficients. This study then focuses on the paths from these topics to identify more detailed important factors. Accordingly, the important factors that have a large effect on evaluation can be understood by focusing on the paths and path coefficients.
The airport structure can be reviewed by examining the results of the analysis of the airport data. Airports are evaluated using certain topics, such as "airport condition" and "airport service". This study focuses on the low hierarchy to analyze the details of the evaluation factors, such as "intern procedure" and "access".
V. CONCLUSION
In this study, a hierarchical topic structure was represented by BRTs on the basis of topics extracted from text data. A path model of SEM was also constructed on the basis of this hierarchical topic structure, and a causal analysis was conducted. In the experiment, the values of hyperparameters γ and α were estimated by using evaluation indexes for SEM. The value of the proposed method was demonstrated by the result of an experiment that used reviews of apps, hotels, airports, musical instruments, and e-commerce.
In existing causal analyses that used hLDA, all topics have the same layer depth, and the method cannot consider topic granularity. On this basis, a hierarchical topic structure was constructed using simBRT on the basis of the topics extracted by LDA. This method can construct a hierarchical topic structure with a different hierarchy for each topic by using the topics that have large granularity. In the experiment, several services and products were analyzed to confirm the feasibility of the proposed method, and the values of the hyperparameters were discussed. Satisfactory values were found for all the datasets on every index. The result of the developed method was compared with that of the analysis that used hLDA, and the former was found to be a better model. In addition, services and products can be visually and quantitatively evaluated using the proposed model (Fig. 6).
A topic is defined as a bag of words without explicit semantics. In this study, the contents of the topics were estimated using the words that compose them; however, this estimation loses objectivity. In future work, we can use topic labeling to address this issue.