Text mining and rating prediction with topical user models

2017-02-17T04:02:04Z (GMT) by Seroussi, Yanir
Recent years have seen an abundance of user-generated texts published online. Mining these texts for useful information is a growing research area with many aspects that are yet to be fully explored. Two such aspects, which are investigated in this thesis, are the extraction of implicit information about users to create user models, and the application of these models to tasks that require user information. Our main approach to extracting user information is via topical user models, which represent each author and document with low-dimensional distributions over topics (a topic is a distribution over words). We develop methods that utilise these topical user models to address the following tasks: (1) authorship attribution: identifying which user wrote a given anonymous text; (2) polarity inference: detecting the level of sentiment expressed in a given text; and (3) rating prediction: determining a given user's expected sentiment towards a given item.

The first task we consider is authorship attribution, where the goal is to identify the authors of anonymous texts. Authorship attribution is one of the most commonly attempted tasks in the authorship analysis field, which, in addition to authorship attribution, also deals with profiling authors by inferring demographic information and personality traits from their texts. Traditionally, research in this field has focused on formal texts, such as essays and novels, but recently more attention has been given to online user-generated texts, such as emails and blogs. Authorship attribution of online user-generated texts is a more challenging task than traditional authorship attribution, because such texts tend to be short and informal, and the number of candidate authors is often larger than in traditional settings. We address this challenge by employing topical user models.
In addition to exploring novel ways of applying two popular topic models to this task, we develop a new model that projects users and documents to two disjoint topic spaces. Employing our model in authorship attribution yields state-of-the-art performance on several datasets, which contain either formal texts or online user-generated texts, where the number of candidate authors ranges from three to about 20,000.

The second task we consider is polarity inference, where the goal is to infer the degree of positive or negative sentiment expressed in texts. Polarity inference is a key task in the sentiment analysis field, which deals with inferring people's sentiments and opinions from texts. Even though the way polarity is expressed often appears to depend on the author, most of the work in this field ignores authors. In this thesis, we introduce a framework that infers the polarity of texts by employing user-specific inference models, where the models can be weighted according to user similarity. We show that our framework outperforms two popular baselines, even when all the base models are given equal weights. In addition, we show that performance can be further improved by considering user similarity in terms of language use (e.g., as captured by topical user models) and rating patterns.

The third and final task we consider is rating prediction, where the goal is to predict the rating a given user would assign to a given item. Rating prediction is a core component of many recommender systems, which require a way to predict users' future sentiments in order to find and recommend items of personal interest. Recently, rating prediction algorithms that are based on matrix factorisation have become increasingly popular, mainly due to their high accuracy and scalability. However, such algorithms often deliver inaccurate rating predictions for users who submitted only a few ratings.
In this thesis, we introduce an extension to the basic matrix factorisation algorithm that considers information about the users when generating rating predictions. We show that employing either demographic information or text-based information (in the form of topical user models) outperforms baselines that consider only ratings, thereby enabling more accurate generation of personalised rating predictions for users who have not submitted many ratings. In the case of topical user models, these predictions are generated without requiring users to explicitly supply any information about themselves and their preferences.

Awards: Winner of the Mollie Holman Doctoral Medal for Excellence, Faculty of Information Technology, 2012.