Sentiment analysis under resource constraints

2017-02-16T05:14:20Z (GMT) by Andiyakkal Rajendran, Balamurali
Sentiment Analysis (SA) deals with detecting the sentiment of textual content from the speaker’s perspective. Both supervised and unsupervised approaches exist for this task. Previous studies show that supervised approaches perform better than unsupervised approaches. However, supervised approaches depend heavily on the availability of training data. We present two resource constraints with respect to training data for SA: one in the language of operation and the other in the domain of operation. In this thesis, we propose approaches that can alleviate the problems caused by these constraints. The majority of research on SA is in English. This has skewed resource development in favour of the popular language of the web. Two SA resources are (i) sentiment lexicons and (ii) annotated corpora. In this thesis, we address the problem of unavailability or inadequacy of annotated corpora. We present an approach that leverages data from languages which do have annotated data. Our approach uses wordnet senses (also known as synsets) and is based on the fact that semantics influences sentiment. We compared sense-based and lexeme-based features for sentiment analysis in a monolingual setting and found that sense-based features perform better than lexeme-based features. Also, as we move from the lexeme feature space to the sense feature space, dimensionality reduces. This dimensionality reduction additionally addresses the data sparsity problem. Under this approach, synsets in the test set that are not present in the training set are replaced with similar synsets from the training set, chosen using a wordnet similarity metric. This yields a significant improvement in classification accuracy. Sense identifiers for the same concept in different languages are identical if the wordnets are developed using the merge method. We leverage this fact to address the problem of unavailability or inadequacy of annotated corpora in a language.
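The synset replacement step can be illustrated with a minimal sketch. It is not the thesis implementation: the synset identifiers, the toy hypernym table `HYPERNYM`, the path-based similarity, and the `min_sim` cut-off are all hypothetical stand-ins for a real wordnet and whichever similarity metric is actually used.

```python
# Toy hypernym links (child -> parent). A stand-in for a real wordnet;
# every identifier here is hypothetical.
HYPERNYM = {
    "good.a.01": "positive.a.01",
    "great.a.01": "positive.a.01",
    "bad.a.01": "negative.a.01",
    "awful.a.01": "negative.a.01",
    "positive.a.01": "quality.n.01",
    "negative.a.01": "quality.n.01",
}


def path_to_root(synset):
    """Return the chain synset -> parent -> ... -> root in the toy taxonomy."""
    path = [synset]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def path_similarity(a, b):
    """1 / (1 + path length through the lowest common ancestor), else 0.0."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    depth_in_b = {s: i for i, s in enumerate(path_b)}
    for i, s in enumerate(path_a):
        if s in depth_in_b:
            return 1.0 / (1 + i + depth_in_b[s])
    return 0.0


def replace_unseen(test_doc, train_vocab, min_sim=0.3):
    """Map test synsets unseen in training to their most similar training synset.

    Synsets with no sufficiently similar training synset are left unchanged
    (an assumption of this sketch).
    """
    candidates = sorted(train_vocab)  # sorted for deterministic tie-breaking
    result = []
    for synset in test_doc:
        if synset in train_vocab:
            result.append(synset)
            continue
        best = max(candidates, key=lambda t: path_similarity(synset, t), default=None)
        if best is not None and path_similarity(synset, best) >= min_sim:
            result.append(best)
        else:
            result.append(synset)
    return result


train_vocab = {"good.a.01", "bad.a.01"}
test_doc = ["great.a.01", "bad.a.01", "awful.a.01"]
print(replace_unseen(test_doc, train_vocab))
```

Here `great.a.01` and `awful.a.01` never occur in training, so each is mapped to the training synset closest to it in the toy taxonomy before the classifier sees the document.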
A document in the test language (L_Test) is tested for polarity by a classifier trained on sense-marked and polarity-labeled corpora of the training language (L_Train). We perform our experiments on two widely spoken Indian languages, Hindi and Marathi. Results show that wordnet senses can bridge the language gap for SA. However, sense annotation is an additional task in a sentiment analysis system. Hence, to study the cost of annotation and its benefit to the end application, we introduce an economic model. Our model suggests that annotation is beneficial in terms of the performance achieved relative to the cost of developing the system. Existing approaches to reducing resource constraints in the language of operation depend on machine translation. However, we question the efficacy of these approaches, since machine translation is very resource intensive. To test this, we convert data in a resource-scarce language, RL_Test, to a resource-rich language, RL_Train, using various machine translation techniques. We perform our analysis on four European languages (English, French, German and Russian). Our study shows that such a strategy ignores the fact that a machine translation system is much more demanding in terms of resources than an SA engine. Moreover, these approaches fail to take into account the divergence in the expression of sentiment across languages. We provide strong experimental evidence that the performance of such systems comes nowhere close to that obtained by using only a few polarity-annotated documents in the target language. A drop in accuracy due to a shift in domain is a common problem for all NLP tasks, including sentiment analysis. To address resource constraints in the domain of operation, we present an approach for cross-domain sentiment analysis. The idea is to use a group of classifiers trained on the source domain to generate noisy tagged data for the target domain.
A small amount of hand-labeled target-domain data is then used to choose a confidence threshold for filtering out the noise. The remaining data, which is tagged with high confidence, is then used to train a high-accuracy sentiment tagger for the target domain. When the training domain is similar to the target domain, our system performs on par with or even better than a classifier trained on in-domain data. Thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy of the Indian Institute of Technology Bombay, India and Monash University, Australia.
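The cross-domain procedure (ensemble tagging, choosing a confidence threshold on a small hand-labeled set, then filtering) can be sketched roughly as follows. This is an illustration under stated assumptions, not the system evaluated in the thesis: the lexicon-based scorers stand in for real source-domain classifiers, and the function names, candidate thresholds, and `min_accuracy` target are all hypothetical. Training the final target-domain tagger on the filtered data is omitted.

```python
def make_lexicon_classifier(pos_words, neg_words):
    """Hypothetical stand-in for a source-domain classifier; returns P(positive)."""
    def predict_proba(tokens):
        pos = sum(t in pos_words for t in tokens)
        neg = sum(t in neg_words for t in tokens)
        return (pos + 1) / (pos + neg + 2)  # smoothed share of positive cues
    return predict_proba


def ensemble_tag(classifiers, tokens):
    """Average the classifiers' P(positive); return (label, confidence)."""
    p = sum(clf(tokens) for clf in classifiers) / len(classifiers)
    label = "pos" if p >= 0.5 else "neg"
    return label, max(p, 1 - p)


def choose_threshold(classifiers, labeled, candidates, min_accuracy=0.9):
    """Lowest candidate threshold at which the retained hand-labeled
    documents are tagged with at least min_accuracy; None if none qualifies."""
    for threshold in sorted(candidates):
        hits = []
        for tokens, gold in labeled:
            label, conf = ensemble_tag(classifiers, tokens)
            if conf >= threshold:
                hits.append(label == gold)
        if hits and sum(hits) / len(hits) >= min_accuracy:
            return threshold
    return None


def pseudo_label(classifiers, unlabeled, threshold):
    """Keep only target-domain documents tagged with confidence >= threshold."""
    kept = []
    for tokens in unlabeled:
        label, conf = ensemble_tag(classifiers, tokens)
        if conf >= threshold:
            kept.append((tokens, label))
    return kept


# Two toy "source-domain" classifiers (assumed, for illustration only).
classifiers = [
    make_lexicon_classifier({"great", "good"}, {"bad", "poor"}),
    make_lexicon_classifier({"great", "excellent"}, {"bad", "awful"}),
]

# Small hand-labeled target-domain sample used only to pick the threshold.
labeled_target = [
    (["great", "excellent", "battery"], "pos"),
    (["bad", "awful", "screen"], "neg"),
    (["good", "cheap"], "neg"),  # tagged wrongly, but with low confidence
]
unlabeled_target = [
    ["excellent", "great", "camera"],
    ["good", "lens"],
    ["awful", "bad", "zoom"],
]

threshold = choose_threshold(classifiers, labeled_target, [0.5, 0.6, 0.7, 0.8])
training_data = pseudo_label(classifiers, unlabeled_target, threshold)
print(threshold, training_data)
```

The documents returned by `pseudo_label` would then serve as training data for the target-domain tagger. Taking the lowest threshold that meets the accuracy target keeps as much pseudo-labeled data as possible, which is a design choice of this sketch.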