Mining online diaries for blogger identification
In this paper, we present an investigation of authorship identification on personal blogs or diaries, whose text properties differ from those of other types of text such as essays, emails, or articles. The investigation uses a couple of intuitive feature sets and studies various parameters that affect identification performance.
Many studies have addressed the problem of authorship identification in manually collected corpora, but only a few have used real data from existing blogs. The complexity of the language model in personal blogs motivates the task of identifying the corresponding author. The main contribution of this work is at least threefold. First, we use the LIWC and MRC feature sets together, both of which were developed with a psychology background, for the first time for authorship identification on personal blogs. Second, we analyze the effect of various parameters and feature sets on identification performance. These parameters include the number of authors in the data corpus, the post size (word count), and the number of posts per author. Finally, we study applying authorship identification over a limited set of users who share common personality attributes. This analysis is motivated by the lack of standard or solid recommendations in the literature for such a task, especially in the domain of personal blogs.
The results and evaluation show that the features used are compact, while their performance is highly comparable with that of other, larger feature sets. The analysis also confirmed the most effective parameters, their ranges in the data corpus, and the usefulness of the common-users classifier in improving performance on the author identification task.
School affiliated with
- School of Computer Science (Research Outputs)