More blogging features for author identification

conference contribution
posted on 2024-03-05, 11:09; authored by Haytham Mohtasseb and Amr Ahmed

In this paper we present a novel improvement in authorship identification for personal blogs. The improvement comes from a hybrid collection of linguistic features that best capture the style of users in diary blogs. The feature sets comprise LIWC, with its psychological background; a collection of syntactic and part-of-speech (POS) features; and misspelling-error features. Furthermore, we analyze the contribution of each feature set to the final result and compare the outcomes of using different combinations of the selected feature sets. Our new categorization of misspelled words, which are mapped into numerical features, noticeably enhances the classification results. The paper also confirms the best ranges of several parameters that affect the final result of authorship identification, such as the number of authors, the number of words in each post, and the number of documents/posts for each author/user. The results and evaluation show that the utilized features are compact, while their performance is highly comparable with that of other, much larger feature sets.

History

School affiliated with

  • School of Computer Science (Research Outputs)

Date Submitted

2009-04-30

Date Accepted

2009-12-25

Date of First Publication

2009-12-25

Date of Final Publication

2009-12-25

Event Name

The 2009 International Conference on Knowledge Discovery (ICKD'09)

Event Dates

2009

Date Document First Uploaded

2013-03-13

ePrints ID

1862