A supervised machine learning method to classify Dutch-language news items
Please contact s.a.m.vermeer@uva.nl for questions or further information.
Background
Based on a supervised machine learning method, we developed a classifier in Python (version 3.5.2) that returns the news topic of Dutch-language news items (as a string).
To train the classifier, we collected more than 1 million news items over 8 months in 2017/18 from approximately 150 different Dutch-language news websites, as well as from search engines and social media.
This tool maps Dutch-language news items into four news categories: (1) Politics, which covers internal politics, international politics, and military and defense; (2) Business, which covers economy, education, and health, welfare and social services; (3) Entertainment, which covers sports, culture, fashion and human interest; and (4) Other, which covers science and technology, environment, communication, weather, and religion and beliefs.
Performance
We used three different pre-processing variants, resulting in three different .pkl model files:
(1) All text: '...text_Dutch_news.pkl',
(2) Stop word removal: '...stopword_Dutch_news.pkl', and
(3) Lead: '...lead_Dutch_news.pkl'.
For every text category, the classifier reached an accuracy, precision and recall of at least .81.
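As a reminder of what these figures measure, per-category precision and recall can be computed from true and predicted labels as sketched below. This only illustrates the standard definitions; the function name `per_category_scores` and the toy labels are hypothetical and do not reproduce the evaluation behind the numbers reported here.

```python
def per_category_scores(y_true, y_pred):
    """Return {category: (precision, recall)} from two equal-length label lists."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    scores = {}
    for cat in sorted(set(y_true) | set(y_pred)):
        # Count true positives, false positives and false negatives for this category.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cat and p == cat)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cat and p == cat)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cat and p != cat)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[cat] = (precision, recall)
    return scores


# Toy labels, invented for illustration only.
y_true = ["Politics", "Politics", "Business", "Entertainment", "Other"]
y_pred = ["Politics", "Business", "Business", "Entertainment", "Other"]
scores = per_category_scores(y_true, y_pred)
```

In this toy example one Politics item is misclassified as Business, which lowers recall for Politics and precision for Business while leaving the other two categories unaffected.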
Usage
The classifiers were developed with Python 3.5.2 and scikit-learn 0.19.2 and can be used as follows:
-- from sklearn.externals import joblib  # joblib as bundled with scikit-learn 0.19.2
-- clf = joblib.load('PassiveAggressive_text_Dutch_news.pkl')
-- topic = clf.predict([text])  # text is a news item (a string)
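To classify several items at once, the two calls above can be wrapped in small helpers. This is a minimal sketch, not part of the released tool: the function names `load_classifier` and `classify_items` are hypothetical, and it assumes the pickled object accepts raw text strings, as the `clf.predict([text])` call suggests. Under scikit-learn 0.19.2, joblib is available as `sklearn.externals.joblib`; the fallback to the standalone `joblib` package is an assumption for newer installations, where a pickle created with 0.19.2 may not load.

```python
def load_classifier(path):
    """Load a pickled classifier from disk (hypothetical helper)."""
    try:
        # scikit-learn 0.19.2 bundles joblib under sklearn.externals.
        from sklearn.externals import joblib
    except ImportError:
        # Assumption: newer installations provide joblib as a separate package.
        import joblib
    return joblib.load(path)


def classify_items(clf, texts):
    """Return a list of (text, topic) pairs for an iterable of news items.

    clf is any object with a scikit-learn style predict() method.
    """
    texts = list(texts)
    if not texts:
        return []
    topics = clf.predict(texts)
    # Convert each predicted label to a plain str.
    return [(text, str(topic)) for text, topic in zip(texts, topics)]


# Example (requires one of the .pkl files described above):
# clf = load_classifier('PassiveAggressive_text_Dutch_news.pkl')
# for text, topic in classify_items(clf, news_items):
#     print(topic, text[:60])
```

Passing all items to `predict` in one call avoids reloading the model or looping over single-item predictions.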
Susan Vermeer
Damian Trilling
Sanne Kruikemeier
Claes de Vreese