A supervised machine learning method to classify Dutch-language news items

dataset
posted on 08.11.2018, 14:02 by Susan Vermeer

Please contact s.a.m.vermeer@uva.nl for questions or further information.


Background

Based on a supervised machine learning method, we developed a classifier in Python (version 3.5.2) that returns the news topic of Dutch-language news items (as a string).

To train the classifier, we collected more than 1 million news items over 8 months in 2017/18 from approximately 150 different Dutch-language news websites, as well as from search engines and social media.

This tool can be used to map Dutch-language news items into four news categories: (1) Politics, which covers internal politics, international politics, and military and defense; (2) Business, which covers economy, education, and health, welfare and social services; (3) Entertainment, which covers sports, culture, fashion and human interest; and (4) Other, which covers science and technology, environment, communication, weather, and religion and beliefs.
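The category scheme above can be written down as a small lookup table, which is handy when comparing the classifier's output with finer-grained annotations. This is only a sketch: the exact label strings returned by the classifier are not documented here, so the spellings below are assumptions, and `category_of` is a hypothetical helper.

```python
# Four news categories and the subtopics each one covers, as listed above.
# The exact strings are assumptions; the pickled classifier may spell them differently.
CATEGORY_SUBTOPICS = {
    "Politics": ["internal politics", "international politics", "military and defense"],
    "Business": ["economy", "education", "health, welfare and social services"],
    "Entertainment": ["sports", "culture", "fashion", "human interest"],
    "Other": ["science and technology", "environment", "communication",
              "weather", "religion and beliefs"],
}

# Reverse lookup: subtopic -> category.
SUBTOPIC_TO_CATEGORY = {
    subtopic: category
    for category, subtopics in CATEGORY_SUBTOPICS.items()
    for subtopic in subtopics
}


def category_of(subtopic):
    """Return the news category for a subtopic (hypothetical helper)."""
    return SUBTOPIC_TO_CATEGORY[subtopic.strip().lower()]
```

For example, `category_of("Weather")` looks up the subtopic case-insensitively and returns `"Other"`.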


Performance

We used three different pre-processing steps, resulting in three different .pkl model files:

(1) All text: '...text_Dutch_news.pkl',

(2) Stop word removal: '...stopword_Dutch_news.pkl', and

(3) Lead: '...lead_Dutch_news.pkl'.
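Each classifier presumably expects input prepared in the same way as its training data. The sketch below illustrates what the three variants could look like. It is not the authors' pre-processing code: the stop word list is a tiny hypothetical sample, and defining the lead as the first sentence is an assumption, since the description does not specify either.

```python
import re

# Hypothetical, deliberately tiny stop word list (assumption, not the authors' list).
DUTCH_STOPWORDS = {"de", "het", "een", "en", "van", "in", "op", "is", "dat", "te"}


def all_text(item):
    """Variant (1): use the full text of the news item."""
    return item.strip()


def remove_stopwords(item, stopwords=DUTCH_STOPWORDS):
    """Variant (2): drop tokens that are stop words (punctuation-insensitive match)."""
    kept = [
        token for token in item.split()
        if token.lower().strip(".,;:!?\"'()") not in stopwords
    ]
    return " ".join(kept)


def lead(item, n_sentences=1):
    """Variant (3): keep only the first n_sentences (assumed definition of 'lead')."""
    sentences = re.split(r"(?<=[.!?])\s+", item.strip())
    return " ".join(sentences[:n_sentences])
```

Whichever variant is used, the text passed to a classifier should match the variant that classifier was trained on.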

For every text category, the classifier reached an accuracy, precision and recall of at least .81.
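As a reminder of what these three metrics measure, the sketch below computes accuracy and per-category precision and recall from a list of true and predicted labels. The function names and any labels passed in are illustrative only; this is not the authors' evaluation script, and it does not reproduce the reported scores.

```python
def accuracy(y_true, y_pred):
    """Share of items whose predicted category equals the true category."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def per_category_scores(y_true, y_pred):
    """Precision and recall for each category that occurs in either list."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        scores[label] = {
            # Precision: of the items predicted as this category, how many were right.
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            # Recall: of the items truly in this category, how many were found.
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return scores
```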


Usage

The classifiers were developed in Python 3.5.2 with scikit-learn 0.19.2 and can be used as follows:

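-- from sklearn.externals import joblib  # joblib as bundled with scikit-learn 0.19.2

-- clf = joblib.load('PassiveAggressive_text_Dutch_news.pkl')  # load the "all text" classifier

-- topic = clf.predict([text])  # text is a news item (string); topic[0] is the predicted news topic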
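To classify several news items at once, the two calls above can be wrapped in small helpers. This is a minimal sketch: `load_classifier` and `classify_items` are hypothetical names, not part of the dataset, and the fallback to the standalone `joblib` package is an assumption for newer scikit-learn versions, where `sklearn.externals.joblib` no longer exists. A model pickled with scikit-learn 0.19.2 is not guaranteed to load under other versions.

```python
def load_classifier(path):
    """Load a pickled classifier from disk (hypothetical helper)."""
    try:
        # Import location in scikit-learn 0.19.x, the version used by the authors.
        from sklearn.externals import joblib
    except ImportError:
        # Assumption: fall back to the standalone joblib package on newer setups.
        import joblib
    return joblib.load(path)


def classify_items(clf, texts):
    """Return (text, topic) pairs for an iterable of news item strings."""
    texts = list(texts)          # materialize once so generators also work
    topics = clf.predict(texts)  # one predicted topic per news item
    return [(text, str(topic)) for text, topic in zip(texts, topics)]
```

With a loaded classifier, `classify_items(clf, [item_1, item_2])` returns one `(text, topic)` pair per news item, in input order.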


Susan Vermeer

Damian Trilling

Sanne Kruikemeier

Claes de Vreese
