A supervised machine learning method to classify Dutch-language news items
Please contact firstname.lastname@example.org for questions or further information.
Based on a supervised machine learning method, we developed a classifier in Python (version 3.5.2) that returns the news topic of Dutch-language news items (as a string).
To train the classifier, we collected more than 1 million news items over 8 months in 2017/18 from approximately 150 different Dutch-language news websites, as well as from search engines and social media.
This tool maps Dutch-language news items into four news categories: (1) Politics, which covers internal politics, international politics, and military and defense; (2) Business, which covers economy, education, and health, welfare and social services; (3) Entertainment, which covers sports, culture, fashion and human interest; and (4) Other, which covers science and technology, environment, communication, weather, and religion and beliefs.
We used three different pre-processing approaches, resulting in three different .pkl files:
(1) All text: '...text_Dutch_news.pkl',
(2) Stop word removal: '...stopword_Dutch_news.pkl', and
(3) Lead: '...lead_Dutch_news.pkl'.
For every text category, the classifier reached an accuracy, precision and recall of at least 0.81.
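As a reminder of what these three figures measure, the sketch below computes overall accuracy and per-category precision and recall from two lists of labels. It is a minimal illustration only: the function name per_category_scores and the example labels are hypothetical, and this is not the evaluation code or data used to obtain the reported scores.

```python
from collections import Counter


def per_category_scores(true_labels, predicted_labels):
    """Return ({category: (precision, recall)}, accuracy) for two label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this category
        else:
            fp[pred] += 1    # item wrongly assigned to the predicted category
            fn[truth] += 1   # item of the true category that was missed
    scores = {}
    for cat in set(true_labels) | set(predicted_labels):
        n_predicted = tp[cat] + fp[cat]
        n_actual = tp[cat] + fn[cat]
        precision = tp[cat] / n_predicted if n_predicted else 0.0
        recall = tp[cat] / n_actual if n_actual else 0.0
        scores[cat] = (precision, recall)
    accuracy = sum(tp.values()) / len(true_labels) if true_labels else 0.0
    return scores, accuracy


# Made-up labels, for illustration only.
example_true = ["Politics", "Politics", "Business", "Other"]
example_pred = ["Politics", "Business", "Business", "Other"]
example_scores, example_accuracy = per_category_scores(example_true, example_pred)
```

Precision is the share of items assigned to a category that really belong to it; recall is the share of items belonging to a category that the classifier found.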
The classifiers were developed with Python 3.5.2 and scikit-learn 0.19.2 and can be used as follows:
-- topic = clf.predict([text])  # text is a news item
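The call above presupposes that clf has already been loaded from one of the three .pkl files. A minimal sketch of the full round trip follows. It assumes the files were written with Python's standard pickle module (if they were saved with joblib instead, joblib.load would be needed), and the helper names load_classifier and predict_topic are hypothetical, not part of the released tool.

```python
import pickle


def load_classifier(path):
    """Load a pickled classifier from disk. Assumes a standard pickle file."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def predict_topic(clf, text):
    """Return the predicted topic of a single news item as a string."""
    # predict() takes a list of documents and returns one label per document,
    # so the single item is wrapped in a list and the first label is taken.
    return str(clf.predict([text])[0])


# Usage (the file name is left elided exactly as in the description above):
# clf = load_classifier('...text_Dutch_news.pkl')
# topic = predict_topic(clf, text)  # text is a news item
```

Pickled scikit-learn models are generally only reliable when loaded with the same scikit-learn version that produced them, so an environment with scikit-learn 0.19.2 is the safest choice. Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.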
Claes de Vreese