Twitter dataset
The Truth Seeker Dataset is designed to support research on detecting and classifying misinformation on social media platforms, with a particular focus on Twitter. It is part of a broader initiative to understand how machine learning (ML) and natural language processing (NLP) can be used to identify fake news and misleading content in real time.
Dataset Composition
The Truth Seeker Dataset comprises a substantial collection of social media posts that have been meticulously labeled as either real or fake. It was constructed using advanced ML algorithms and NLP techniques to analyze the language patterns in social media communications. The dataset includes:
- Raw Social Media Posts: A diverse range of tweets that reflect various topics and sentiments.
- Labeling: Each post is annotated with binary labels indicating its authenticity (real or fake).
- Feature Sets: Two distinct subsets of the dataset have been created using different NLP vectorization methods: Word2Vec and TF-IDF. This allows researchers to explore how different feature representations impact model performance.
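To make the TF-IDF feature representation mentioned above more concrete, the following is a minimal sketch in plain Python. It is not the pipeline used to build the dataset: the function name `tfidf_vectors`, the whitespace tokenization, the smoothed IDF formula, and the example posts are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for whitespace-tokenized documents.

    Hypothetical helper for illustration; not part of the dataset's tooling.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))

    # Smoothed inverse document frequency (an assumed variant), so that
    # a term present in every document still gets a nonzero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty posts
        # Term frequency (relative count) times IDF, one entry per vocab term.
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


# Made-up example posts, not taken from the dataset.
posts = [
    "breaking news the moon is made of cheese",
    "city council approves new budget",
    "breaking news city council meets today",
]
vocab, vectors = tfidf_vectors(posts)
```

Each post becomes one fixed-length vector with one entry per vocabulary term. Terms that occur in few posts (such as "cheese" here) receive a higher weight than terms shared across posts (such as "breaking"). Word2Vec, by contrast, learns dense embeddings from word co-occurrence and is not covered by this sketch.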
Research Applications
The primary aim of the Truth Seeker Dataset is to facilitate the development and validation of models that can accurately classify social media content. Key applications include:
- Fake News Detection: Training ML classifiers such as Random Forest and AdaBoost, which have demonstrated high F1 scores in preliminary evaluations.
- Model Comparison: Researchers can compare the effectiveness of different ML approaches on the same dataset, enabling a clearer understanding of which methods yield the best results in detecting misinformation.
- Algorithm Development: The dataset serves as a benchmark for developing new algorithms aimed at improving accuracy in fake news detection.
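Since the applications above rely on F1 scores to compare models, here is a small sketch of how such a comparison could be computed. The label encoding (1 = fake, 0 = real), the model names, and the predictions are all made up for illustration. They are not results from the dataset or from any actual Random Forest or AdaBoost run.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Assumed encoding: 1 = fake, 0 = real. All values below are hypothetical.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "model_a": [1, 0, 1, 0, 0, 0, 1, 0],  # misses one fake post
    "model_b": [1, 1, 1, 1, 0, 1, 1, 0],  # flags two real posts as fake
}

# Score every model against the same labels, then pick the highest F1.
scores = {name: f1_score(y_true, y_pred) for name, y_pred in predictions.items()}
best = max(scores, key=scores.get)
```

Evaluating every candidate against the same held-out labels is what makes the comparison meaningful. F1 is a common choice here because it penalizes both missed fake posts and real posts wrongly flagged as fake.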