
AH&AITD – Arslan’s Human and AI Text Database

Download (20.42 MB)
dataset
Posted on 2025-05-24, 19:01; authored by Arslan Akram
<p dir="ltr">AH&AITD is a comprehensive benchmark dataset designed to support the evaluation of AI-generated text detection tools. It contains <b>11,580 samples</b>, split evenly between <b>human-written</b> and <b>AI-generated</b> content across multiple domains. It was developed to address limitations of previous datasets, particularly in diversity, scale, and real-world applicability. Its purpose is to facilitate research on detecting AI-generated text by providing a diverse, multi-domain dataset that enables fair benchmarking of detection tools across writing styles and content categories.</p><h2><b>Composition</b></h2><h3><b>1. Human-Written Samples (Total: 5,790)</b></h3><p dir="ltr">Collected from:</p><ul><li><b>Open Web Text</b> (2,343 samples)</li><li><b>Blogs</b> (196 samples)</li><li><b>Web Text</b> (397 samples)</li><li><b>Q&A Platforms</b> (670 samples)</li><li><b>News Articles</b> (430 samples)</li><li><b>Opinion Statements</b> (1,549 samples)</li><li><b>Scientific Research Abstracts</b> (205 samples)</li></ul><h3><b>2. AI-Generated Samples (Total: 5,790)</b></h3><p dir="ltr">Generated using:</p><ul><li><b>ChatGPT</b> (1,130 samples)</li><li><b>GPT-4</b> (744 samples)</li><li><b>Paraphrase Models</b> (1,694 samples)</li><li><b>GPT-2</b> (328 samples)</li><li><b>GPT-3</b> (296 samples)</li><li><b>DaVinci (GPT-3.5 variant)</b> (433 samples)</li><li><b>GPT-3.5</b> (364 samples)</li><li><b>OPT-IML</b> (406 samples)</li><li><b>Flan-T5</b> (395 samples)</li></ul><p dir="ltr"><b>Citation:</b></p><p dir="ltr">Akram, A. (2023). <i>AH&AITD: Arslan’s Human and AI Text Database</i> [Dataset]. Associated with the article: <i>An Empirical Study of AI-Generated Text Detection Tools</i>. Advances in Machine Learning & Artificial Intelligence, 4(2), 44–55.</p>
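<p dir="ltr">The per-source counts in the Composition lists can be checked against the stated totals with a short script. The sketch below is illustrative only: the counts are copied from the lists above, while the names <code>HUMAN_COUNTS</code>, <code>AI_COUNTS</code>, and <code>source_shares</code> are hypothetical and do not describe the layout of the downloadable file.</p>

```python
# Per-source sample counts, copied from the Composition lists in this record.
# The dictionary and function names are illustrative assumptions, not part of
# the dataset or of any published loader.
HUMAN_COUNTS = {
    "Open Web Text": 2343,
    "Blogs": 196,
    "Web Text": 397,
    "Q&A Platforms": 670,
    "News Articles": 430,
    "Opinion Statements": 1549,
    "Scientific Research Abstracts": 205,
}

AI_COUNTS = {
    "ChatGPT": 1130,
    "GPT-4": 744,
    "Paraphrase Models": 1694,
    "GPT-2": 328,
    "GPT-3": 296,
    "DaVinci (GPT-3.5 variant)": 433,
    "GPT-3.5": 364,
    "OPT-IML": 406,
    "Flan-T5": 395,
}


def source_shares(counts):
    """Return each source's fraction of its class total."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


human_total = sum(HUMAN_COUNTS.values())
ai_total = sum(AI_COUNTS.values())
grand_total = human_total + ai_total

if __name__ == "__main__":
    print(f"human: {human_total}, AI: {ai_total}, all: {grand_total}")
    # List sources from largest to smallest share within each class.
    for label, counts in (("human", HUMAN_COUNTS), ("AI", AI_COUNTS)):
        shares = source_shares(counts)
        for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
            print(f"{label:5s} {name:32s} {share:6.1%}")
```

<p dir="ltr">The two classes are balanced at 5,790 samples each, but the sources within each class are not: Open Web Text makes up roughly 40% of the human-written samples and Paraphrase Models roughly 29% of the AI-generated ones. Per-source results may therefore be more informative than a single pooled score when benchmarking a detector.</p>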
