BenchmarkDataNLP.jl: Synthetic Data Generation for NLP Benchmarking

Version 3 2025-05-27, 18:12
Version 2 2025-05-22, 17:41
Version 1 2025-05-21, 14:12
software
posted on 2025-05-27, 18:12, authored by Alexander V. Mantzaris
BenchmarkDataNLP.jl (v1.0.2) is a Julia project that generates synthetic text datasets for natural language processing (NLP) experimentation. The generated text is built from characters in the Hangul (Korean) Unicode block. The project can also be used from other languages by calling Julia. Its primary goal is to let researchers and developers produce language-like corpora of varying size and complexity without first investing in large-scale real-world data collection or computationally expensive training runs.
The toolbox provides four generation algorithms, each supporting a complexity parameter: Context-Free Grammars (CFG), RDF/triple-store-based corpora, Finite State Machine (FSM) expansions, and template-based text generation. These let you quickly obtain controlled, structured text for model prototyping or debugging.
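To make the idea of a complexity-controlled grammar concrete, here is a minimal sketch in Python. It is not the BenchmarkDataNLP.jl API: the function names (`make_vocab`, `generate_corpus`), the toy grammar, and the way `complexity` maps to vocabulary size and recursion depth are all assumptions chosen for illustration. It only shows how a single parameter could scale a CFG-style generator that emits pseudo-words from the Hangul Syllables block (U+AC00 to U+D7A3).

```python
import random

# Hangul Syllables Unicode block (inclusive code point range).
HANGUL_START, HANGUL_END = 0xAC00, 0xD7A3


def make_vocab(rng, size, word_len=2):
    """Build `size` distinct pseudo-words from random Hangul syllables."""
    vocab = set()
    while len(vocab) < size:
        word = "".join(
            chr(rng.randint(HANGUL_START, HANGUL_END)) for _ in range(word_len)
        )
        vocab.add(word)
    return sorted(vocab)  # sorted so the result does not depend on set order


def generate_corpus(complexity, n_sentences, seed=0):
    """Generate sentences from a toy grammar:  S -> NP V NP,  NP -> N | N NP.

    Assumption for this sketch: `complexity` sets both the vocabulary size
    (5 * complexity nouns and as many verbs) and the maximum length of a
    noun phrase.
    """
    rng = random.Random(seed)
    vocab_size = 5 * complexity
    max_depth = complexity
    words = make_vocab(rng, 2 * vocab_size)
    nouns, verbs = words[:vocab_size], words[vocab_size:]

    def noun_phrase(depth):
        # Expand NP -> N NP with probability 0.5 until the depth bound is hit.
        if depth < max_depth and rng.random() < 0.5:
            return [rng.choice(nouns)] + noun_phrase(depth + 1)
        return [rng.choice(nouns)]

    def sentence():
        tokens = noun_phrase(1) + [rng.choice(verbs)] + noun_phrase(1)
        return " ".join(tokens)

    return [sentence() for _ in range(n_sentences)]


if __name__ == "__main__":
    for line in generate_corpus(complexity=3, n_sentences=5, seed=42):
        print(line)
```

In this sketch a higher `complexity` yields a larger vocabulary and longer noun phrases, and a fixed `seed` makes the corpus reproducible, which is the kind of control a benchmarking corpus needs.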

