Geocoding the Past World

software
posted on 2025-04-04, 21:02, authored by Yuqi Chen

Geocoding the Past World: Unearthing Coordinates of Early China from Texts Using Generative AI

Yuqi Chen, Wenyi Shang, Hongsu Wang, Sophia Zhang, Peter K. Bol

Extracting geographic information from historical texts presents unique challenges. To address them, this study uses generative large language models to extract historical toponyms and their corresponding location references from texts. A historical geocoder then identifies the coordinates of the extracted toponyms and calculates their maximum error distances from the location references, which indicate the degree of uncertainty. Both the extraction and geocoding processes are integrated into a novel tool named ‘His-Geo’. To evaluate the results, this study also curates a manually annotated dataset, the Early China Historical Geographic Corpus (CHGC-Early). The corpus addresses the absence of geographic data for early China in existing gazetteers and provides a benchmark dataset for training and evaluating approaches to geographic information extraction from premodern Chinese texts. In the evaluation, the GPT-4o model achieves an F1 score of 0.831, demonstrating the capability of generative large language models to extract geographic information from lengthy, unstructured texts that encompass diverse and sometimes conflicting views.
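
To make the geocoding idea concrete, the following minimal sketch shows one way a location reference of the form "N li in direction D from a known place" could be turned into coordinates with a maximum error distance. It is not the His-Geo implementation. The function names, the eight-direction table, the conversion of roughly 0.4158 km per li, the 22.5-degree half-angle for direction uncertainty, and the 20% distance tolerance are all assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0

# Assumed mapping from compass words to bearings in degrees (illustrative only).
BEARINGS = {
    "north": 0.0, "northeast": 45.0, "east": 90.0, "southeast": 135.0,
    "south": 180.0, "southwest": 225.0, "west": 270.0, "northwest": 315.0,
}


def offset_point(lat, lon, bearing_deg, distance_km):
    """Destination point on a sphere, given a start, a bearing, and a distance."""
    phi1, lam1 = math.radians(lat), math.radians(lon)
    theta = math.radians(bearing_deg)
    delta = distance_km / EARTH_RADIUS_KM
    phi2 = math.asin(
        math.sin(phi1) * math.cos(delta)
        + math.cos(phi1) * math.sin(delta) * math.cos(theta)
    )
    lam2 = lam1 + math.atan2(
        math.sin(theta) * math.sin(delta) * math.cos(phi1),
        math.cos(delta) - math.sin(phi1) * math.sin(phi2),
    )
    return math.degrees(phi2), math.degrees(lam2)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def geocode_reference(anchor_lat, anchor_lon, direction, distance_li,
                      km_per_li=0.4158, half_angle_deg=22.5,
                      distance_tolerance=0.2):
    """Estimate a point and a maximum error distance from a location reference.

    The estimate lies at the stated distance along the stated bearing. The
    error is approximated as the distance from that estimate to the farthest
    corner of the sector implied by the assumed direction and distance
    tolerances. All default values are hypothetical.
    """
    distance_km = distance_li * km_per_li
    bearing = BEARINGS[direction]
    lat, lon = offset_point(anchor_lat, anchor_lon, bearing, distance_km)

    near_km = distance_km * (1 - distance_tolerance)
    far_km = distance_km * (1 + distance_tolerance)
    corners = [
        offset_point(anchor_lat, anchor_lon, bearing + sign * half_angle_deg, d)
        for sign in (-1, 1)
        for d in (near_km, far_km)
    ]
    max_error_km = max(haversine_km(lat, lon, c_lat, c_lon)
                       for c_lat, c_lon in corners)
    return lat, lon, max_error_km
```

Under these assumptions, calling geocode_reference(34.0, 109.0, "north", 100) places the toponym about 41.6 km north of the anchor point. A vaguer reference (a wider half-angle or a larger distance tolerance) yields a larger maximum error distance, which matches the role the abstract assigns to that value as an indicator of uncertainty.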



Funding

The work described in this paper was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project Reference Number: AoE/B-704/22-R).
