Page-Level Genre Metadata for English-Language Volumes in HathiTrust, 1700-1922

dataset

posted on 2014-12-29, 14:36 authored by Ted Underwood

Page-by-page genre predictions for 854,476 English-language volumes printed between 1700 and 1922, keyed to the texts in HathiTrust Digital Library. This research was supported by the National Endowment for the Humanities and the American Council of Learned Societies.

The genre predictions were produced by an ensemble of regularized logistic classifiers, and are intended to support research that explores broad trends in literary history. Since volumes usually contain multiple genres, page-level metadata is necessary to create machine-readable collections in a particular genre.

Only very broad categories are discriminated here (fiction, poetry, drama, nonfiction prose, paratext). Overall average accuracy is 93.6%, but confidence metrics are included that allow researchers to trade recall for enhanced precision. For instance, the filtered subsets of fiction, poetry, and drama (included here as fiction.tar.gz, etc.) have higher than 97% precision.
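The trade-off between recall and precision can be illustrated with a small sketch. It is not the project's own filtering code: the function name, the (confidence, is_correct) record layout, and the toy numbers are all invented for illustration, and recall here is measured only relative to the items the classifier already assigned to the genre.

```python
def precision_recall_at(records, threshold):
    """Precision and relative recall after keeping records at or above a threshold.

    ASSUMPTION: `records` is a list of (confidence, is_correct) pairs for items
    the classifier assigned to one genre. This layout is hypothetical and is
    not the format of the published JSON files.
    """
    kept = [correct for confidence, correct in records if confidence >= threshold]
    total_correct = sum(1 for _, correct in records if correct)
    kept_correct = sum(1 for correct in kept if correct)
    # Precision is undefined when nothing survives the filter.
    precision = kept_correct / len(kept) if kept else None
    # Recall is undefined when there are no correct items at all.
    recall = kept_correct / total_correct if total_correct else None
    return precision, recall


# Invented toy data: six items predicted as "fiction", four of them correct.
toy_records = [
    (0.99, True),
    (0.95, True),
    (0.90, False),
    (0.80, True),
    (0.60, False),
    (0.55, True),
]

loose = precision_recall_at(toy_records, 0.50)   # keeps everything
strict = precision_recall_at(toy_records, 0.92)  # keeps only the top two
```

Raising the threshold discards the two incorrect items in this toy set, but it also discards two correct ones, which is the cost a researcher accepts in exchange for a cleaner subset.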

Predictions are included as JSON objects in separate files, one for each volume. The tar.gz files prefixed with "all" include all 854,476 volumes, divided by date. The tar.gz files named for genres contain subsets of volumes that have been filtered to achieve greater than 97% precision in that particular genre. Specifically, they include 18,111 volumes containing drama, 102,349 volumes containing fiction, and 61,286 volumes containing poetry. These datasets were filtered both by confidence metrics created by a logistic model and by manual editing. Ringers.csv is a list of volumes that we had to remove manually; scholars who select their own datasets from the larger collection (the files beginning with "all") may also want to consider filtering out these tricky cases.
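As a minimal sketch of how the per-volume files might be read, the Python below streams JSON objects out of one tar.gz archive and optionally skips volumes listed in a CSV such as Ringers.csv. Several details are assumptions rather than facts about the dataset: that each member file name ends in ".json", that the name without that suffix is the volume ID, and that the CSV has an ID column called "docid". The function names are also hypothetical. Check the interim project report for the actual layout before relying on any of this.

```python
import csv
import json
import tarfile
from pathlib import PurePosixPath


def load_excluded_ids(csv_path, id_column="docid"):
    """Read volume IDs to skip from a CSV file.

    ASSUMPTION: the ID column is named "docid"; inspect the real header
    of Ringers.csv and pass the correct name if it differs.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return {row[id_column] for row in csv.DictReader(handle)}


def iter_volumes(tar_path, excluded_ids=frozenset()):
    """Yield (volume_id, parsed_json) for each JSON member of a tar.gz file.

    ASSUMPTION: each member is named "<volume_id>.json". No assumption is
    made about the keys inside the JSON object; it is returned as parsed.
    """
    with tarfile.open(tar_path, mode="r:gz") as archive:
        for member in archive:
            # Skip directories and anything that is not a JSON file.
            if not member.isfile() or not member.name.endswith(".json"):
                continue
            file_name = PurePosixPath(member.name).name
            volume_id = file_name[: -len(".json")]
            if volume_id in excluded_ids:
                continue
            with archive.extractfile(member) as handle:
                yield volume_id, json.load(handle)
```

Because the archives are large, the generator reads one member at a time rather than extracting everything to disk first. A caller could pass the set returned by `load_excluded_ids` as `excluded_ids` when working from the archives prefixed with "all".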

Accompanying meta.csv files provide summary volume-level metadata for each collection. For full details of methods and data format, see the interim project report (http://dx.doi.org/10.6084/m9.figshare.1281251). For software and training data used in the project, see the repository (https://github.com/tedunderwood/genre).