Understanding Genre in a Collection of a Million Volumes, Interim Report
One of the main problems confronting distant reading is the scarcity of metadata about genre in large digital collections. Volume-level information is often missing, and volume-level labels are in any case insufficient to guide machine reading, since poems and plays (for instance) are often mixed in a single volume, preceded by a prose introduction and followed by an index.
Our goal in this project was to show how literary scholars can use machine learning to select genre-specific collections from digital libraries. We've started by separating five broad categories that interest literary scholars: prose fiction, poetry (narrative and lyric), drama (including verse drama), prose nonfiction, and various forms of paratext.
This report discusses assumptions about the nature of genre that underpin our approach, describes methods, and explains how to use the page-level map of genre we have generated.
That map itself is also available through figshare, at http://dx.doi.org/10.6084/m9.figshare.1279201.
This research was supported by the National Endowment for the Humanities and the American Council of Learned Societies. Any views, findings, conclusions, or recommendations expressed in this release do not necessarily represent those of the funding agencies.