HTRC Extracted Features [v.0.2] `langid`

dataset

posted on 2016-05-16, 20:46, authored by Ryan Baumann

This archive contains the output of running `langid` on the OCR tokens of each page in every volume of the HathiTrust HTRC Extracted Features [v.0.2] dataset. Each file is a CSV of ISO 639-1 language code and probability pairs for each page, and is named `[HTRC-volume-identifier].basic.json.csv`. Processing used version 1.1.5 of `langid`.
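As a rough illustration of how one of these files might be consumed, the sketch below recovers the identifier portion of a filename and reads the per-page pairs. The column layout is an assumption not stated in this record: the code assumes one row per page, with the language code in the first column, the probability in the second, and no header row. The function names and the sample identifier `example.0001` are hypothetical.

```python
import csv
import io
from collections import Counter

# Filename suffix as documented in the description above.
SUFFIX = ".basic.json.csv"


def volume_id_from_filename(filename):
    """Return the identifier part of a filename, as it appears in the name."""
    if not filename.endswith(SUFFIX):
        raise ValueError(f"unexpected filename: {filename!r}")
    return filename[: -len(SUFFIX)]


def read_page_languages(lines):
    """Parse (language_code, probability) pairs from an iterable of CSV lines.

    ASSUMPTION: one row per page, two columns (code, probability), no header.
    Rows with fewer than two columns are skipped.
    """
    pages = []
    for row in csv.reader(lines):
        if len(row) < 2:
            continue
        pages.append((row[0].strip(), float(row[1])))
    return pages


def dominant_language(pages):
    """Return the language code assigned to the most pages, or None if empty."""
    counts = Counter(code for code, _ in pages)
    return counts.most_common(1)[0][0] if counts else None


if __name__ == "__main__":
    # Made-up sample content standing in for a real file.
    sample = io.StringIO("en,0.99\nen,0.87\nfr,0.65\n")
    pages = read_page_languages(sample)
    print(volume_id_from_filename("example.0001" + SUFFIX), dominant_language(pages))
```

For a real file, an open file handle (opened with `newline=""`) can be passed in place of the `io.StringIO` sample. If the actual layout differs from the assumed one, only `read_page_languages` needs to change.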

Warning: this archive decompresses to a directory of about 25 GB containing 4,805,434 files.
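With several million files, building a full directory listing in memory can be slow. One possible approach, sketched below, is to walk the extracted directory lazily with `os.scandir` and yield matching paths one at a time. Whether the archive extracts to a flat or nested directory is not stated here, so the sketch handles both; the function name `iter_csv_paths` is hypothetical.

```python
import os


def iter_csv_paths(root):
    """Lazily yield paths of files ending in .basic.json.csv under root.

    Uses an explicit stack instead of recursion and never builds a full
    listing, so memory use stays small even for very large directories.
    """
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.name.endswith(".basic.json.csv"):
                    yield entry.path
```

Because this is a generator, a loop such as `for path in iter_csv_paths(extracted_dir): ...` can start processing before the whole tree has been scanned.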

Keywords

langid, htrc, hathitrust, Digital Humanities, Linguistics, Literature, Language, Natural Language Processing

Licence

CC BY 4.0
