HTRC Extracted Features [v.0.2] `langid`
Dataset posted on 16.05.2016 by Ryan Baumann
This archive contains the output of running `langid` on the OCR tokens of each page in every volume of the HathiTrust HTRC Extracted Features [v.0.2] dataset. Each file is a CSV of ISO 639-1 language code and probability pairs for each page, named `[HTRC-volume-identifier].basic.json.csv`. Version 1.1.5 of `langid` was used for processing.
Warning: this archive will decompress to a ~25 GB directory containing 4,805,434 files.
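As an illustration of how one of the per-volume CSV files might be consumed, here is a minimal Python sketch. It rests on assumptions not confirmed by the description above: that each row holds one page's result, that the language code is the first column and the probability the second, and that rows which do not parse (such as a header line, if one exists) can be skipped. The function names `read_page_languages` and `dominant_language` are hypothetical helpers, not part of `langid` or of this archive.

```python
import csv
import io
from collections import Counter
from typing import Iterable, Iterator, Optional, Tuple


def read_page_languages(lines: Iterable[str]) -> Iterator[Tuple[str, float]]:
    """Yield (language_code, probability) for each parseable row.

    Assumption: two columns per row, code first, probability second.
    Rows with fewer than two columns or a non-numeric second column
    (for example a header line) are skipped.
    """
    for row in csv.reader(lines):
        if len(row) < 2:
            continue
        try:
            probability = float(row[1])
        except ValueError:
            continue
        yield row[0].strip(), probability


def dominant_language(pages: Iterable[Tuple[str, float]]) -> Optional[str]:
    """Return the language code assigned to the most pages, or None if empty."""
    counts = Counter(code for code, _ in pages)
    return counts.most_common(1)[0][0] if counts else None


if __name__ == "__main__":
    # Stand-in for the contents of one [HTRC-volume-identifier].basic.json.csv file.
    sample = io.StringIO("en,0.99\nen,0.87\nla,0.61\n")
    pages = list(read_page_languages(sample))
    print(dominant_language(pages))
```

For a real file, an open file handle (opened with `newline=""`, as the `csv` module recommends) can be passed in place of the `StringIO` object. Given the number of files in the archive, processing them one at a time rather than loading everything into memory is likely the safer choice.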