Corpus of German-Language Fiction.zip (408.81 MB)

Corpus of German-Language Fiction (txt)

dataset
posted on 2017-01-06, 14:44 authored by Frank Fischer, Jannik Strötgen
Summary

Contains 2,735 German-language prose works (mainly novels and short stories) by 549 authors, spanning from ca. 1510 to the 1940s (the bulk of the texts dates from 1840–1930). This amounts to 937.8 MB of uncompressed literary data. Texts were extracted from the Gutenberg-DE Edition 13 DVD-ROM (released in November 2013) and converted from HTML to TXT. This corpus is meant for research purposes only, primarily for reproducing our results. There is no metadata other than author's name, title, and year. All this information is included in the filenames of the 2,735 files, as in this example: "Georg_Büchner_-_Lenz_(1837).txt".

In addition to the corpus of German originals, the second folder contains 484 texts from 18 foreign-language authors (translated into German from English, French, and Russian). Filenames in this folder don't contain information on the year of publication. This adds up to another 176.2 MB of uncompressed textual data.
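Since the filenames are the only metadata, they usually have to be parsed before any analysis. The following is a minimal sketch, assuming that every file follows the pattern of the example above ("Author_-_Title_(Year).txt") and that files in the translation folder use the same pattern without the year part. The names `WorkInfo` and `parse_filename` are our own illustration and not part of the corpus.

```python
import re
from typing import NamedTuple, Optional


class WorkInfo(NamedTuple):
    author: str
    title: str
    year: Optional[int]  # None when the filename carries no year


# Assumed pattern: Author_-_Title_(Year).txt, with the "_(Year)" part optional.
_FILENAME = re.compile(
    r"^(?P<author>.+?)_-_(?P<title>.+?)(?:_\((?P<year>\d{4})\))?\.txt$"
)


def parse_filename(name: str) -> Optional[WorkInfo]:
    """Split a corpus filename into author, title and year.

    Returns None if the name does not match the assumed pattern.
    """
    match = _FILENAME.match(name)
    if match is None:
        return None
    year = int(match.group("year")) if match.group("year") else None
    return WorkInfo(
        author=match.group("author").replace("_", " "),
        title=match.group("title").replace("_", " "),
        year=year,
    )


print(parse_filename("Georg_Büchner_-_Lenz_(1837).txt"))
```

Filenames that deviate from the assumed pattern yield `None`, so they can be collected and checked by hand rather than silently misparsed.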

Instructions

Download the ZIP and unpack.
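For scripted setups, unpacking can also be done from Python's standard library. This is only a sketch: the function name `unpack_corpus` and both paths are placeholders chosen for illustration, and the folder layout inside the archive is not assumed, which is why the text files are searched recursively.

```python
import zipfile
from pathlib import Path


def unpack_corpus(zip_path, target_dir):
    """Extract the archive and return all .txt files found below target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    # Search recursively, since the archive may contain several folders.
    return sorted(target.rglob("*.txt"))
```

A call such as `unpack_corpus("Corpus of German-Language Fiction.zip", "corpus")` would then return the list of extracted text files.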

Some Notes on the Derivation Procedure

The Gutenberg-DE DVD (see http://gutenberg.spiegel.de/) in its original state contains all sorts of German-language texts whose copyright has expired, from all sorts of genres, including translations. Our goal was to derive a subcorpus of German-language fiction for a project on temporal expressions in literary texts (project page: http://dbs.ifi.uni-heidelberg.de/index.php?id=jahrestage).

First, we used existing metadata and extracted all texts marked with content="{fiction,narrative,novelette}" in the meta tags. We then eliminated all translations (i.e., all texts with translators mentioned in meta tags). After that, we weren't even halfway there and had to pull out the hitherto undetected fictional texts with other methods, or just manually.
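The first two filtering steps could be sketched roughly as follows. This is not our original script: the meta tag names `type` and `translator` are assumptions about the Gutenberg-DE HTML (only the content values fiction, narrative, and novelette are taken from the description above), so both names are exposed as parameters.

```python
from html.parser import HTMLParser

# Content values named in the description above.
FICTION_TYPES = {"fiction", "narrative", "novelette"}


class MetaCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content.strip()


def is_untranslated_fiction(html, type_key="type", translator_key="translator"):
    """True if the meta tags mark the text as fiction and name no translator.

    type_key and translator_key are assumed meta tag names, not confirmed ones.
    """
    collector = MetaCollector()
    collector.feed(html)
    genre = collector.meta.get(type_key, "").lower()
    has_translator = bool(collector.meta.get(translator_key))
    return genre in FICTION_TYPES and not has_translator
```

As noted above, a filter of this kind only recovered part of the fictional texts; the rest had to be identified by other means.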

For our purposes, we just needed running text (from which we wanted to extract temporal expressions using HeidelTime) and basic metadata like the year of publication or writing to measure changes over time. Speaking of which, years of publication were either taken from the original Gutenberg-DE metadata (which might be wrong in some cases, and might therefore still be wrong in our filenames) or were added manually when missing. In each case, we used exactly one year. E.g., the aforementioned narrative "Lenz" by Georg Büchner was only published in 1839, but we opted for 1837 (the year of Büchner's death). For our purposes, we just needed a rough indication of when a text was written or published.

Please be aware of these limitations when you're taking the corpus out for your own test runs. Some other known issues are mentioned below. Since our research was finished some time ago, we don't feel like correcting or enriching our working corpus. We know that other groups are working on a much better method to derive texts from Gutenberg-DE, even preserving some of the markup and enriching the metadata.

Additional Subcorpus Containing Foreign-Language Authors in German Translation

We didn't really find a suitable corpus of foreign-language fiction to hold our findings against. We checked some candidates, like Project Gutenberg (gutenberg.org) and Cervantes Virtual, but cutting out a subcorpus of strictly fictional texts would have caused similar issues. What we did do is toy around with a subcorpus of 484 selected texts from 18 foreign-language authors translated into German (from English, French, and Russian). One thing we learned from this subcorpus is that Jules Verne qualifies very well as the "Calendar Boy of World Literature" (see https://twitter.com/umblaetterer/status/616782492859641856).

Known Issues of German-Fiction Corpus

- Duplicate: Knigge's "Benjamin Noldmann's Geschichte der Aufklärung in Abyssinien" is contained twice (additional occurrence as part of "Traum des Herrn Brick").
- Duplicate: E.T.A. Hoffmann's "Meister Martin der Küfner und seine Gesellen" is already contained in "Die Serapions-Brüder".
- Semi-Duplicate: Gottfried Keller's "Der Grüne Heinrich" is contained in two versions. They are, in fact, different, but there are many identical passages, too. Depending on your research question, this could skew your results.
- Duplicate: Löns's "Dahinten in der Heide" is contained again in the collection "Aus Forst und Flur".
- Duplicate: Löns's "Der letzte Hansbur", "Das zweite Gesicht", "Die Häuser von Ohlenhof" appear again as part of "Sämtl. Werke Teil 7".
- Non-German Original: Strindberg's "Beichte eines Toren" is not a German original.
- Non-Fiction: Heinrich Heine's "Geständnisse" and his "Reisebilder" would have to be regarded as non-fictional.
- Non-Fiction: Seume's "Spaziergang nach Syrakus" is non-fictional.
- there might be more :-)
