Data and tools for studying isograms

2017-07-31T13:57:59Z (GMT) by Florian Breit
A collection of datasets and Python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC).<br><br>Below follows a brief description, first, of the included datasets and, second, of the included scripts.<br><br><b>1. Datasets<br></b>The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.<br><br><i>1.1 CSV format</i><br>The CSV files for each dataset come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.<br><br>The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure; see the section below):<br><br><table> <tr> <td><b>Label</b></td> <td><b>Data type</b></td> <td><b>Description</b></td> </tr> <tr> <td>isogramy</td> <td>int</td> <td>The order of isogramy, e.g. 
"2" is a second order isogram</td> </tr> <tr> <td>length</td> <td>int</td> <td>The length of the word in letters</td> </tr> <tr> <td>word</td> <td>text</td> <td>The actual word/isogram in ASCII</td> </tr> <tr> <td>source_pos</td> <td>text</td> <td>The Part of Speech tag from the original corpus</td> </tr> <tr> <td>count</td> <td>int</td> <td>Token count (total number of occurences)</td> </tr> <tr> <td>vol_count</td> <td>int</td> <td>Volume count (number of different sources which contain the word)</td> </tr> <tr> <td>count_per_million</td> <td>int</td> <td>Token count per million words</td> </tr> <tr> <td>vol_count_as_percent</td> <td>int</td> <td>Volume count as percentage of the total number of volumes</td> </tr> <tr> <td>is_palindrome</td> <td>bool</td> <td>Whether the word is a palindrome (1) or not (0)</td> </tr> <tr> <td>is_tautonym</td> <td>bool</td> <td>Whether the word is a tautonym (1) or not (0)</td> </tr> </table><br><br>The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:<br><br><table> <tr> <td> <strong>Label</strong> </td> <td> <strong>Data type</strong> </td> <td> <strong>Description</strong> </td> </tr> <tr> <td> !total_1grams </td> <td> int </td> <td> The total number of words in the corpus </td> </tr> <tr> <td> !total_volumes </td> <td> int </td> <td> The total number of volumes (individual sources) in the corpus </td> </tr> <tr> <td> !total_isograms </td> <td> int </td> <td> The total number of isograms found in the corpus (before compacting) </td> </tr> <tr> <td> !total_palindromes </td> <td> int </td> <td> How many of the isograms found are palindromes </td> </tr> <tr> <td> !total_tautonyms </td> <td> int </td> <td> How many of the isograms found are tautonyms </td> </tr> </table><br><br>The CSV files are mainly useful for further automated data processing. 
For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.<br><br><i>1.2 SQLite database format</i><br>The SQLite database, on the other hand, combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:<br><br>• Compacted versions of each dataset, where identical headwords are combined into a single entry.<br>• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.<br>• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.<br><br>The intersected dataset is by far the least noisy, but is missing some real isograms, too.<br><br>The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above.<br><br>To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.<br><br><b>2. Scripts</b><br>There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. 
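As an illustration of the properties the extraction step tests for, the notions of isogram order, palindrome and tautonym can be sketched in a few lines of Python. This is a minimal re-implementation for clarity, not the actual code from isograms.py:

```python
from collections import Counter

def isogram_order(word):
    """Return n if every letter occurs exactly n times, else 0.
    A first-order isogram repeats no letter; in a second-order
    isogram every letter occurs exactly twice, and so on."""
    counts = set(Counter(word.lower()).values())
    return counts.pop() if len(counts) == 1 else 0

def is_palindrome(word):
    """True if the word reads the same backwards."""
    w = word.lower()
    return w == w[::-1]

def is_tautonym(word):
    """True if the word consists of two identical halves."""
    w = word.lower()
    half, odd = divmod(len(w), 2)
    return odd == 0 and w[:half] == w[half:]

print(isogram_order("word"))        # 1
print(isogram_order("intestines"))  # 2
print(is_tautonym("murmur"))        # True
```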
The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).<br><br><i>2.1 Source data</i><br>The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from [http://storage.googleapis.com/books/ngrams/books/datasetsv2.html] (download the 1-gram files ending “-a” through “-z”) and [https://www.kilgarriff.co.uk/bnc-readme.html] (download all.al.gz).<br><br>For Ngrams the script expects the path to the directory containing the various files; for the BNC, the direct path to the *.gz file.<br><br><i>2.2 Data preparation</i><br>Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format.<br><br>Tidying and reformatting can be done by running one of the following commands:<br><br>python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE<br><br>python isograms.py --bnc --indir=INFILE --outfile=OUTFILE<br><br>Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.<br><br><i>2.3 Isogram extraction</i><br>After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:<br><br>python isograms.py --batch --infile=INFILE --outfile=OUTFILE<br><br>Here INFILE should refer to the output from the previous data-cleaning process. Please note that the script will actually write two output files: one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.<br><br><i>2.4 Creating a SQLite3 database</i><br>The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. 
The database can be created by following these steps:<br><br>1. Make sure the files with the Ngrams and BNC data are named “ngrams-isograms.csv” and “bnc-isograms.csv” respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other.)<br>2. Copy the “create-database.sql” script into the same directory as the two data files.<br>3. On the command line, go to the directory where the files and the SQL script are.<br>4. Type: sqlite3 isograms.db < create-database.sql<br>5. This will create a database called “isograms.db”.<br><br>See section 1 for a basic description of the output data and how to work with the database.<br><br><i>2.5 Statistical processing</i><br>The repository includes an R script (R version 3) named “statistics.r” that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
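For readers who prefer Python over R, a query against the database might look like the following. This is a sketch only: it builds an in-memory table mirroring the column layout from section 1, and the table name and sample rows are invented; check the actual table names in isograms.db (e.g. with ".tables" in the sqlite3 shell) before adapting it.

```python
import sqlite3

# Hypothetical table mirroring the column layout described in section 1.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE isograms (
        isogramy INTEGER, length INTEGER, word TEXT, source_pos TEXT,
        count INTEGER, vol_count INTEGER, count_per_million INTEGER,
        vol_count_as_percent INTEGER, is_palindrome INTEGER,
        is_tautonym INTEGER
    )
""")
# Invented sample rows (frequency figures are placeholders, not corpus values).
rows = [
    (1, 15, "dermatoglyphics", "NOUN", 10, 5, 0, 0, 0, 0),
    (1, 10, "pathfinder", "NOUN", 120, 40, 0, 0, 0, 0),
    (2, 10, "intestines", "NOUN", 300, 80, 0, 0, 0, 0),
]
conn.executemany("INSERT INTO isograms VALUES (?,?,?,?,?,?,?,?,?,?)", rows)

# Longest first-order isograms first:
longest = conn.execute(
    "SELECT word, length FROM isograms WHERE isogramy = 1 "
    "ORDER BY length DESC"
).fetchall()
print(longest)  # [('dermatoglyphics', 15), ('pathfinder', 10)]
```

Against the real database, you would replace ":memory:" with the path to isograms.db and drop the CREATE/INSERT statements.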