Data and tools for studying isograms

2017-07-31T13:57:59Z (GMT) by Florian Breit
A collection of datasets and Python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC).<br><br>Below follows a brief description, first, of the included datasets and, second, of the included scripts.<br><br><b>1. Datasets<br></b>The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.<br><br><i>1.1 CSV format</i><br>The CSV files for each dataset come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.<br><br>The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure; see the section below):<br><br><table> <tr> <td><b>Label</b></td> <td><b>Data type</b></td> <td><b>Description</b></td> </tr> <tr> <td>isogramy</td> <td>int</td> <td>The order of isogramy, e.g. 
"2" is a second order isogram</td> </tr> <tr> <td>length</td> <td>int</td> <td>The length of the word in letters</td> </tr> <tr> <td>word</td> <td>text</td> <td>The actual word/isogram in ASCII</td> </tr> <tr> <td>source_pos</td> <td>text</td> <td>The Part of Speech tag from the original corpus</td> </tr> <tr> <td>count</td> <td>int</td> <td>Token count (total number of occurences)</td> </tr> <tr> <td>vol_count</td> <td>int</td> <td>Volume count (number of different sources which contain the word)</td> </tr> <tr> <td>count_per_million</td> <td>int</td> <td>Token count per million words</td> </tr> <tr> <td>vol_count_as_percent</td> <td>int</td> <td>Volume count as percentage of the total number of volumes</td> </tr> <tr> <td>is_palindrome</td> <td>bool</td> <td>Whether the word is a palindrome (1) or not (0)</td> </tr> <tr> <td>is_tautonym</td> <td>bool</td> <td>Whether the word is a tautonym (1) or not (0)</td> </tr> </table><br><br>The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:<br><br><table> <tr> <td> <strong>Label</strong> </td> <td> <strong>Data type</strong> </td> <td> <strong>Description</strong> </td> </tr> <tr> <td> !total_1grams </td> <td> int </td> <td> The total number of words in the corpus </td> </tr> <tr> <td> !total_volumes </td> <td> int </td> <td> The total number of volumes (individual sources) in the corpus </td> </tr> <tr> <td> !total_isograms </td> <td> int </td> <td> The total number of isograms found in the corpus (before compacting) </td> </tr> <tr> <td> !total_palindromes </td> <td> int </td> <td> How many of the isograms found are palindromes </td> </tr> <tr> <td> !total_tautonyms </td> <td> int </td> <td> How many of the isograms found are tautonyms </td> </tr> </table><br><br>The CSV files are mainly useful for further automated data processing. 
For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.<br><br><i>1.2 SQLite database format</i><br>The SQLite database, on the other hand, combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:<br><br>• Compacted versions of each dataset, where identical headwords are combined into a single entry.<br>• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.<br>• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.<br><br>The intersected dataset is by far the least noisy, but is missing some real isograms, too.<br><br>The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above.<br><br>To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.<br><br><b>2. Scripts</b><br>There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. 
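As an illustration of the properties the extraction step tests for, the notions of isogram order, palindrome and tautonym can be sketched in a few lines of Python. This is a minimal re-implementation for clarity, not the actual code from isograms.py:

```python
from collections import Counter

def isogram_order(word):
    """Return n if every letter occurs exactly n times, else 0.
    A first-order isogram repeats no letter; in a second-order
    isogram every letter occurs exactly twice, and so on."""
    counts = set(Counter(word.lower()).values())
    return counts.pop() if len(counts) == 1 else 0

def is_palindrome(word):
    """True if the word reads the same backwards."""
    w = word.lower()
    return w == w[::-1]

def is_tautonym(word):
    """True if the word consists of two identical halves."""
    w = word.lower()
    half, odd = divmod(len(w), 2)
    return odd == 0 and w[:half] == w[half:]

print(isogram_order("word"))        # 1
print(isogram_order("intestines"))  # 2
print(is_tautonym("murmur"))        # True
```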
The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).<br><br><i>2.1 Source data</i><br>The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from [http://storage.googleapis.com/books/ngrams/books/datasetsv2.html] (download the 1-gram files ending “-a” through “-z”) and [https://www.kilgarriff.co.uk/bnc-readme.html] (download all.al.gz).<br><br>For Ngrams the script expects the path to the directory containing the various files; for the BNC, the direct path to the *.gz file.<br><br><i>2.2 Data preparation</i><br>Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format.<br><br>Tidying and reformatting can be done by running one of the following commands:<br><br>python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE<br><br>python isograms.py --bnc --indir=INFILE --outfile=OUTFILE<br><br>Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.<br><br><i>2.3 Isogram extraction</i><br>After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:<br><br>python isograms.py --batch --infile=INFILE --outfile=OUTFILE<br><br>Here INFILE should refer to the output from the previous data-cleaning process. Please note that the script will actually write two output files: one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.<br><br><i>2.4 Creating a SQLite3 database</i><br>The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. 
The database can be created by following these steps:<br><br>1. Make sure the files with the Ngrams and BNC data are named “ngrams-isograms.csv” and “bnc-isograms.csv” respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other.)<br>2. Copy the “create-database.sql” script into the same directory as the two data files.<br>3. On the command line, go to the directory where the files and the SQL script are.<br>4. Type: sqlite3 isograms.db < create-database.sql<br>5. This will create a database called “isograms.db”.<br><br>See section 1 for a basic description of the output data and how to work with the database.<br><br><i>2.5 Statistical processing</i><br>The repository includes an R script (R version 3) named “statistics.r” that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
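For readers who prefer Python over R, a query against the database might look like the following. This is a sketch only: it builds an in-memory table mirroring the column layout from section 1, and the table name and sample rows are invented; check the actual table names in isograms.db (e.g. with ".tables" in the sqlite3 shell) before adapting it.

```python
import sqlite3

# Hypothetical table mirroring the column layout described in section 1.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE isograms (
        isogramy INTEGER, length INTEGER, word TEXT, source_pos TEXT,
        count INTEGER, vol_count INTEGER, count_per_million INTEGER,
        vol_count_as_percent INTEGER, is_palindrome INTEGER,
        is_tautonym INTEGER
    )
""")
# Invented sample rows (frequency figures are placeholders, not corpus values).
rows = [
    (1, 15, "dermatoglyphics", "NOUN", 10, 5, 0, 0, 0, 0),
    (1, 10, "pathfinder", "NOUN", 120, 40, 0, 0, 0, 0),
    (2, 10, "intestines", "NOUN", 300, 80, 0, 0, 0, 0),
]
conn.executemany("INSERT INTO isograms VALUES (?,?,?,?,?,?,?,?,?,?)", rows)

# Longest first-order isograms first:
longest = conn.execute(
    "SELECT word, length FROM isograms WHERE isogramy = 1 "
    "ORDER BY length DESC"
).fetchall()
print(longest)  # [('dermatoglyphics', 15), ('pathfinder', 10)]
```

Against the real database, you would replace ":memory:" with the path to isograms.db and drop the CREATE/INSERT statements.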