Prepare Taxonomy

Workflow

Prepare taxonomy workflow

The first task is to manually collate keywords into an .ini file that follows the conventions of a Windows configuration file. Each classification level is defined by a name in square brackets, and the words listed beneath it are the terms that indicate that level. A line that starts with a semicolon is a comment. The format is straightforward to use, as the samples supplied with the framework show.
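
As an illustration of the format described above, the sketch below parses a small taxonomy with Python's standard configparser module. The level names, terms and the load_taxonomy helper are hypothetical and are not taken from the supplied samples; the framework's own parsing code may differ.

```python
import configparser

# Hypothetical two-level taxonomy, for illustration only
SAMPLE_INI = """\
; Lines starting with a semicolon are comments
[remember]
define
list
recall

[understand]
explain
summarise
"""


def load_taxonomy(text):
    """Return {level: [terms]} parsed from .ini-formatted text."""
    parser = configparser.ConfigParser(
        allow_no_value=True,  # each term is a bare line with no "= value"
        strict=False,         # tolerate a term repeated within a level
    )
    parser.read_string(text)
    return {section: list(parser[section]) for section in parser.sections()}


taxonomy = load_taxonomy(SAMPLE_INI)
print(taxonomy)
```

Note that configparser lower-cases the terms by default and treats "=" and ":" as key/value delimiters, so terms containing those characters would need different handling.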

The second task is to process the .ini file with the prepare_taxonomy script. The script stems and de-duplicates the terms, producing two files:

Pickle is a Python-specific storage format. It is an efficient way of storing the classifications for re-use in Python and, as all scripts in the framework are written in Python, it is the format used here. The classifications could easily be made available to tools written in other languages by also dumping them out as a JSON file, for example.
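
The stem, de-duplicate and store steps can be sketched as follows. This is a minimal sketch under stated assumptions, not the prepare_taxonomy script itself: toy_stem is a crude suffix stripper standing in for whatever stemmer the framework uses, and the file names are hypothetical.

```python
import json
import pickle
import tempfile
from pathlib import Path


def toy_stem(word):
    """Crude suffix stripper standing in for a real stemmer (assumption)."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def prepare(taxonomy):
    """Stem every term and drop duplicates, keeping first-seen order."""
    return {
        level: list(dict.fromkeys(toy_stem(term) for term in terms))
        for level, terms in taxonomy.items()
    }


raw = {"remember": ["list", "lists", "listing", "recall", "recalled"]}
prepared = prepare(raw)

with tempfile.TemporaryDirectory() as tmp:
    pickle_path = Path(tmp) / "taxonomy.pickle"  # hypothetical file name
    json_path = Path(tmp) / "taxonomy.json"      # hypothetical file name

    # Pickle: compact, but readable from Python only
    with pickle_path.open("wb") as fh:
        pickle.dump(prepared, fh)
    # JSON: readable from tools written in other languages
    json_path.write_text(json.dumps(prepared, indent=2), encoding="utf-8")

    with pickle_path.open("rb") as fh:
        from_pickle = pickle.load(fh)
    from_json = json.loads(json_path.read_text(encoding="utf-8"))
```

Both files round-trip to the same dictionary, so the choice between them is mainly about which tools need to read the result.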

Supplied taxonomies

The framework is supplied with a default taxonomy based on Bloom’s Cognitive Taxonomy. See: Krathwohl, David R. (2002) A Revision of Bloom’s Taxonomy: An Overview. Theory Into Practice, 41:4, 212-218.

Two other sample .ini taxonomy files are supplied:

Other classification schemes can be added easily. For example, one classification we would like to explore in a follow-on project is BCS’s SFIA+ (British Computer Society’s IT Skills Framework).

Customised taxonomies

Having established that we can apply a taxonomy to the forum posts, we wondered if a tailored taxonomy could be developed to provide more meaningful results than the Bloom Cognitive Taxonomy originally used. To that end, a new script was written to identify significant words in the existing forums.

Suggest keywords workflow

The suggest_keywords script produces \*_significant_words.html in the reports folder, one file for each forum. The script uses tf-idf (term frequency–inverse document frequency) to identify significant words. This means that words that appear more often in a particular forum (their term frequency) than across all the forums considered together (their document frequency) are identified as significant. This approach produces some insight into each forum, but inevitably also a lot of noise. For example, tutor names such as Ursula appear often in their own forums but not at all in other forums, and so are highlighted as significant in their own forums.
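
To make the scoring concrete, here is a minimal sketch of one common tf-idf formulation, treating each forum as a single document. The weighting used by suggest_keywords is not specified here and may differ; the tf_idf function and the three tiny forums are hypothetical.

```python
import math
from collections import Counter


def tf_idf(forums):
    """Return {forum: {word: score}} for a dict of {forum: [tokens]}.

    Assumes every forum contains at least one token.
    """
    n_forums = len(forums)
    # Document frequency: the number of forums in which each word occurs
    doc_freq = Counter()
    for tokens in forums.values():
        doc_freq.update(set(tokens))

    scores = {}
    for name, tokens in forums.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            # term frequency multiplied by inverse document frequency
            word: (count / total) * math.log(n_forums / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


# Hypothetical forum contents, for illustration only
forums = {
    "forum_a": "ursula thanks for the assignment ursula".split(),
    "forum_b": "thanks for the assignment feedback".split(),
    "forum_c": "the assignment deadline is friday".split(),
}
scores = tf_idf(forums)
top_a = max(scores["forum_a"], key=scores["forum_a"].get)
print(top_a)
```

In this toy data the tutor name scores highest in its own forum, while a word used in every forum, such as "the", scores zero. This is the same effect that produces the noise described above.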

To help understand each forum further and see through this noise, the script has been extended to produce \*_word_counts.csv in the data folder for each forum, and a summary \*_word_counts.csv for the original source folder. These files have two columns:

The data is sorted into Count order, from highest to lowest. Hence, the most common word is first. Reviewing these tables is a crude, but helpful, mechanism to see what words are used in the posts, and may further suggest keywords on which to base future taxonomies.
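
A word-count table of this kind can be produced with the standard library, as in the sketch below. The "Word" column name and the word_counts_csv helper are assumptions for illustration; only the Count ordering is taken from the description above.

```python
import csv
import io
from collections import Counter


def word_counts_csv(tokens):
    """Return CSV text of word counts, most frequent word first."""
    counts = Counter(tokens)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Word", "Count"])  # "Word" is an assumed column name
    # most_common() yields (word, count) pairs, highest count first
    writer.writerows(counts.most_common())
    return buffer.getvalue()


csv_text = word_counts_csv("the cat saw the other cat chase the dog".split())
print(csv_text)
```

Writing to an in-memory buffer keeps the example self-contained; a real script would open a file in the data folder instead.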

The development of a tailored taxonomy could form the basis for future work, but we have not yet pursued it.