Assorted design notes documenting decisions.
This document records design decisions, explaining them and, where appropriate, highlighting future implications. There is some overlap with the project final report, hence the references you will see to Report.docx. In such cases, look for a section in the report with the same name as the section in this document.
Currently (09 November 2016) we apply four criteria:
While you could get away with a simple string search for 'http', it is not implemented that way in this framework.
Python's string find method does work as a means of searching for http in a post. Indeed, if passed 'http' it will match both 'http' and 'https', the former being a substring of the latter. However, this approach means that any use of 'http' in the post will match. There are very few uses of 'http' in the text of a post, and as it happens, in every case for which we currently have data, that post also has exactly the sort of external link we are looking for. However, there is no guarantee that this will hold true in future forum analyses.
Therefore, the search for the hypertext protocol as an indicator of an external link is achieved using a regular expression that requires a following colon too.
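A minimal sketch of the idea (the function name and exact pattern are illustrative, not necessarily the framework's):

```python
import re

# Requiring the colon after the scheme matches 'http:' and 'https:' in
# links while skipping incidental mentions of 'http' in prose.
LINK_RE = re.compile(r'https?:')

def has_external_link(post_text):
    """Return True if the post text appears to contain an external link."""
    return LINK_RE.search(post_text) is not None

assert has_external_link("See https://example.org for details.")
assert not has_external_link("I typed http into the editor by mistake.")
```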
Forums with 'preferences' in them, found when looking for 'eferences':
| Forum | regex w | regex w/o | regex tot | string w | string w/o | string tot |
|-------|--------:|----------:|----------:|---------:|-----------:|-----------:|
| M813 15E Software issues | 4 | 48 | 52 | 5 | 47 | 52 |
| M813 14E Tracking the leading edge forum | 187 | 279 | 466 | 189 | 277 | 466 |
| M813 14E Organisation and scope forum | 14 | 544 | 558 | 15 | 543 | 558 |
| M811 15K Module discussion | 711 | 1476 | 2187 | 713 | 1474 | 2187 |

(w = posts selected, w/o = posts not selected, tot = total posts in the forum; 'regex select' and 'string select' are the two selection methods compared.)
Simple inline reference favoured by students because it is easy to type into a forum post using the VLE editor.
Seems to work as a proxy for the full Harvard-style reference that students are meant to use.
Currently (09 November 2016) the scripts are run directly from a command line. To change the folders a script uses, you need to edit the script. Ideally, the folders should be command-line arguments, or selected via a GUI.
This really needs reworking, as noted in scripts/TODO.md. A sketch of the argument-based approach follows.
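A minimal sketch using the standard library's argparse; the option names and defaults are illustrative, not the framework's:

```python
import argparse
from pathlib import Path

def parse_args():
    """Read the input and output folders from the command line."""
    parser = argparse.ArgumentParser(description="Analyse forum XML data.")
    parser.add_argument("--input-dir", type=Path, default=Path("data"),
                        help="folder containing the forum XML files")
    parser.add_argument("--output-dir", type=Path, default=Path("reports"),
                        help="folder to write reports into")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(f"Reading from {args.input_dir}, writing to {args.output_dir}")
```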
Only the final version of a post is included in the analysis. Earlier drafts are excluded by testing the post's 'oldversion' value. This test is implemented in the is_current() function, used in several scripts.
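A hedged sketch of that test; the post representation (a dict of parsed XML fields) and the exact 'oldversion' encoding are assumptions here, not taken from the framework:

```python
def is_current(post):
    """Return True for the final version of a post, False for earlier drafts.

    Assumes posts are dicts of parsed forum XML fields and that
    'oldversion' is '0' for the current version -- an illustrative guess.
    """
    return post.get('oldversion') == '0'
```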
All scripts are Python, so let's use this efficient form of data storage. This might be an issue if the data has to be shared with tools written in another language.
See Report.docx. Note: the current (17 November 2016) filterfalse.py is derived from an earlier version of filter_posts.py; see Git commit 8a10c39 ("Add qualifier to report name for clarity in reports folder.", 2016-11-12) for the difference.
See Report.docx.
See Report.docx.
See Report.docx.
To reduce the work of analysing forum text in suggest_keywords.prepare_text(), stopwords are removed. The list used is that developed by the University of Glasgow's School of Computer Science, downloaded from <http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words>.
This is used in preference to nltk's list of stopwords because:
stop_words.txt
To make the keyword suggestion processing and output manageable, the framework follows standard practice and excludes non-alphabetic ‘words’ in the text, such as numbers, words of only one or two characters, and stop words, i.e. common words such as ‘and’ and ‘the’. See the function suggest_keywords.prepare_text().
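A hedged sketch of that filtering, assuming the stop word list has been loaded into a set; the real prepare_text() may differ in detail:

```python
import re

def prepare_text(text, stopwords):
    """Tokenize text, keeping only words worth suggesting as keywords."""
    words = re.findall(r'[a-z]+', text.lower())   # alphabetic tokens only
    return [w for w in words
            if len(w) > 2             # drop one- and two-character words
            and w not in stopwords]   # drop common words such as 'and', 'the'
```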
Moodle Backup Zip files, file extension .mbz, are actually tar files using gzip for compression. They are our means of downloading the raw forum XML data from Moodle.
LTS could not provide us with the forum XML data used by Moodle, but could provide us with a full backup of the course. Therefore, we accepted a workflow that started with LTS providing us with this manual backup from Moodle, followed by our manual extraction of the forum XML from the backup. (I used 7-zip.)
Development of an extract_forum.py script was halted during the project because it was being built on the basis of a manual full dump of the course, and we were unsure whether this was a long-term method of acquiring the forum XML data.
However, since the project ended I have completed extract_forums.py to automate the extraction of forum XML data from Moodle .mbz backups, because I do not think that there will be an alternative workflow providing a direct export from Moodle of just the chosen forum files any time soon. Be aware, this may change!
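Because an .mbz file is just a gzip-compressed tar archive, the extraction can be done with Python's standard tarfile module. A minimal sketch, not the actual extract_forums.py; the forum.xml file naming is an assumption about the backup's internal layout:

```python
import tarfile

def extract_forum_xml(mbz_path, dest_dir):
    """Unpack only the forum XML files from a Moodle .mbz backup."""
    with tarfile.open(mbz_path, mode='r:gz') as backup:
        forum_files = [m for m in backup.getmembers()
                       if m.name.endswith('forum.xml')]  # assumed layout
        backup.extractall(path=dest_dir, members=forum_files)
```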
matplotlib warns if you have more than 20 figures open at a time, its default limit guarding against excessive memory use. The sample forum data contains 26 forums; hence, draw_posts.py draws 26 plots and matplotlib issues a warning. This warning can be ignored because the script is generating simple line charts only, and will not strain matplotlib's memory.
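If the warning is a nuisance, closing each figure once saved avoids it altogether. A hedged sketch; the forum fields here are illustrative, not those used by draw_posts.py:

```python
import matplotlib.pyplot as plt

def draw_forum_charts(forums):
    """Draw one line chart per forum, closing each figure after saving."""
    for forum in forums:
        fig, ax = plt.subplots()
        ax.plot(forum.dates, forum.post_counts)  # illustrative fields
        fig.savefig(f"{forum.name}.png")
        plt.close(fig)  # keeps the open-figure count below the warning limit
```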
Looking to produce a simple visualisation in the framework, I used the well-known Python package matplotlib for this task. There are no magic criteria involved in this choice over, say, Pillow; simply that matplotlib does the job easily and is a package that is going to stay around.
Not knowing exactly which NLP techniques would be required by the project, I chose nltk as a Python package that should cover all likely requirements.
The use of NLP for NER has been dropped from the framework. See Report.docx.
This leaves only tokenization and stemming as services provided by nltk. There is possible future work that calls on NLP techniques met by nltk. In their absence, it is possible to remove the external dependency on nltk by implementing a simple tokenizer (remove punctuation and split on whitespace), but stemming remains a problem. You will not want to write your own stemmer, so you still need a Python package. An excellent alternative to nltk for stemming is snowballstemmer, especially in conjunction with PyStemmer; strongly recommended, because it is far faster than nltk. Note, it will give slightly different results to the existing use of nltk in the framework, because nltk is set to use the Lancaster stemming algorithm whereas snowballstemmer uses Porter2. Both algorithms are an advance on Porter, and in the context of this research it doesn't matter which we use.
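For comparison, a short sketch of the two stemmers side by side; the word list is illustrative:

```python
from nltk.stem import LancasterStemmer
import snowballstemmer

lancaster = LancasterStemmer()                # the framework's current choice
porter2 = snowballstemmer.stemmer('english')  # Snowball's Porter2 algorithm

words = ['running', 'references', 'tokenization']
print([lancaster.stem(w) for w in words])
print(porter2.stemWords(words))  # stems differ slightly from Lancaster's
```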
Using itertools.groupby() to group posts by date is just not worth it! It is more trouble than it's worth to work out how to consume the multiple iterators it would return.
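For the record, a hedged illustration of the pitfall: groupby() yields sub-iterators that are invalidated as soon as you advance to the next group, so each group must be materialised before moving on (the post tuples are illustrative):

```python
from itertools import groupby

posts = [('2016-11-09', 'a'), ('2016-11-09', 'b'), ('2016-11-10', 'c')]

# Input must already be sorted by the grouping key, and each group has to
# be consumed (here, into a list) before the outer iterator advances.
grouped = {date: [body for _, body in group]
           for date, group in groupby(posts, key=lambda p: p[0])}
print(grouped)  # {'2016-11-09': ['a', 'b'], '2016-11-10': ['c']}
```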
It might appear tempting to flatten the nested if statements for is_current() and has_links(), in keeping with the Zen of Python. However, this has an unintended consequence: if flattened, the else clause processes all previous edits of a post, whereas we want to analyse the final versions of posts only.
The purpose of the else clause is to let us count the number of posts that do not include external links, hence we cannot remove it. Simply subtracting the number of posts with links from the total number of posts gives a misleading number, because draft versions of a post would be included. The sketch below illustrates the point.
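A hedged sketch of why the nesting matters; the counter names are illustrative, not the framework's:

```python
def count_links(posts):
    """Count final-version posts with and without external links."""
    with_links = without_links = 0
    for post in posts:
        if is_current(post):        # final version of the post only
            if has_links(post):
                with_links += 1
            else:                   # current posts WITHOUT external links
                without_links += 1
        # Flattening to `if is_current(post) and has_links(post):` would
        # send every earlier draft into the else branch as well.
    return with_links, without_links
```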
Similarly for filterfalse_posts.filterfalse_forum().
But this is not true for track_filtered_posts.get_post_metadata(), because there is no else clause. We want to process all posts that are the final version and have external links, and ignore the other posts. Hence, we flatten the if statements in this function.
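Conversely, a sketch of the safe flattened form; metadata() stands in for whatever the real function extracts and is hypothetical:

```python
def get_post_metadata(posts):
    # With no else clause, the two tests can be combined without changing
    # which posts are processed.
    return [metadata(post) for post in posts
            if is_current(post) and has_links(post)]
```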