Extract eTMA data from View Student Screenshot

Purpose

Slay wants student course and mark data to track their progress and success. This information is held in the eTMA system but is not available as a direct data dump. Hence, Slay resorted to looking at the student's record in the eTMA system. However, copying the data from the screen revealed a problem: only the data fields are copied, and some relevant values, such as student TMA scores, are actually button labels, because you can click on the button to retrieve the TMA itself. Button labels do not copy.

The purpose of this exercise, therefore, was to write a script to extract all relevant data – both data values and button labels – from an export of the View Student record web page.

Problem

The generated View Student record web page is invalid HTML. It is malformed, with unbalanced and interleaved HTML elements. Therefore, I cannot use any standard XML parser to extract its content. It is so badly formed that even tolerant HTML parsers fail to make a coherent version of the page for data extraction. On further analysis, this is not surprising given that some tags are missing and others are erroneously interleaved.
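To illustrate why a strict XML parser is ruled out, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The markup fragments and the is_well_formed helper are made up for illustration and are not taken from the exported page; they only mimic the two kinds of fault described above (interleaved tags and an unmatched closing tag).

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if a strict XML parser accepts the markup (hypothetical helper)."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Invented fragments, not copied from the real View Student page.
balanced = "<table><tr><td>72</td></tr></table>"
interleaved = "<table><tr><td><b>72</td></b></tr></table>"  # <b> closed after </td>
extra_close = "<table><tr><td>72</td></tr></tr></table>"    # one </tr> too many

for name, fragment in [("balanced", balanced),
                       ("interleaved", interleaved),
                       ("extra_close", extra_close)]:
    print(name, is_well_formed(fragment))
```

A strict parser rejects the whole document at the first such fault, so no data can be read from a page containing even one of them.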

Consider this web page example that you sent me, C660283X-eTMA Admin System - View Student - ms28868 (02954078-V1).html. The data we want is in table rows, but this page has more closing row tags (</tr>) than opening row tags (<tr>). This imbalance, understandably, confuses both XML and HTML parsing tools.

Using Python’s regex module to perform a non-greedy ‘findall’ of table rows with r'<tr.+?/tr>' returns 44 rows, i.e. an incomplete view of the data.
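The following sketch shows how a non-greedy row search can silently lose data when an opening row tag is missing. The sample markup is invented, and the re.DOTALL flag is an assumption added so that rows spanning several lines would still match; the pattern itself is the one quoted above.

```python
import re

# Invented fragment: the second row has lost its opening <tr> tag.
sample = (
    "<table>"
    "<tr><td>TMA 01</td><td>72</td></tr>"
    "<td>TMA 02</td><td>65</td></tr>"
    "<tr><td>TMA 03</td><td>80</td></tr>"
    "</table>"
)

# Same pattern as in the text; DOTALL is an assumption, not confirmed by the source.
ROW_PATTERN = re.compile(r'<tr.+?/tr>', re.DOTALL)


def find_rows(html):
    """Return every non-greedy <tr ... /tr> match in the page text."""
    return ROW_PATTERN.findall(html)


rows = find_rows(sample)
print(len(rows), "rows matched;", sample.count("</tr>"), "closing row tags present")
for row in rows:
    print(row)
```

In this fragment the row without an opening tag is never matched, so its values are simply absent from the result rather than flagged as an error.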

It is not just the tables that are wrong; other elements are wrong too. These errors also affect retrieving data from the page. For example, there are 10 <div> tags to open a division of the page but only 9 </div> tags to close one.
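A tag imbalance of this kind can be checked with a short script. This is a sketch only: count_tags is a hypothetical helper, the sample markup is invented, and a regular expression count is a rough diagnostic, not a parser (it would also count tags inside comments or scripts).

```python
import re


def count_tags(html, tag):
    """Return (opening, closing) counts for one element name (hypothetical helper)."""
    name = re.escape(tag)
    # \b stops <tr matching longer names such as <track.
    opening = re.findall(rf'<{name}\b[^>]*>', html, re.IGNORECASE)
    closing = re.findall(rf'</{name}\s*>', html, re.IGNORECASE)
    return len(opening), len(closing)


# Invented fragment with one unclosed <div> and one surplus </tr>.
sample = (
    '<div><div class="main"><table>'
    "<tr><td>72</td></tr></tr>"
    "</table></div>"
)

for tag in ("div", "tr"):
    opened, closed = count_tags(sample, tag)
    status = "balanced" if opened == closed else "UNBALANCED"
    print(f"{tag}: {opened} opened, {closed} closed ({status})")
```

Running the same check against the exported page for each element of interest would give a quick list of which elements are unbalanced before any extraction is attempted.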

Summary

This work cannot sensibly be completed within the remit of the existing project; it requires too much effort. I suggest finding another way to export the data from the eTMA system.