PosterDINI2017.pdf (777.83 kB)


posted on 2017-10-11, 11:25, authored by Mandy Neumann, Philipp Schaer
The project Smart Harvesting II, funded by the German Research Foundation (DFG), develops software-based solutions for acquiring and processing scientifically relevant web data. When done manually, this work is very labor-intensive and time-consuming. Where technical support is already in use, it relies on specialised software programs, so-called wrappers, which have to be created and maintained by expert software developers. Our project therefore focuses on developing low-maintenance wrappers that can be operated easily by non-computer scientists, e.g. librarians or documentalists, and that can be continuously adapted to new website structures. For this we rely on the open source solution OXPath, an extension of XPath that allows interaction with a website to be imitated declaratively and data to be extracted in a targeted way in the process. In a workshop with librarians and in tutorials with students, we saw that basic knowledge of XML and XPath is already sufficient to get started with creating and maintaining OXPath wrappers.
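
To give a rough idea of the kind of basic XPath knowledge meant here, the following is a minimal sketch in Python using only the standard library. It uses plain XPath expressions, not OXPath, and does not reproduce the project's own wrappers; the page markup, the class names and the function `extract_publications` are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a publication listing page.
# Real web pages are rarely this clean; this is an assumption of the sketch.
PAGE = """
<html>
  <body>
    <ul id="publications">
      <li class="pub"><span class="title">First paper</span><span class="year">2016</span></li>
      <li class="pub"><span class="title">Second paper</span><span class="year">2017</span></li>
    </ul>
  </body>
</html>
"""


def extract_publications(markup):
    """Return one record per publication entry found in the markup."""
    root = ET.fromstring(markup)
    records = []
    # XPath: select every <li> element whose class attribute equals "pub".
    for item in root.findall(".//li[@class='pub']"):
        # Relative XPath: pick the child <span> elements by their class.
        title = item.findtext("span[@class='title']")
        year = item.findtext("span[@class='year']")
        records.append({"title": title, "year": int(year)})
    return records


records = extract_publications(PAGE)
for record in records:
    print(record["year"], record["title"])
```

A wrapper in this sense is essentially a small set of such path expressions; when a website changes its structure, only the expressions need to be adapted.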

In this poster for the DINI annual conference, we present the Smart Harvesting II project and show a concrete application case in which we used the software solutions from the project to generate structured research data from the web in a simple way. For our purposes, web data is itself research data or the subject of research, for example in the fields of Natural Language Processing, Machine Learning or Information Retrieval. We see great potential in our project to enable non-computer scientists to collect research data from the web. We present a first application case in which an established research data set (an IR test collection) was expanded with OXPath-generated web data.


SCHA 1961/1-2