Charts - Evaluation of mAuth

dataset

posted on 2018-12-01, 20:55, authored by Marilena Daquino

The soundness of the conceptual framework of IQ measures and of the ranking model was validated by means of a user study carried out with a web application called mAuth - Mining Authoritativeness in Art History (http://purl.org/emmedi/mauth/search). The application allows users to input the URL of a cataloguing record describing an artwork and to browse the sorted list of attributions fetched from the web of data.

We designed a task-based evaluation. Users performed three tasks remotely and filled in an evaluation form (https://goo.gl/forms/xDLwvCCaEFWm4D5h2). The tasks reproduce three common scenarios in connoisseurship, namely:

Gather information on an artwork whose attribution is unanimously accepted.
Gather information on an artwork whose authorship attribution is debated and that is not sufficiently documented.
Gather information on an artwork whose authorship attribution is debated and that is well-documented.

For each of the three scenarios we measured a number of parameters. For the sake of brevity we discuss here only three measures of user satisfaction: the User Satisfaction (US) measure, the Rank Satisfaction Score (RSS) measure, and the Perception of Authoritativeness Score (PAS) measure.

The US measure captures whether the retrieved information is useful and sufficient to assess the quality of an authorship attribution. Users were asked to answer the question “Was it easy to find sufficient information for validating the most authoritative authorship attribution?”.

The RSS measure captures the user’s satisfaction with the order of results and with the score associated with each information source. Users were asked to answer the question “Do you agree with the ranking of results (i.e. the score attributed to each provided attribution and the order in the list)?”.

The PAS measure is based on the Net Promoter Score (Reichheld and Markey 2011) and captures whether a user would prefer and suggest the top-ranked attribution as the most authoritative one. Users were asked to answer the question “Do you agree with the suggested attribution?”.

Participants rated the US, the RSS and the PAS on a Likert scale from 1 (Strongly disagree) to 5 (Strongly agree). For each of the three measures we calculated the inter-rater agreement by means of the Fleiss kappa measure (Fleiss 1971). Lastly, we collected users’ feedback for improving the ranking model: users were asked to select one or more dimensions that affect the ranking.

We collected feedback from 31 users.

As expected, the US is high in the first and third scenarios (84% of users either agree or strongly agree), since the first artwork is unanimously ascribed to the same artist and the third offers plenty of evidence supporting one attribution over the others. In the second scenario the US is markedly lower (58%), since the attributions are less documented: there are only two sources, both supported by scholars’ opinions, and there is no agreement.

When evaluating RSS, we see that in the first scenario 74% of participants either agree or strongly agree; in the second scenario only 38.7% either agree or strongly agree, while 35.5% neither agree nor disagree, and 25.8% disagree; in the third scenario 81% either agree or strongly agree. The kappa measure is 34%, indicating fair agreement between raters.

When evaluating PAS, we see that in the first scenario 84% either agree or strongly agree; in the second scenario only 42% either agree or strongly agree, while 51.6% neither agree nor disagree; in the third scenario 71% either agree or strongly agree.
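The percentages above are shares of respondents grouped by Likert answer. The following is a minimal sketch of that grouping, not the authors' actual analysis script. The function name `likert_shares` and the response list are hypothetical: the group totals (12, 11 and 8 out of 31) are chosen because they reproduce the RSS percentages reported for the second scenario, while the split between "agree" and "strongly agree" is made up for illustration.

```python
from collections import Counter


def likert_shares(responses):
    """Return the percentage of agree (4-5), neutral (3) and disagree (1-2) answers."""
    if not responses:
        raise ValueError("at least one response is required")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("responses must be integers from 1 to 5")

    counts = Counter(responses)
    n = len(responses)
    groups = {
        "agree": counts[4] + counts[5],
        "neutral": counts[3],
        "disagree": counts[1] + counts[2],
    }
    # Express each group as a percentage of all respondents, one decimal place.
    return {name: round(100 * size / n, 1) for name, size in groups.items()}


# Hypothetical answers of 31 users: 8 disagree, 11 neutral, 12 agree or strongly agree.
# The 9/3 split between "agree" (4) and "strongly agree" (5) is an assumption.
rss_scenario_2 = [2] * 8 + [3] * 11 + [4] * 9 + [5] * 3

shares = likert_shares(rss_scenario_2)
print(shares)
```

Grouping the five categories into three bands only serves the summary percentages; the inter-rater agreement is computed on the five original categories.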

The Fleiss kappa measure (Fleiss 1971) was calculated for the 31 raters who evaluated the three cases according to the five categories of the Likert scale: kappa is 33% for the US measure, 34% for the RSS measure, and 36% for the PAS measure, indicating fair agreement between raters.
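For readers who want to recompute the agreement from the raw answers, Fleiss' kappa can be sketched as follows. This is an illustrative implementation of the standard formula, not the code used in the study. It assumes one table per measure, with one row per case (scenario) and one column per Likert category, each cell holding the number of raters who chose that category. The counts in `example_table` are hypothetical and do not come from the dataset.

```python
def fleiss_kappa(table):
    """Fleiss' kappa for table[i][j] = number of raters assigning case i to category j."""
    n_cases = len(table)
    n_raters = sum(table[0])
    if n_raters < 2:
        raise ValueError("at least two raters are required")
    if any(sum(row) != n_raters for row in table):
        raise ValueError("every case must be rated by the same number of raters")

    n_categories = len(table[0])
    total = n_cases * n_raters

    # Proportion of all ratings that fall into each category.
    p_cat = [sum(row[j] for row in table) / total for j in range(n_categories)]

    # Observed agreement for each case, then averaged over cases.
    p_case = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_observed = sum(p_case) / n_cases

    # Agreement expected by chance.
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1:
        raise ValueError("kappa is undefined when all ratings fall into one category")

    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical counts: 3 cases x 5 Likert categories, 31 raters per case.
example_table = [
    [0, 1, 4, 14, 12],
    [2, 6, 11, 9, 3],
    [0, 2, 4, 15, 10],
]
kappa = fleiss_kappa(example_table)
print(f"Fleiss' kappa: {kappa:.2f}")
```

The function returns kappa as a fraction; multiplying by 100 gives a percentage in the form reported above.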