UQV100 Test Collection - Resource Description
--------------------------------------------------------------------------------
Version: 1.1.0
Previous version: 1.0.0
Date: May 2016
Authors:
  Peter Bailey (Microsoft)
  Alistair Moffat (The University of Melbourne)
  Falk Scholer (RMIT)
  Paul Thomas (Microsoft)

The files in this repository form part of the UQV100 test collection, together
with a set of additional related resources used in the creation of the
collection. The test collection is described in the following paper:

  Peter Bailey, Alistair Moffat, Falk Scholer, Paul Thomas. "UQV100: A Test
  Collection with Query Variability". In Proc. SIGIR 2016, Pisa, Italy,
  July 2016. http://dx.doi.org/10.1145/2911451.2914671

If you use this resource, please cite the paper.

The document corpus that all document ids come from is ClueWeb12
(www.lemurproject.org/clueweb12.php/), restricted to the Category B subset.
People will require their own access to the underlying ClueWeb12 corpus to
access any of the documents referenced by id in this collection. All other
data is freely available from a public data repository,
http://dx.doi.org/10.4225/49/5726E597B8376.

Some data files include URLs that render the ClueWeb12 document via the
corresponding rendering service provided by CMU. We thank Jamie Callan (CMU)
for permission to include these URLs in this collection.

Similarly, the information narrative backstories contained in this test
collection were derived from topics/subtopics created for the TREC Web tracks
run in 2013 and 2014. We reference these TREC-sourced topics/subtopics as a
foreign key relationship. All items in the UQV100 test collection have their
own primary key, with a "UQV100." prefix.

Note that despite our best efforts in authoring the information seeking
narratives herein, there may be some element of topic/subtopic intent drift.
That is, our backstories may not reflect exactly the same goal as the original
TREC topic/subtopic combination. We have created our own relevance labels,
using a similar (but definitely not identical) judging process, and these were
made with respect to the backstories (not the topic/subtopic descriptions). As
such, you absolutely should not re-use the original TREC qrels from the
corresponding topic/subtopic combination when evaluating effectiveness with
the queries in this test collection. Similarly, you should only use these
relevance labels for this UQV100 test collection, and not try to use them for
the topics/subtopics in the corresponding TREC Web test collections. If you do
either, reviewers will rightly reject your experimental conclusions - do so at
your own risk.

The judging was carried out with respect to a custom set of guidelines, which
are included in the files here. Judges had to demonstrate a minimum standard
of understanding of the guidelines by matching the given labels on a set of
gold hits. Subsequently, they were also periodically assessed automatically,
using additional gold hits, to check their ongoing consistency. Both the
qualification and antispam gold hits data files are included.

The six rating labels (and the unjudged indicator) assigned to documents have
the following integer ordinal values in the data files:

     4 : Essential
     3 : Very Useful
     2 : Mostly Useful
     1 : Slightly Useful
     0 : Not Useful
    -1 : Junk
  -100 : Not Judged

Refer to the guidelines document for more details on how these rating labels
should be understood and interpreted.
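
As a convenience when post-processing the data files, the label scale above
can be captured directly in code. The following Python sketch is illustrative
only: the mapping mirrors the table above, while the is_judged() and
binarize() helpers (and the binarization threshold) are our own examples and
are not defined by the collection.

  # Illustrative sketch only: the ordinal label values listed above as a
  # Python mapping. The binarize() threshold is an example choice, not part
  # of UQV100.
  UQV100_LABELS = {
      4: "Essential",
      3: "Very Useful",
      2: "Mostly Useful",
      1: "Slightly Useful",
      0: "Not Useful",
      -1: "Junk",
      -100: "Not Judged",
  }

  def is_judged(label: int) -> bool:
      """True if the document received a real rating (i.e. not -100)."""
      return label != -100

  def binarize(label: int, threshold: int = 1) -> int:
      """Map an ordinal label to binary relevance (example threshold only)."""
      return 1 if is_judged(label) and label >= threshold else 0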
We have also included the TREC TopicQuery in some files, although these were
never shown to judges or other data providers; it can, however, serve as a
convenient shorthand to refer to the specific item under consideration. Again,
we would like to thank Ian Soboroff and Ellen Voorhees (NIST) for permission
to include the TREC topic, subtopic, and query text from the original test
collections herein.

We carried out a two-stage pooling process for the initial Indri-BM25 system's
runs. Subsequently, additional pooling of documents from further systems was
carried out, in multiple judging rounds. These systems were: Indri-LM, Atire,
Terrier-DFR, and Terrier-PL2. We are very grateful to Xiaolu Lu, Andrew
Trotman, Matt Crane, and David Maxwell for their assistance in preparing these
system runs. The runs and pools from these systems are not included; however,
the labeled documents from these pools are included in the corresponding qrels
and item-data files below.

The files in this test collection are as follows:

uqv100-allfiles.zip : (zip archive)
  All of the files making up this archive and listed below, including this
  README.txt.

uqv100-backstories.tsv : (tab-separated)
  The information narratives (backstories) created from TREC topic/subtopic
  descriptions, which were then used as seeds for surrogate users to provide
  queries and effort estimates. The first column contains the unique UQV100 id
  in the form UQV100.xyz. An AuthorId column is present simply to retain
  original data sources. Other columns should be self-explanatory by column
  name.

uqv100-query-variations-and-estimates.tsv : (tab-separated)
  A master file containing raw and processed data, with an entry for each
  query variation obtained (after junk-worker data cleaning), including raw,
  normalized, and normalized-and-spell-corrected variations of the queries.
  The individual document and query expectation estimates are provided, as
  well as per-backstory averages of these values.

uqv100-item-data-raw-labels.tsv : (tab-separated)
  A file of all individually crowd-judged items, one line per judgment. This
  data was used as the input to the following aggregated label files. It can
  also be used in experiments for computing alternative aggregate labels over
  the data; however, there is no gold standard "truth" to which these can be
  compared. Judge ids are anonymized, but consistent. (Caveat: judging of
  different pools of documents was done at different times, and with the
  additional qrels a single judge may end up with at least two anonymized ids,
  from the post-processing of their judging contributions to the pools.) The
  time taken for each judgment (in ms) is included. Items that were judged in
  the initial single-judge round for the depth-10 pooling of the Indri-BM25
  system carry the tag "notShown" in the RealUrl field; all subsequent judging
  also displayed the real URL of the corresponding document from ClueWeb12.
  All labeled documents are contained within the ClueWeb12-Category B dataset.

uqv100-item-data-median-labels.tsv : (tab-separated)
  A file of all judged documents, with their median labels as aggregated from
  the up-to-3 crowd judges. Additional information is included, such as the
  backstory, and other potentially useful data such as the render and original
  URLs and the judging label names. Documents from the pools that were unable
  to be labeled due to page load, login, foreign language, or other issues are
  marked as -100 (Unjudged). All labeled documents are contained within the
  ClueWeb12-Category B dataset.
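
To illustrate how the per-judgment data in uqv100-item-data-raw-labels.tsv
relates to the aggregated median-label files, the following Python sketch
computes a median label per (backstory, document) pair. It is a sketch only:
the column names used (UQV100Id, DocId, Label) are assumptions for
illustration, and should be checked against the actual header row of the file.

  # Illustrative sketch only: aggregate per-judge labels to a median label
  # per (backstory, document) pair. Column names are assumed; consult the
  # header of uqv100-item-data-raw-labels.tsv before use.
  import csv
  import statistics
  from collections import defaultdict

  def median_labels(raw_labels_path):
      per_item = defaultdict(list)
      with open(raw_labels_path, newline="", encoding="utf-8") as f:
          for row in csv.DictReader(f, delimiter="\t"):
              label = int(row["Label"])
              if label != -100:                    # skip unjudged markers
                  per_item[(row["UQV100Id"], row["DocId"])].append(label)
      # statistics.median() can return a .5 value for an even judge count
      return {key: statistics.median(vals) for key, vals in per_item.items()}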
uqv100-item-data-median-cbcc-bcc-majority-labels.tsv : (tab-separated)
  A file of all labeled documents, with their median labels as aggregated from
  the up-to-3 crowd judges. The output from three alternative label
  aggregation algorithms is also included. These are Community-based Bayesian
  Classifier Combination (CommunityBCCLabel), Bayesian Classifier Combination
  (BCCLabel), and a simple majority vote (MajorityVoteLabel). We are grateful
  to Matteo Venanzi for his implementation of these additional algorithms. The
  Community-based BCC algorithm estimated the ideal number of communities as
  8. Additional information is included, such as the backstory, and other
  potentially useful data such as the render and original URLs and the judging
  label names. In contrast to uqv100-item-data-median-labels.tsv, only
  documents that were labeled are included; that is, there are no items with a
  -100 (Unjudged) tag associated with them. All labeled documents are
  contained within the ClueWeb12-Category B dataset.

uqv100-item-data-median-cbcc-bcc-majority-labels-combined.tsv : (tab-separated)
  This file contains all the data from the previous file, plus the same-format
  data for all the judged documents from the additional system runs included
  in this version of the test collection (Indri-LM, Atire, Terrier-DFR, and
  Terrier-PL2). Data is sorted by topic and docid.

uqv100-qrels-median-labels.txt : (space-separated)
  A file in TREC qrels format (no header) for the median labels as aggregated
  from the up-to-3 crowd judges. Each line gives the backstory key (consisting
  of the topic and the subtopic, or 0 if there is no subtopic), the doc id,
  and the label. Labels are expressed in the range -1 to 4, with -100 if the
  document was unjudged for some reason (e.g. the page could not be loaded, or
  required sign-in). All labeled documents are contained within the
  ClueWeb12-Category B dataset.

uqv100-pool-docs-depth10.tsv : (tab-separated)
  A file containing the documents collected by the pooling algorithms,
  including some additional data: the minimum rank position at which the doc
  was found, and the total count of times it was contributed to the pool
  across the user query variations. Pooling was carried out from runs over a
  ClueWeb12-Category B index. This pool file relates only to the initial
  Indri-BM25 system run.

uqv100-pool-docs-depth11plus.tsv : (tab-separated)
  A file containing the documents collected by the pooling algorithms after
  the initial depth-10 pooling. The primary selection mechanism was to choose
  those documents that would contribute most to reducing the residual
  uncertainty in the calculation of INST(T) for the runs. Pooling was carried
  out from runs over a ClueWeb12-Category B index. This pool file relates only
  to the initial Indri-BM25 system run.

uqv100-relevance-guidelines-v1.pdf : (PDF text)
  The initial set of guidelines, used for the initial single-judge evaluation
  of the depth-10 pool of documents. This version of the guidelines is
  included only for reference; all subsequent judging used the following v2
  guidelines document, which showed the underlying document URL to the judge
  to help them form an assessment of the usefulness of the document.

uqv100-relevance-guidelines-v2.pdf : (PDF text)
  The revised set of guidelines, used for the overlap judging (second and
  third judge evaluation) of the depth-10 pool, and for the three-way overlap
  judging of the depth-11plus pool. The primary difference was the
  incorporation of the document's original URL into the judging interface.
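
For reference, a minimal Python sketch for loading uqv100-qrels-median-labels.txt
follows. It assumes only that fields are whitespace-separated, that the last
field on each line is the label and the second-to-last is the doc id, and that
the leading field(s) form the backstory key; verify this layout against the
actual file before relying on it.

  # Illustrative sketch only: read the qrels file into a nested dict of
  # {backstory_key: {docid: label}}. Field layout is assumed, not confirmed.
  def load_qrels(path):
      qrels = {}
      with open(path, encoding="utf-8") as f:
          for line in f:
              fields = line.split()
              if len(fields) < 3:
                  continue                        # skip blank/short lines
              *key_parts, docid, label = fields
              key = " ".join(key_parts)
              qrels.setdefault(key, {})[docid] = int(label)   # -100 == unjudged
      return qrels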
uqv100-antispam.tsv : (tab-separated)
uqv100-goldhits.tsv : (tab-separated)
  Labeled gold hits, with explanations for each label in the goldhits.tsv
  file. These were used in the initial qualification of crowd-sourced judges,
  and for the subsequent dynamic assessment of rating compliance by qualified
  judges.

uqv100-systemInputRun-all-spelledNormQueries.tsv : (tab-separated)
  A simple text-format file for input to systems, containing all queries,
  including duplicates. The first column is a unique key, which carries
  various data attributes separated by '|' characters. The second column is
  the spell-corrected, normalized query.

uqv100-systemInputRun-uniqueOnly-spelledNormQueries.tsv : (tab-separated)
  A simple text-format file for input to systems, containing only a single
  occurrence of each unique query for a backstory. The key in the first column
  is extended with a count of how many occurrences of that query occur.

uqv100-systemInputRun-uniqueOnly-spelledNormQueries-trecTopicInput.xml : (XML format)
  A version of the unique-queries-only input file, formatted consistently with
  the current TREC XML topic input.
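
As a usage illustration for the systemInputRun files above, the following
Python sketch reads the two-column TSV layout described: the unique key in the
first column (with attributes separated by '|' characters, whose individual
fields are not enumerated here) and the spell-corrected, normalized query in
the second column. Whether a header row is present should be checked against
the actual files before use.

  # Illustrative sketch only: read one of the systemInputRun-*.tsv files.
  import csv

  def read_system_input_run(path):
      queries = []
      with open(path, newline="", encoding="utf-8") as f:
          for row in csv.reader(f, delimiter="\t"):
              if len(row) < 2:
                  continue                         # skip blank/malformed lines
              key, query = row[0], row[1]
              queries.append({
                  "key": key,
                  "key_parts": key.split("|"),     # attribute fields in the key
                  "query": query,                  # spell-corrected, normalized
              })
      return queries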
--------------------------------------------------------------------------------
Version history:

1.0.0 - May 2016. Original release.

1.1.0 - July 2016. Minor errors found and fixed; three new files added for
query input to systems. Two new item-data files included, containing combined
per-judge labeling data over the newly pooled documents, and aggregated
labeled data. Qrels files updated with the additional documents and median
labels. Approximately 55k documents now have labels.

  README.txt --
    Information was added about the use of the ClueWeb12-Category B dataset as
    the basis for the indexed documents used for the pools and qrels files.
    Updated information on the additional systems included.

  uqv100-query-variations-and-estimates.tsv --
    Four queries had a diacritic or other marks over vowels. These have been
    corrected to use the corresponding English-language character. Affected
    queries from 1.0.0 were replaced:
      UQV100.003 : line 301
      UQV100.008 : line 764
      UQV100.033 : line 3476
      UQV100.060 : line 6490
    Also, various queries had incorrect term filtering for terms consisting of
    punctuation-only characters, and other issues with stripping multiple
    trailing termination punctuation characters were detected. Some phrase
    quote marks were lost and have been restored. Fixes were applied to the
    appropriate lines. There should be minimal differences for search engines
    whose query processing removes punctuation characters. An erroneous
    single-quote/double-quote in the Backstory column of UQV100.078 was
    corrected.

  uqv100-item-data-raw-labels.tsv --
    An erroneous single-quote/double-quote in the Backstory column of
    UQV100.078 was corrected. Doubled double-quote characters were replaced by
    single double-quotes, and leading and trailing double-quote characters
    were removed from the Backstory column data. Also includes all the new raw
    judging data from the labeled depth-10 pooled documents of the additional
    system runs.

  uqv100-item-data-median-cbcc-bcc-majority-labels-combined.tsv --
    A combined file that merges the original pooling with the documents from
    the subsequent system run pools.

  uqv100-item-data-median-labels-combined.tsv --
    A combined file that merges the original pooling with the documents from
    the subsequent system run pools. (Median labels only.)

  uqv100-qrels-median-labels.txt --
    File now includes all judged docs, across the various systems. Total of
    55,587 labeled docs.

  uqv100-systemInputRun-all-spelledNormQueries.tsv
  uqv100-systemInputRun-uniqueOnly-spelledNormQueries.tsv
  uqv100-systemInputRun-uniqueOnly-spelledNormQueries-trecTopicInput.xml --
    New files that can be used as input for systems when processing all
    queries in the test collection.