UQV100 Test Collection - Resource Description
--------------------------------------------------------------------------------
Version: 1.1.0
Previous version: 1.0.0
Date: May 2016
Authors:
  Peter Bailey (Microsoft)
  Alistair Moffat (The University of Melbourne)
  Falk Scholer (RMIT)
  Paul Thomas (Microsoft)

The files in this repository form part of the UQV100 test collection, together
with a set of additional related resources used in the creation of the
collection. The test collection is described in the following paper:

  Peter Bailey, Alistair Moffat, Falk Scholer, Paul Thomas. "UQV100: A Test
  Collection with Query Variability". In Proc. SIGIR 2016, Pisa, Italy,
  July 2016. http://dx.doi.org/10.1145/2911451.2914671

If you use this resource, please cite the paper.

The document corpus that all document ids come from is ClueWeb12
(www.lemurproject.org/clueweb12.php/), restricted to the Category B subset.
People will require their own access to the underlying ClueWeb12 corpus to
access any of the documents referenced by id in this collection. All other
data is freely available from a public data repository,
http://dx.doi.org/10.4225/49/5726E597B8376.

Some data files include URLs that render the ClueWeb12 document via the
corresponding rendering service provided by CMU. We thank Jamie Callan (CMU)
for permission to include these URLs in this collection.

Similarly, the information narrative backstories contained in this test
collection were derived from topics/subtopics created for the TREC Web tracks
run in 2013 and 2014. We reference these TREC-sourced topics/subtopics as a
foreign key relationship. All items in the UQV100 test collection have their
own primary key, with a "UQV100." prefix.

Note that despite our best efforts in authoring the information seeking
narratives herein, there may be some element of topic/subtopic intent drift.
That is, our backstories may not reflect exactly the same goal as the original
TREC topic/subtopic combination. We have created our own relevance labels,
using a similar (but definitely not identical) judging process, and these were
made with respect to the backstories (not the topic/subtopic descriptions). As
such, you absolutely should not re-use the original TREC qrels from the
corresponding topic/subtopic combination when evaluating effectiveness with
the queries in this test collection. Similarly, you should only use these
relevance labels for this UQV100 test collection, and not try to use them for
the topics/subtopics in the corresponding TREC Web test collections. If you do
either, reviewers will rightly reject your experimental conclusions - do so at
your own risk.

The judging was carried out with respect to a custom set of guidelines, which
are included in the files here. Judges had to demonstrate a minimum standard
of understanding of the guidelines by matching the given labels on a set of
gold hits. Subsequently, they were also periodically assessed automatically,
using additional gold hits, to check their ongoing consistency. Both the
qualification and antispam gold hits data files are included.

The six rating labels (and the unjudged indicator) assigned to documents have
the following integer ordinal values in the data files:

     4 : Essential
     3 : Very Useful
     2 : Mostly Useful
     1 : Slightly Useful
     0 : Not Useful
    -1 : Junk
  -100 : Not Judged

Refer to the guidelines document for more details on how these rating labels
should be understood and interpreted.
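
As a convenience when post-processing the data files, the label scale above
can be captured directly in code. The following Python sketch is illustrative
only: the mapping mirrors the table above, while the is_judged() and
binarize() helpers (and the binarization threshold) are our own examples and
are not defined by the collection.

  # Illustrative sketch only: the ordinal label values listed above as a
  # Python mapping. The binarize() threshold is an example choice, not part
  # of UQV100.
  UQV100_LABELS = {
      4: "Essential",
      3: "Very Useful",
      2: "Mostly Useful",
      1: "Slightly Useful",
      0: "Not Useful",
      -1: "Junk",
      -100: "Not Judged",
  }

  def is_judged(label: int) -> bool:
      """True if the document received a real rating (i.e. not -100)."""
      return label != -100

  def binarize(label: int, threshold: int = 1) -> int:
      """Map an ordinal label to binary relevance (example threshold only)."""
      return 1 if is_judged(label) and label >= threshold else 0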
We have also included the TREC TopicQuery in some files, although these were
never shown to judges or other data providers; it can, however, serve as a
convenient shorthand to refer to the specific item under consideration. Again,
we would like to thank Ian Soboroff and Ellen Voorhees (NIST) for permission
to include the TREC topic, subtopic, and query text from the original test
collections herein.

We carried out a two-stage pooling process for the initial Indri-BM25 system's
runs. Subsequently, additional pooling of documents from further systems was
carried out, in multiple judging rounds. These systems were: Indri-LM, Atire,
Terrier-DFR, and Terrier-PL2. We are very grateful to Xiaolu Lu, Andrew
Trotman, Matt Crane, and David Maxwell for their assistance in preparing these
system runs. The runs and pools from these systems are not included; however,
the labeled documents from these pools are included in the corresponding qrels
and item-data files below.

The files in this test collection are as follows:

uqv100-allfiles.zip : (zip archive)
  All of the files making up this archive and listed below, including this
  README.txt.

uqv100-backstories.tsv : (tab-separated)
  The information narratives (backstories) created from TREC topic/subtopic
  descriptions, which were then used as seeds for surrogate users to provide
  queries and effort estimates. The first column contains the unique UQV100 id
  in the form UQV100.xyz. An AuthorId column is present simply to retain
  original data sources. Other columns should be self-explanatory by column
  name.

uqv100-query-variations-and-estimates.tsv : (tab-separated)
  A master file containing raw and processed data, with an entry for each
  query variation obtained (after junk-worker data cleaning), including raw,
  normalized, and normalized-and-spell-corrected variations of the queries.
  The individual document and query expectation estimates are provided, as
  well as per-backstory averages of these values.

uqv100-item-data-raw-labels.tsv : (tab-separated)
  A file of all individually crowd-judged items, one line per judgment. This
  data was used as the input to the following aggregated label files. It can
  also be used in experiments for computing alternative aggregate labels over
  the data; however, there is no gold standard "truth" to which these can be
  compared. Judge ids are anonymized, but consistent. (Caveat: judging of
  different pools of documents was done at different times, and with the
  additional qrels a single judge may end up with at least two anonymized ids,
  from the post-processing of their judging contributions to the pools.) The
  time taken for each judgment (in ms) is included. Items that were judged in
  the initial single-judge round for the depth-10 pooling of the Indri-BM25
  system carry the tag "notShown" in the RealUrl field; all subsequent judging
  also displayed the real URL of the corresponding document from ClueWeb12.
  All labeled documents are contained within the ClueWeb12-Category B dataset.

uqv100-item-data-median-labels.tsv : (tab-separated)
  A file of all judged documents, with their median labels as aggregated from
  the up-to-3 crowd judges. Additional information is included, such as the
  backstory, and other potentially useful data such as the render and original
  URLs and the judging label names. Documents from the pools that were unable
  to be labeled due to page load, login, foreign language, or other issues are
  marked as -100 (Unjudged). All labeled documents are contained within the
  ClueWeb12-Category B dataset.
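
To illustrate how the per-judgment data in uqv100-item-data-raw-labels.tsv
relates to the aggregated median-label files, the following Python sketch
computes a median label per (backstory, document) pair. It is a sketch only:
the column names used (UQV100Id, DocId, Label) are assumptions for
illustration, and should be checked against the actual header row of the file.

  # Illustrative sketch only: aggregate per-judge labels to a median label
  # per (backstory, document) pair. Column names are assumed; consult the
  # header of uqv100-item-data-raw-labels.tsv before use.
  import csv
  import statistics
  from collections import defaultdict

  def median_labels(raw_labels_path):
      per_item = defaultdict(list)
      with open(raw_labels_path, newline="", encoding="utf-8") as f:
          for row in csv.DictReader(f, delimiter="\t"):
              label = int(row["Label"])
              if label != -100:                    # skip unjudged markers
                  per_item[(row["UQV100Id"], row["DocId"])].append(label)
      # statistics.median() can return a .5 value for an even judge count
      return {key: statistics.median(vals) for key, vals in per_item.items()}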
uqv100-item-data-median-cbcc-bcc-majority-labels.tsv : (tab-separated)
  A file of all labeled documents, with their median labels as aggregated from
  the up-to-3 crowd judges. The output from three alternative label
  aggregation algorithms is also included. These are Community-based Bayesian
  Classifier Combination (CommunityBCCLabel), Bayesian Classifier Combination
  (BCCLabel), and a simple majority vote (MajorityVoteLabel). We are grateful
  to Matteo Venanzi for his implementation of these additional algorithms. The
  Community-based BCC algorithm estimated the ideal number of communities as
  8. Additional information is included, such as the backstory, and other
  potentially useful data such as the render and original URLs and the judging
  label names. In contrast to uqv100-item-data-median-labels.tsv, only
  documents that were labeled are included; that is, there are no items with a
  -100 (Unjudged) tag associated with them. All labeled documents are
  contained within the ClueWeb12-Category B dataset.

uqv100-item-data-median-cbcc-bcc-majority-labels-combined.tsv : (tab-separated)
  This file contains all the data from the previous file, plus the same-format
  data for all the judged documents from the additional system runs included
  in this version of the test collection (Indri-LM, Atire, Terrier-DFR, and
  Terrier-PL2). Data is sorted by topic and docid.

uqv100-qrels-median-labels.txt : (space-separated)
  A file in TREC qrels format (no header) for the median labels as aggregated
  from the up-to-3 crowd judges. Each line gives the backstory key (consisting
  of the topic and the subtopic, or 0 if there is no subtopic), the doc id,
  and the label. Labels are expressed in the range -1 to 4, with -100 if the
  document was unjudged for some reason (e.g. the page could not be loaded, or
  required sign-in). All labeled documents are contained within the
  ClueWeb12-Category B dataset.

uqv100-pool-docs-depth10.tsv : (tab-separated)
  A file containing the documents collected by the pooling algorithms,
  including some additional data: the minimum rank position at which the doc
  was found, and the total count of times it was contributed to the pool
  across the user query variations. Pooling was carried out from runs over a
  ClueWeb12-Category B index. This pool file relates only to the initial
  Indri-BM25 system run.

uqv100-pool-docs-depth11plus.tsv : (tab-separated)
  A file containing the documents collected by the pooling algorithms after
  the initial depth-10 pooling. The primary selection mechanism was to choose
  those documents that would contribute most to reducing the residual
  uncertainty in the calculation of INST(T) for the runs. Pooling was carried
  out from runs over a ClueWeb12-Category B index. This pool file relates only
  to the initial Indri-BM25 system run.

uqv100-relevance-guidelines-v1.pdf : (PDF text)
  The initial set of guidelines, used for the initial single-judge evaluation
  of the depth-10 pool of documents. This version of the guidelines is
  included only for reference; all subsequent judging used the following v2
  guidelines document, which showed the underlying document URL to the judge
  to help them form an assessment of the usefulness of the document.

uqv100-relevance-guidelines-v2.pdf : (PDF text)
  The revised set of guidelines, used for the overlap judging (second and
  third judge evaluation) of the depth-10 pool, and for the three-way overlap
  judging of the depth-11plus pool. The primary difference was the
  incorporation of the document's original URL into the judging interface.
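
For reference, a minimal Python sketch for loading uqv100-qrels-median-labels.txt
follows. It assumes only that fields are whitespace-separated, that the last
field on each line is the label and the second-to-last is the doc id, and that
the leading field(s) form the backstory key; verify this layout against the
actual file before relying on it.

  # Illustrative sketch only: read the qrels file into a nested dict of
  # {backstory_key: {docid: label}}. Field layout is assumed, not confirmed.
  def load_qrels(path):
      qrels = {}
      with open(path, encoding="utf-8") as f:
          for line in f:
              fields = line.split()
              if len(fields) < 3:
                  continue                        # skip blank/short lines
              *key_parts, docid, label = fields
              key = " ".join(key_parts)
              qrels.setdefault(key, {})[docid] = int(label)   # -100 == unjudged
      return qrels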
uqv100-antispam.tsv : (tab-separated)
uqv100-goldhits.tsv : (tab-separated)
  Labeled gold hits, with explanations for each label in the goldhits.tsv
  file. These were used in the initial qualification of crowd-sourced judges,
  and for the subsequent dynamic assessment of rating compliance by qualified
  judges.

uqv100-systemInputRun-all-spelledNormQueries.tsv : (tab-separated)
  A simple text-format file for input to systems, containing all queries,
  including duplicates. The first column is a unique key, which carries
  various data attributes separated by '|' characters. The second column is
  the spell-corrected, normalized query.

uqv100-systemInputRun-uniqueOnly-spelledNormQueries.tsv : (tab-separated)
  A simple text-format file for input to systems, containing only a single
  occurrence of each unique query for a backstory. The key in the first column
  is extended with a count of how many occurrences of that query occur.

uqv100-systemInputRun-uniqueOnly-spelledNormQueries-trecTopicInput.xml : (XML format)
  A version of the unique-queries-only input file, formatted consistently with
  the current TREC XML topic input.
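
As a usage illustration for the systemInputRun files above, the following
Python sketch reads the two-column TSV layout described: the unique key in the
first column (with attributes separated by '|' characters, whose individual
fields are not enumerated here) and the spell-corrected, normalized query in
the second column. Whether a header row is present should be checked against
the actual files before use.

  # Illustrative sketch only: read one of the systemInputRun-*.tsv files.
  import csv

  def read_system_input_run(path):
      queries = []
      with open(path, newline="", encoding="utf-8") as f:
          for row in csv.reader(f, delimiter="\t"):
              if len(row) < 2:
                  continue                         # skip blank/malformed lines
              key, query = row[0], row[1]
              queries.append({
                  "key": key,
                  "key_parts": key.split("|"),     # attribute fields in the key
                  "query": query,                  # spell-corrected, normalized
              })
      return queries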
--------------------------------------------------------------------------------
Version history:

1.0.0 - May 2016. Original release.

1.1.0 - July 2016. Minor errors found and fixed; three new files added for
query input to systems. Two new item-data files included, containing combined
per-judge labeling data over the newly pooled documents, and aggregated
labeled data. Qrels files updated with the additional documents and median
labels. Approximately 55k documents now have labels.

  README.txt --
    Information was added about the use of the ClueWeb12-Category B dataset as
    the basis for the indexed documents used for the pools and qrels files.
    Updated information on the additional systems included.

  uqv100-query-variations-and-estimates.tsv --
    Four queries had a diacritic or other marks over vowels. These have been
    corrected to use the corresponding English-language character. Affected
    queries from 1.0.0 were replaced:
      UQV100.003 : line 301
      UQV100.008 : line 764
      UQV100.033 : line 3476
      UQV100.060 : line 6490
    Also, various queries had incorrect term filtering for terms consisting of
    punctuation-only characters, and other issues with stripping multiple
    trailing termination punctuation characters were detected. Some phrase
    quote marks were lost and have been restored. Fixes were applied to the
    appropriate lines. There should be minimal differences for search engines
    whose query processing removes punctuation characters. An erroneous
    single-quote/double-quote in the Backstory column of UQV100.078 was
    corrected.

  uqv100-item-data-raw-labels.tsv --
    An erroneous single-quote/double-quote in the Backstory column of
    UQV100.078 was corrected. Doubled double-quote characters were replaced by
    single double-quotes, and leading and trailing double-quote characters
    were removed from the Backstory column data. Also includes all the new raw
    judging data from the labeled depth-10 pooled documents of the additional
    system runs.

  uqv100-item-data-median-cbcc-bcc-majority-labels-combined.tsv --
    A combined file that merges the original pooling with the documents from
    the subsequent system run pools.

  uqv100-item-data-median-labels-combined.tsv --
    A combined file that merges the original pooling with the documents from
    the subsequent system run pools. (Median labels only.)

  uqv100-qrels-median-labels.txt --
    File now includes all judged docs, across the various systems. Total of
    55,587 labeled docs.

  uqv100-systemInputRun-all-spelledNormQueries.tsv
  uqv100-systemInputRun-uniqueOnly-spelledNormQueries.tsv
  uqv100-systemInputRun-uniqueOnly-spelledNormQueries-trecTopicInput.xml --
    New files that can be used as input for systems when processing all
    queries in the test collection.