Pedagogical Roles of Natural Language Processing Documents

To enable computational study of the learning utility ("pedagogical value") of a document for a learner, we introduce the notion of "pedagogical roles" of documents as an intermediary component. This dataset is a novel corpus of documents from an expanded ACL Anthology corpus, annotated with their pedagogical roles.

The current version includes the following pedagogical roles:

- Survey: Is this document a broad survey? A broad survey examines or compares across a broad concept.
- Tutorial: Is this document a tutorial? Tutorials describe a coherent process about how to use tools or understand a concept, and teach by example.
- Resource: Does this document describe the authors' implementation of a system, corpus, or other resource that has been distributed (e.g., public data sets or tools that have been released under an open-source license or are commercially available)?
- Reference Work: Is this document a collection of authoritative facts intended for others to refer to? Reports of novel, experimental results are not authoritative facts; the statement "grass is green" is. Reference Works describe different subtopics within a concept.
- Empirical Results: Does this document describe results of the authors' experiments?
- Software Manual: Is this document a manual describing how to use different components of a piece of software?
- Other: Any other role. This includes theoretical papers, papers that present a rebuttal of a claim, thought experiments, etc.

The dataset includes the following files:

- annotations_raw_average.tsv: Averaged raw annotations. Each pedagogical role score is the average over all annotations of that role for the document.
- annotations_bin.tsv: Binarized version of the annotations. A document belongs to a pedagogical role if a majority of the annotators agree.
- pedagogical_roles.bib: Metadata for the documents in the annotated corpus. Documents with a source of "web-supplementary" are supplementary documents that were annotated internally.
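
The relationship between the averaged and binarized annotation files can be sketched roughly as follows. This is a minimal illustration, not the authors' actual processing code. It assumes that each raw annotation is a yes/no vote (1 or 0) per role, and that "majority" means strictly more than half of the annotators; how ties were handled in the released files is not stated here. The function names and the example votes are hypothetical.

```python
def average_scores(votes):
    """Average the per-annotator votes for each role.

    `votes` maps a role name to a list of 0/1 votes, one per annotator
    (an assumed representation, not the format of the released files).
    """
    return {role: sum(v) / len(v) for role, v in votes.items()}


def binarize_by_majority(averages, threshold=0.5):
    """Mark a role as present (1) if its average exceeds the threshold.

    A strict comparison is an assumption: under it, a tie does not
    count as a majority.
    """
    return {role: int(score > threshold) for role, score in averages.items()}


# Hypothetical votes for one document.
example_votes = {
    "Survey": [1, 1, 0],     # two of three annotators said yes
    "Tutorial": [0, 0, 1],   # one of three annotators said yes
    "Resource": [1, 0],      # tie: not a majority under this assumption
}

averages = average_scores(example_votes)
labels = binarize_by_majority(averages)
print(averages)
print(labels)
```

Under these assumptions, a document can belong to several pedagogical roles at once, or to none, since each role is thresholded independently.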

If you use this dataset, please cite the following paper, which presents the annotation guidelines, an analysis, and initial baseline classification results.

author = {Emily Sheng and Prem Natarajan and Jonathan Gordon and Gully Burns},
year = {2017},
title = {An Investigation into the Pedagogical Features of Documents},
booktitle = {Proceedings of the 12th Workshop on Innovative Use of NLP for
Building Educational Applications}

Associated work that makes use of this corpus:

author = {Jonathan Gordon and Stephen Aguilar and Emily Sheng and Gully Burns},
year = {2017},
title = {Structured Generation of Technical Reading Lists},
booktitle = {Proceedings of the 12th Workshop on Innovative Use of NLP for
Building Educational Applications}

This research is based upon work supported in part by the Office of
the Director of National Intelligence (ODNI), Intelligence Advanced
Research Projects Activity (IARPA), via Air Force Research Laboratory
(AFRL). The views and conclusions contained herein are those of the
authors and should not be interpreted as necessarily representing the
official policies or endorsements, either expressed or implied, of
ODNI, IARPA, AFRL, or the U.S. Government. The U.S. Government is
authorized to reproduce and distribute reprints for Governmental
purposes notwithstanding any copyright annotation thereon.