Effective Crowdsourcing of Multiple Tasks for Comprehensive Information Extraction

dataset

posted on 2019-04-10, 07:50 authored by Sangha Nam

Introduction

This dataset aims to propose a standard for Korean information extraction and to promote research in this field. It presents crowdsourced data collected for four information extraction tasks from the same corpus, together with the training and evaluation results of a state-of-the-art model for each task. These are the first machine learning data of their kind for Korean information extraction, and there are plans to keep increasing the data volume. The test results are intended to serve as a baseline for each Korean information extraction task and as a point of comparison for future studies on Korean information extraction that use the data collected in this study. The dataset is available for research purposes.

Description

- There are two crowdsourcing .zip files, wiki-10000-part1&2.zip. Each file contains data for the following tasks:

1) task1-1 : Entity Detection

2) task1-2 : Entity Linking

3) task2 : Co-reference Resolution

4) task4 : Relation Extraction

- For the entity linking model (https://github.com/machinereading/eld-2018), pre-trained embedding files are provided in el-korean.tar.gz

- For the co-reference resolution model (https://github.com/machinereading/CR), pre-trained embedding files are provided in cr-korean.tar.gz

- For the relation extraction model (https://github.com/machinereading/re-gan), a corpus, a dataset and pre-trained embedding files are provided in ko-gan-data.zip

- For the relation extraction model (https://github.com/machinereading/re-re-RL-Crowd), pre-trained embedding files are provided in rerl-korean.tar.gz

How to use

All crowdsourcing files are in JSON format. A detailed example and usage instructions are available here (https://github.com/machinereading/okbqa-7-task4)
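
As a starting point for inspecting the archives, the following is a minimal sketch in Python that reads every JSON file inside a .zip archive without unpacking it to disk. It assumes only what is stated above (the files are JSON inside .zip archives). The function names iter_json_records and describe are hypothetical helpers written for illustration, and nothing is assumed about the internal folder layout or the JSON schema, which is documented in the linked repository.

```python
import json
import zipfile


def iter_json_records(zip_path):
    """Yield (member_name, parsed_json) for each .json member of a zip archive.

    Hypothetical helper: it makes no assumption about folder names or
    the schema of the JSON documents stored in the archive.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".json"):
                with archive.open(name) as handle:
                    # json.load accepts a binary file object and detects
                    # UTF-8/16/32 encodings automatically.
                    yield name, json.load(handle)


def describe(document):
    """Return a short description of the top-level structure of a parsed JSON value."""
    if isinstance(document, dict):
        return "object with keys: " + ", ".join(sorted(document))
    if isinstance(document, list):
        return "array with {} items".format(len(document))
    return type(document).__name__
```

To explore an archive, one could loop over iter_json_records(path_to_zip) and print describe(document) for each member before writing any task-specific parsing code.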

Funding

This work was supported by an Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (2013-0-00109, WiseKB: Big data based self-evolving knowledge base and reasoning platform).

Keywords

crowdsourcing, entity linking, relation extraction, coreference resolution, natural language processing, korean, kaist, knowledge base, information extraction

Licence

CC BY 4.0