spaCy pipeline for English Named Entity Linking to Wikipedia/Wikidata
Dataset posted on 17.10.2020 by Ben Hammersley
A modified version of the standard spaCy model en_core_web_lg (described by spaCy as an "English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Assigns word vectors, POS tags, dependency parses and named entities."), extended with an Entity Linking component. The entity linker was trained on the first 300,000 lines of a gold_entities.jsonl file, which was itself generated from a complete dump of Wikidata and English Wikipedia taken on 11 October 2020.