ColloCaid Sample Data

dataset

posted on 2021-03-05, 14:33 authored by Ana Frankenberg-Garcia, Geraint Paul Rees, Robert Lew

COLLOCAID SAMPLE DATA

The ColloCaid Sample Data comprises approximately 2% of the ColloCaid lexical database. The sample covers 692 strong academic English collocations (LogDice > 5.0) for 16 core academic lemmas used as collocation bases (or nodes): 5 nouns, 5 verbs, and 6 adjectives. The selection aims to give an overview of the range of data included in the full dataset. It includes collocation bases classified with more than one part-of-speech tag (e.g. DEBATE, INDIVIDUAL), polysemous collocation bases that give rise to distinct collocation patterns (e.g. CODE), and collocation bases that evoke either a very large or a very small number of collocations. The eight strongest lexical collocations listed for each base are each enriched with three curated example sentences adapted from corpora of expert academic English writing.
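As an illustration of the LogDice > 5.0 threshold mentioned above, the following is a minimal Python sketch of the logDice association score as it is commonly defined in corpus tools (14 plus the base-2 logarithm of the Dice coefficient). The description above does not state how the ColloCaid scores were computed, so the formula choice, the function names `log_dice` and `is_strong`, and the frequencies in the usage lines are assumptions made for illustration only, not values from the dataset.

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """Return logDice = 14 + log2(2 * f_xy / (f_x + f_y)).

    f_xy: co-occurrence frequency of base and collocate (hypothetical input)
    f_x:  corpus frequency of the base
    f_y:  corpus frequency of the collocate
    """
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def is_strong(f_xy: int, f_x: int, f_y: int, threshold: float = 5.0) -> bool:
    """Apply a 'strong collocation' cut-off (5.0, as in the sample description)."""
    return log_dice(f_xy, f_x, f_y) > threshold


# Made-up frequencies, chosen only so the arithmetic is easy to follow.
print(log_dice(4, 1024, 1024))   # 14 + log2(8 / 2048) = 6.0
print(is_strong(4, 1024, 1024))  # True, since 6.0 > 5.0
print(is_strong(1, 1024, 1024))  # False, since 4.0 is not above 5.0
```

Because the score depends only on the ratio of the co-occurrence frequency to the two item frequencies, it has a theoretical maximum of 14 and is comparable across corpora of different sizes, which is one plausible reason for using a fixed cut-off such as 5.0.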

COLLOCAID LEXICAL DATA 1.1

The full ColloCaid lexical dataset consists of:

• 572 core academic English lemmas (311 nouns, 184 verbs and 77 adjectives)

• 32,645 academic collocations with the above lemmas

• 29,028 example sentences of collocations in context

Further information at http://www.collocaid.uk/

Funding

AHRC AH/P003508/1

Keywords

lexical dataset, collocations, Academic English, Applied Linguistics and Educational Linguistics, Computational Linguistics, Lexicography, Linguistics

Licence

CC BY 4.0