da_clinical_reprs.tar (706.63 MB)

Word Representations for Clinical Danish

dataset
posted on 27.05.2020, 13:58 by Leon Derczynski
Word embeddings and word clusters for Clinical Danish, derived from the heavily anonymised E4C resource (https://doi.org/10.1177/1460458216647760) and presented here as statistical aggregate data over those records. The vocabulary contains 382,737 words, and each vector has 100 dimensions. The clusters were generated using Generalised Brown clustering with a=2500 and a minimum count of 3; coarser clusters can be generated rapidly from the included mergefile (see https://github.com/sean-chester/generalised-brown/blob/master/cluster_generator/cluster.py).
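
The internal layout of the archive is not described here, so a reasonable first step is to list its contents before loading anything. The following is a minimal Python sketch under stated assumptions: `list_members` uses only the standard `tarfile` module, while `parse_vector_line` is a hypothetical helper that assumes the embeddings are stored as plain text with one word per line followed by space-separated values. That text layout is an assumption, not something confirmed by this record, so check the actual files first.

```python
import os
import tarfile


def list_members(archive_path):
    """Return (name, size_in_bytes) for every regular file in a tar archive."""
    with tarfile.open(archive_path, "r:*") as tar:
        return [(m.name, m.size) for m in tar.getmembers() if m.isfile()]


def parse_vector_line(line, dims=100):
    """Parse one 'word v1 v2 ... vN' line into (word, [floats]).

    ASSUMPTION: a whitespace-separated text layout. The record only states
    that vectors have 100 dimensions, not how they are serialised.
    """
    parts = line.split()
    word, values = parts[0], [float(x) for x in parts[1:]]
    if len(values) != dims:
        raise ValueError(f"expected {dims} values, got {len(values)}")
    return word, values


if __name__ == "__main__":
    path = "da_clinical_reprs.tar"  # file name as given on this page
    if os.path.exists(path):
        for name, size in list_members(path):
            print(f"{size:>12d}  {name}")
```

Once the member names are known, the vector file can be read line by line with `parse_vector_line` if it turns out to match the assumed layout; the mergefile should be processed with the linked `cluster.py` rather than with this sketch.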

A data statement is included.

Funding

Novo Nordisk Foundation NNF19OC0059138, ClinRead

History