UK Twitter word embeddings

2016-10-22T16:36:26Z (GMT) by Vasileios Lampos
<div><b>Word embeddings trained on Twitter content geo-located in the United Kingdom</b></div><div><br>The embeddings were trained on approximately 215 million tweets, dated from February 1, 2014 to March 31, 2016, using word2vec as implemented in the gensim library (https://radimrehurek.com/gensim/).</div><div><br></div><div><b>Settings:</b> continuous bag-of-words representation (CBOW), the entirety of a tweet as the context window, negative sampling (5 noise words), and a dimensionality of 512.</div><div><br></div><div>After filtering out words with fewer than 500 occurrences, a vocabulary of 137,421 unigrams was obtained (see <b>vocabulary.txt</b>). The corresponding 512-dimensional embeddings are held in <b>vectors.zip</b>.</div><div><br></div>