%0 Generic %A Wulczyn, Ellery %D 2017 %T Wikipedia Navigation Vectors %U https://figshare.com/articles/dataset/Wikipedia_Vectors/3146878 %R 10.6084/m9.figshare.3146878.v6 %2 https://ndownloader.figshare.com/files/7394782 %2 https://ndownloader.figshare.com/files/4993342 %2 https://ndownloader.figshare.com/files/4993345 %2 https://ndownloader.figshare.com/files/4993348 %2 https://ndownloader.figshare.com/files/6401421 %2 https://ndownloader.figshare.com/files/6401424 %2 https://ndownloader.figshare.com/files/7455673 %2 https://ndownloader.figshare.com/files/7554667 %2 https://ndownloader.figshare.com/files/7554670 %K Wikipedia Traffic Data %K Deep Learning %K Wikipedia %K Wikidata %K Information Systems %K Sociology %K Applied Computer Science %X

In this project, we learned embeddings for Wikipedia articles and Wikidata items by applying Word2vec models to a corpus of reading sessions.

Although Word2vec models were developed to learn word embeddings from a corpus of sentences, they can be applied to any kind of sequential data. The learned embeddings have the property that items with similar neighbors in the training corpus have similar representations (as measured by cosine similarity, for example). Consequently, applying Word2vec to reading sessions results in article embeddings in which articles that tend to be read in close succession have similar representations. Since people usually generate sequences of semantically related articles while reading, these embeddings also capture semantic similarity between articles.
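
As a rough sketch (not the project's actual pipeline), training such a model with gensim's Word2Vec could look like the following; each reading session is treated as a sentence and each article title as a token, and the session data and parameters shown here are illustrative assumptions:

from gensim.models import Word2Vec

# Each "sentence" is one reading session: the ordered sequence of articles
# a reader visited. The titles below are illustrative only.
sessions = [
    ["Barack_Obama", "President_of_the_United_States", "White_House"],
    ["Photosynthesis", "Chlorophyll", "Chloroplast"],
    # ... millions more sessions
]

# Treat each session as a sentence and each article as a token.
model = Word2Vec(sentences=sessions, vector_size=100, window=5,
                 min_count=1, sg=1, workers=4)

# Articles that tend to be read in close succession end up close in the
# embedding space, as measured by cosine similarity.
print(model.wv.most_similar("Barack_Obama", topn=5))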

There have been several approaches to learning vector representations of Wikipedia articles that capture semantic similarity by using the article text or the links between articles. An advantage of training Word2vec models on reading sessions is that they learn from the actions of millions of readers, who use a diverse array of signals, including the article text, links, third-party search engines, and their existing domain knowledge, to determine what to read next in order to learn about a topic.

An additional benefit of not relying on text or links is that we can learn representations for Wikidata items simply by mapping article titles within each session to Wikidata items using Wikidata sitelinks. As a result, these Wikidata vectors are jointly trained over reading sessions from all Wikipedia language editions, allowing the model to learn from people across the globe. This approach also overcomes data-sparsity issues for smaller Wikipedias, since an article's representation is shared across many other, potentially larger, language editions. Finally, instead of needing to train a separate embedding for each Wikipedia language edition, we have a single model that gives a vector representation for any article in any language, provided the article has been mapped to a Wikidata item.
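
As a minimal sketch of that mapping step (the sitelink table, session data, and parameters below are hypothetical, not the project's actual code), article titles in each session can be replaced by their Wikidata item IDs before a single model is trained over sessions from all languages:

from gensim.models import Word2Vec

# Hypothetical sitelink table mapping (language, title) pairs to Wikidata items.
sitelinks = {
    ("en", "Barack_Obama"): "Q76",
    ("de", "Barack_Obama"): "Q76",
    ("en", "White_House"): "Q35525",
}

# Reading sessions from different language editions.
raw_sessions = [
    [("en", "Barack_Obama"), ("en", "White_House")],
    [("de", "Barack_Obama"), ("de", "Weisses_Haus")],
]

# Replace each article with its Wikidata item so sessions from every
# language edition share one vocabulary of item IDs.
qid_sessions = [
    [sitelinks[page] for page in session if page in sitelinks]
    for session in raw_sessions
]

# One model trained over the merged, language-independent corpus.
model = Word2Vec(sentences=qid_sessions, vector_size=100, window=5,
                 min_count=1, sg=1, workers=4)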

For detailed documentation, see the wiki page.

%I figshare