Yahoo Password Frequency Corpus
Dataset posted on 23.12.2015 by Joseph Bonneau
This dataset includes sanitized password frequency lists collected from Yahoo in
For details of the original collection experiment, please see:
Bonneau, Joseph. "The science of guessing: analyzing an anonymized corpus of 70
million passwords." IEEE Symposium on Security & Privacy, 2012.
This data has been modified to preserve differential privacy. For details of
this modification, please see:
Jeremiah Blocki, Anupam Datta and Joseph Bonneau. "Differentially Private
Password Frequency Lists." Network and Distributed System Security Symposium (NDSS), 2016.
Each of the 51 .txt files represents one subset of all users' passwords observed
during the experiment period. "yahoo-all.txt" includes all users; every other
file represents a strict subset of that group.
Each file is a series of lines of the format:
with FREQUENCY in descending order. For example, the file:
would represent the frequency list (3, 2, 1, 1, 1), that is, one password
observed 3 times, one observed twice, and three separate passwords observed