33 files

Record Linkage Datasets

dataset
posted on 2022-04-02, 20:51, authored by Ahmed Soliman
This simulated dataset is built by corrupting a segment of the Social Security Death Master File (SSDMF), available at https://ssdmf.info/. There are 11 original datasets named ``dsxo``, where `x` runs from 1 to 11 and the suffix `o` stands for `original`. The sizes (number of original records) of these datasets are as follows:

| dataset | size |
|:----------:|:----:|
| ds1o | 10K |
| ds2o | 20K |
| ds3o | 40K |
| ds4o | 80K |
| ds5o | 120K |
| ds6o | 160K |
| ds7o | 200K |
| ds8o | 400K |
| ds9o | 600K |
| ds10o | 800K |
| ds11o | 1M |

These original records are then corrupted via a modified version of the `dsgen` Python script by `Peter Christen`.
The modified/corrupted files are saved as: ``dsxm`` where the suffix `m` stands for `modified`.
The modified records are concatenated with four replicates of the original records, and the result is shuffled with the Linux command-line tool `shuf`.
The resulting datasets are named ``dsx.0`` before shuffling and ``dsx.1`` after shuffling.
Each resulting dataset is therefore five times the size of its original (one modified copy plus four original replicates):

| dataset | size |
|:-------:|:----:|
| ds1.1 | 50K |
| ds2.1 | 100K |
| ds3.1 | 200K |
| ds4.1 | 400K |
| ds5.1 | 600K |
| ds6.1 | 800K |
| ds7.1 | 1M |
| ds8.1 | 2M |
| ds9.1 | 3M |
| ds10.1 | 4M |
| ds11.1 | 5M |
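
The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script used to build the files: the function name `build_mixed`, the in-memory lists, the toy records, the concatenation order, and the fixed seed are all hypothetical, and `random.shuffle` stands in for the `shuf` command.

```python
import random


def build_mixed(original, modified, replicates=4, seed=0):
    """Return (unshuffled, shuffled) record lists, analogous to dsx.0 and dsx.1.

    Assumption: modified records come first, followed by `replicates`
    copies of the original records. The source does not state the order.
    """
    unshuffled = list(modified) + list(original) * replicates
    shuffled = unshuffled[:]
    # Stand-in for the Linux `shuf` tool; the seed is only for repeatability.
    random.Random(seed).shuffle(shuffled)
    return unshuffled, shuffled


# Toy stand-ins for ds1o and ds1m, using 10 records instead of 10K.
original = [f"rec-{i}-org" for i in range(10)]
modified = [f"rec-{i}-mod" for i in range(10)]

ds0, ds1 = build_mixed(original, modified)
print(len(ds0), len(ds1))
```

With 10 original records the sketch yields 50 records in each output, which matches the 10K to 50K ratio in the tables.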

Furthermore, each shuffled dataset is split into two halves to serve as input for record linkage algorithms. For example, ds1.1 is split into ds1.1.1 and ds1.1.2.
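
As a rough sketch of the split, the following Python function cuts a list of records at its midpoint. The description does not say how the halves were actually produced, so the positional cut and the name `split_halves` are assumptions made for illustration.

```python
def split_halves(records):
    """Split records into two halves by position.

    Assumption: the split is a simple cut at the midpoint; if the length
    is odd, the second half receives the extra record.
    """
    mid = len(records) // 2
    return records[:mid], records[mid:]


# Toy stand-in for ds1.1 with 50 records instead of 50K.
records = list(range(50))
first_half, second_half = split_halves(records)
print(len(first_half), len(second_half))
```

Because the input is already shuffled, a positional cut like this would spread original and modified records across both halves, which is what a linkage algorithm needs in order to find matches between the two files.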
