Record Linkage Datasets

posted on 2022-04-02, 20:51 authored by Ahmed Soliman

This simulated dataset is a corrupted segment of the Social Security Death Master File (SSDMF), available at https://ssdmf.info/. There are 11 original datasets, named ``dsxo``, where `x` runs from `1` to `11` and the suffix `o` stands for `original`. The sizes (number of original records) of these datasets are as follows:

| dataset | size |
|:-------:|:----:|
| ds1o | 10K |
| ds2o | 20K |
| ds3o | 40K |
| ds4o | 80K |
| ds5o | 120K |
| ds6o | 160K |
| ds7o | 200K |
| ds8o | 400K |
| ds9o | 600K |
| ds10o | 800K |
| ds11o | 1M |

These original records are then corrupted using a modified version of the `dsgen` Python script by Peter Christen.

The modified (corrupted) files are saved as ``dsxm``, where the suffix `m` stands for `modified`.

The modified records and four replicates of the original records are concatenated and then shuffled with the Linux command-line tool `shuf`.

The resulting datasets are named ``dsx.0`` before shuffling and ``dsx.1`` after shuffling.
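The concatenate-and-shuffle step can be sketched in Python as follows. This is only an illustration, not the script used to build the datasets: the original workflow used `shuf`, whereas this sketch swaps in Python's `random.shuffle`. The function name `build_combined`, the concatenation order, the optional `seed` parameter, and the assumption that each file holds one record per line with no header row are all hypothetical.

```python
import random
from pathlib import Path


def build_combined(original_path, modified_path, unshuffled_path,
                   shuffled_path, replicates=4, seed=None):
    """Concatenate modified records with replicated originals, then shuffle.

    Assumes one record per line and no header row (hypothetical layout).
    Returns the number of records written to each output file.
    """
    original = Path(original_path).read_text().splitlines()
    modified = Path(modified_path).read_text().splitlines()

    # Modified records first, then `replicates` copies of the originals.
    # The ordering before shuffling is an assumption.
    combined = modified + original * replicates
    Path(unshuffled_path).write_text("".join(r + "\n" for r in combined))

    # Shuffle a copy so the unshuffled list stays intact.
    shuffled = list(combined)
    random.Random(seed).shuffle(shuffled)
    Path(shuffled_path).write_text("".join(r + "\n" for r in shuffled))

    return len(combined)
```

A call such as `build_combined("ds1o", "ds1m", "ds1.0", "ds1.1")` would then write the unshuffled and shuffled files under the naming scheme described above.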

The sizes of the shuffled datasets are therefore as follows:

| dataset | size |
|:-------:|:----:|
| ds1.1 | 50K |
| ds2.1 | 100K |
| ds3.1 | 200K |
| ds4.1 | 400K |
| ds5.1 | 600K |
| ds6.1 | 800K |
| ds7.1 | 1M |
| ds8.1 | 2M |
| ds9.1 | 3M |
| ds10.1 | 4M |
| ds11.1 | 5M |
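The sizes in this table are five times the original sizes, which is consistent with one set of modified records plus four replicates of the originals. The short check below makes that arithmetic explicit. It assumes that each ``dsxm`` file contains exactly one modified record per original record, which the description implies but does not state; the names `ORIGINAL_SIZES` and `combined_size` are hypothetical.

```python
# Original record counts per dataset index x, taken from the first table.
ORIGINAL_SIZES = {
    1: 10_000, 2: 20_000, 3: 40_000, 4: 80_000, 5: 120_000, 6: 160_000,
    7: 200_000, 8: 400_000, 9: 600_000, 10: 800_000, 11: 1_000_000,
}


def combined_size(n_original, replicates=4):
    """Records in dsx.1: one modified copy plus `replicates` original copies.

    Assumes the modified file has as many records as the original file.
    """
    return n_original * (1 + replicates)


for x, n in ORIGINAL_SIZES.items():
    print(f"ds{x}.1: {combined_size(n):,} records")
```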

Furthermore, each shuffled dataset is split into two halves to serve as input for record linkage algorithms. For example, `ds1.1` is split into `ds1.1.1` and `ds1.1.2`.
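A minimal sketch of such a split is shown below. The description does not say how the halves were produced, so cutting at the midpoint of the line count is an assumption, as are the function name `split_in_half` and the one-record-per-line layout with no header row.

```python
from pathlib import Path


def split_in_half(path):
    """Split a one-record-per-line file into `<name>.1` and `<name>.2`.

    The first half receives the first len // 2 records (an assumed rule);
    the second half receives the rest. Returns the two output paths.
    """
    source = Path(path)
    records = source.read_text().splitlines()
    mid = len(records) // 2

    first_path = source.with_name(source.name + ".1")
    second_path = source.with_name(source.name + ".2")

    first_path.write_text("".join(r + "\n" for r in records[:mid]))
    second_path.write_text("".join(r + "\n" for r in records[mid:]))
    return first_path, second_path
```

For example, `split_in_half("ds1.1")` would write `ds1.1.1` and `ds1.1.2` next to the input file. Because the input is already shuffled, a plain midpoint cut should spread modified and original records across both halves.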