figshare
4 files

SynC Data Sets

dataset
posted on 2019-04-02, 15:28 authored by Zheng Li
Generating synthetic population data from multiple raw data sources is a fundamental step for many data science tasks with a wide range of applications. However, despite the presence of a number of approaches such as iterative proportional fitting (IPF) and combinatorial optimization (CO), an efficient and standard framework for handling this type of problem is absent. In this study, we propose a multi-stage framework called SynC (short for Synthetic Population via Gaussian Copula) to fill the gap. SynC first removes potential outliers in the data and then fits the filtered data with a Gaussian copula model to correctly capture the dependencies and marginal distributions of sampled survey data. Finally, SynC leverages neural networks to merge datasets into one. Our key contributions are to: 1) propose a novel framework for generating individual-level data from aggregated data sources by combining state-of-the-art machine learning and statistical techniques, 2) design a metric for validating the accuracy of generated data when the ground truth is hard to obtain, 3) release an easy-to-use framework implementation for reproducibility and demonstrate its effectiveness with the Canada National Census data, and 4) present two real-world use cases where datasets of this nature can be leveraged by businesses.
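
The Gaussian copula step mentioned in the description can be illustrated with a minimal, standard-library-only sketch. This is not the SynC implementation: the function names (`to_normal_scores`, `sample_gaussian_copula`, and so on) are hypothetical, only two variables are handled, the marginals are assumed to be the empirical distributions of the input samples, and the outlier-removal and neural-network merging stages are omitted entirely.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for cdf / inverse cdf


def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / (va * vb) ** 0.5


def to_normal_scores(values):
    """Map each value to a standard normal score via its empirical rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # rank / (n + 1) stays strictly inside (0, 1), so inv_cdf is defined
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def empirical_quantile(sorted_values, u):
    """Empirical quantile of a sorted sample at probability u in [0, 1]."""
    idx = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]


def sample_gaussian_copula(x, y, n_samples, seed=0):
    """Draw synthetic (x, y) pairs that keep each empirical marginal and
    the dependence between x and y, modelled by a Gaussian copula."""
    rng = random.Random(seed)
    # Dependence: correlation of the normal scores of the observed data.
    rho = pearson(to_normal_scores(x), to_normal_scores(y))
    residual_sd = max(0.0, 1.0 - rho * rho) ** 0.5
    xs, ys = sorted(x), sorted(y)
    samples = []
    for _ in range(n_samples):
        # Correlated standard normals (2-D Cholesky factor written out).
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + residual_sd * rng.gauss(0.0, 1.0)
        # Normal cdf gives uniforms; empirical quantiles restore marginals.
        samples.append((empirical_quantile(xs, STD_NORMAL.cdf(z1)),
                        empirical_quantile(ys, STD_NORMAL.cdf(z2))))
    return samples
```

Calling `sample_gaussian_copula(x, y, 1000)` on two observed columns returns 1000 synthetic pairs whose values are drawn from the observed marginals. A fuller treatment would handle more than two variables with a full correlation matrix and could fit parametric marginals instead of empirical ones.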
