Generating synthetic population data from multiple raw data
sources is a fundamental step for many data science tasks with a wide
range of applications. However, despite the presence of a number of approaches
such as iterative proportional fitting (IPF) and combinatorial
optimization (CO), an efficient and standard framework for handling this
type of problem is absent. In this study, we propose a multi-stage framework
called SynC (short for Synthetic Population via Gaussian Copula)
to fill the gap. SynC first removes potential outliers in the data and then
fits the filtered data with a Gaussian copula model to correctly capture
dependencies and marginal distributions of sampled survey data. Finally,
SynC leverages neural networks to merge datasets into one. Our
key contributions include: 1) proposing a novel framework for generating
individual-level data from aggregated data sources by combining state-of-the-art
machine learning and statistical techniques, 2) designing a metric for
validating the accuracy of generated data when the ground truth is hard
to obtain, 3) releasing an easy-to-use framework implementation for reproducibility
and demonstrating its effectiveness with the Canada National
Census data, and 4) presenting two real-world use cases where datasets of
this nature can be leveraged by businesses.
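
To make the copula stage concrete, the following is a minimal Python sketch of fitting a Gaussian copula to a sample and drawing synthetic records from it. It is an illustration under stated assumptions, not the SynC implementation: the function names (fit_gaussian_copula, sample_gaussian_copula) and the toy two-column dataset are hypothetical, marginals are approximated by empirical quantiles, and the outlier-removal and neural-network merging stages are omitted.

```python
import numpy as np
from scipy import stats


def fit_gaussian_copula(data):
    """Estimate the copula correlation matrix from an (n, d) sample."""
    n = data.shape[0]
    # Rank-transform each column to pseudo-observations strictly inside (0, 1).
    u = stats.rankdata(data, axis=0) / (n + 1)
    # Map the pseudo-observations to normal scores and correlate them.
    z = stats.norm.ppf(u)
    return np.corrcoef(z, rowvar=False)


def sample_gaussian_copula(data, corr, size, rng):
    """Draw synthetic rows that reuse the empirical marginals of `data`."""
    d = data.shape[1]
    # Sample correlated standard normals, then convert them to uniforms.
    z = rng.multivariate_normal(np.zeros(d), corr, size=size)
    u = stats.norm.cdf(z)
    # Push each uniform column through that column's empirical quantile function.
    columns = [np.quantile(data[:, j], u[:, j]) for j in range(d)]
    return np.column_stack(columns)


# Toy data (an assumption for illustration): two dependent, non-normal columns.
rng = np.random.default_rng(0)
latent = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1.0]], size=2000)
original = np.column_stack([np.exp(latent[:, 0]), latent[:, 1] ** 3])

corr = fit_gaussian_copula(original)
synthetic = sample_gaussian_copula(original, corr, size=2000, rng=rng)
```

Because the dependence is estimated on normal scores and the marginals are restored through quantiles, the synthetic columns stay within the observed range of each original column while approximately preserving the rank correlation between them.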