Tailoring Data Source Distributions for Fairness-aware Data Integration

Data scientists often develop data sets for analysis by drawing upon sources of data available to them. A major challenge is to ensure that the data set used for analysis has an appropriate representation of relevant (demographic) groups: it meets desired distribution requirements. Whether data is collected through some experiment or obtained from some data provider, the data from any single source may not meet the desired distribution requirements. Therefore, a union of data from multiple sources is often required. In this paper, we study how to acquire such data in the most cost effective manner, for typical cost functions observed in practice. We present an optimal solution for binary groups when the underlying distributions of data sources are known and all data sources have equal costs. For the generic case with unequal costs, we design an approximation algorithm that performs well in practice. When the underlying distributions are unknown, we develop an exploration-exploitation based strategy with a reward function that captures the cost and approximations of group distributions in each data source. Besides theoretical analysis, we conduct comprehensive experiments that confirm the effectiveness of our algorithms.


INTRODUCTION
The standard assumption in machine learning is that we have, at hand, a training data set that is a representative sample of the data that will be seen in production. This assumption is easily satisfied if the training data can be obtained by randomly sampling from the "full" data set in production. However, such random sampling is frequently not possible. Often, this is because production data has not yet been generated at the time the model is trained. At other times, the entire point may be to repurpose and reuse data collected for other purposes. Insufficiently representative training data has resulted in many data science debacles [27,50,60,71].
Even when the distribution is accurately characterized, it may not be easy to obtain training data from the same distribution. For example, surveys may be sent out to a carefully chosen random sample, but only a fraction of surveys are returned, with the return rate not being completely random. Survey statistics has developed sophisticated techniques to handle such lack of randomness [35]. Similar issues arise when analyzing online comments or tweets to gauge popular opinion. We wish that the opinions expressed be representative of the target population of interest (e.g., all voters or all customers), but we know that we only have a skewed sample with the most vocal individuals, potentially skewing toward young and more tech-savvy individuals.
Beyond the need for representation to reduce model error, it may sometimes be important to show adequate consideration of minority groups. Even where representative samples can be obtained for training data, that still may not be sufficient in some circumstances. To ensure that minority entities are adequately considered, we may need to train with data in which small minorities are intentionally over-represented [21,24]. Similarly, when we are interested in characterizing rare events, we may need training data that has rare events over-represented. For example, to learn how to handle emergencies, we need car-driving data with accidents and near-accidents over-represented: representative driving data may involve few challenging scenarios [55]. To summarize, data scientists often have distribution requirements on data sets they wish to use for training or analysis.
To see how to meet these requirements, we now turn to where the data come from. Sometimes, the data may explicitly be collected by the data scientist for the analysis at hand, using surveys, sensors, or other data collection means. Alternatively, data scientists could rely on secondary data instead: using data that have been collected previously for some other purpose. The number and variety of data sources available has been increasing rapidly, making secondary data analysis much more attractive. In fact, the data scientist on many occasions may be spoiled for choice. Since each data source is collected in some manner over some population, it will have its own distribution, which may differ from the distribution desired by the data scientist. The question to ask then is whether data from multiple sources can be mixed to achieve the desired distribution. This is the central problem we study in this paper.
Example 1: A data science company has been asked to build an ML model for a local bank in Texas that wants to offer a loan to employees with yearly income of more than $75K. The model should predict the likelihood that an individual will pay back the loan. The company considers building a model on an in-house data set. Being aware of recent incidents of racial/gender biases in similar predictive tools [20], the company wants to make sure different demographic groups are suitably considered. It turns out, however, that the data set is skewed: while around 40% of samples are white male, only 15% are non-white female. The company realizes there are alternative external data sources (such as TexasTribune) they could consider for collecting the data. It establishes a target distribution on counts from different demographic groups (e.g., 25% from each demographic group in a data set of 1K samples). The challenge the company faces is how to efficiently query these data sources to collect the data. In § 5.2, we report on an experiment using real data based on this example, to confirm the effectiveness of our solutions. □
Obtaining data from a data source is not free. An increasingly common situation where the costs are explicit is when data are purchased from a commercial data provider [1-3, 5, 67]. Even for primary data collection there is a cost per tuple, in terms of access, storage, indexing, and so on. In all cases, we can characterize the cost of obtaining data from any source in a pricing model. Given a set of these data sources, each with its own distribution and pricing model, our goal is to obtain, at least cost, an aggregate data set that satisfies our distribution requirements. This problem is difficult to solve in general because each source has its own distribution, and none may have the distribution that we seek. Furthermore, no combination of sources may provide us with the desired distribution either. In general, we may have to over-purchase and then "throw away" excess data items. And even so, we cannot be guaranteed it is feasible to obtain the desired distribution.
In summary, our contributions in this paper are the following:
• We introduce the problem of Data distribution Tailoring (DT). To our knowledge, we are the first to propose this problem. (§ 2)
• When the distributions of sources are known, we propose a dynamic programming algorithm with minimum expected cost. Being pseudo-polynomial, this algorithm is not practical. Therefore, we design the optimal algorithm for binary groups and equi-cost sources, and an approximation algorithm based on the coupon collector's problem for the generic case. (§ 3)
• When the distributions are unknown, we model the problem as a multi-armed bandit. Designing a proper reward function, we explore three strategies based on exploration-only, exploitation-only, and upper confidence bound. (§ 4)
• In addition to theoretical analysis, we conduct comprehensive experiments on real and synthetic data sources to validate and evaluate the performance of the proposed algorithms. (§ 5)

PROBLEM DEFINITION
Query Model: Our goal is to enable integrating data from multiple sources to construct a target data set. A user query describes a target data set with a target schema, consisting of a collection of attributes. For example, the user may be interested in collecting a data set of movie casts with columns {movie_title, actor_name, gender, ...}. The query specifies a distribution over some "demographic groups", or simply groups. We assume a target schema has some "sensitive attributes", such as race and gender, that identify the groups as the intersection of domain values. For example, {white-male, white-female, black-male, black-female} can be the groups defined as the intersection of race and gender. We use $\{G_1, \ldots, G_m\}$ to denote the set of groups. The user's query includes a count description $Q = \{Q_1, \ldots, Q_m\}$ on $\{G_1, \ldots, G_m\}$, where $Q_j$ is the number of tuples required from group $G_j$. A requirement could alternatively be stated as ratios over the groups. We note, however, that when a target data set is collected, there is always a size objective: a data set comprising just a handful of tuples could satisfy such ratio constraints but be completely useless for training. Once we add an overall count requirement to a ratio requirement, this becomes equivalent to the count requirement formulation. Many variants of requirements can be posed, depending on the desired application. We discuss several of these in § 7.
Data Model: The input of DT is a collection of sources $L = \{D_1, \ldots, D_n\}$. We assume each source has the same schema as the user's target schema. Therefore, each tuple in a source can be associated with a group by inspecting its sensitive attributes. Table 1 lists the notation used in this paper. Data sources can be external, accessible through limited interfaces or APIs, or data views that are the outcome of the discovery and integration over underlying data sets. For example, a source can be defined by a project-join query over a database or a data lake. Similarly, web services such as Google Flights API [6], data markets such as Dawex [1], Xignite [3], and WorldQuant [2], as well as data brokers [5,67] are examples of external sources. Sometimes obtaining a source with the same schema as the target schema requires data integration using a project-join query over data sets that contain some attributes of the query. Continuing with the movie cast example, using the IMDB database [8], the query $\Pi_{title, gender, \cdots}(title \bowtie cast\_info \bowtie name)$ provides a data source. Of course, since the target schema is user-specific, and given the potentially large size of data sets, computing and materializing the full join for all sources is not efficient. Instead of an offline join, existing work proposes ways for obtaining independent and/or uniformly distributed random tuples from the result of a join without executing the join [45,47,79]. To abstract the access model, we assume tuple-at-a-time access to a source. This assumption is aligned with external data sources, such as web databases, where a limited interface is often enforced that returns a subset of top-$k$ results per query [16,17,48,66]. While for concreteness and simplicity, in the bulk of the paper we assume exactly one tuple is returned per query, in § 7 we discuss how our algorithms can be adjusted to relax this assumption.
Cost Model: Obtaining samples from different data sources is not free. Acquiring samples is associated with a cost, either monetary or in the form of computation, memory access, or network access cost. Web database APIs (such as Google Flights), for example, allow a limited number of free queries per day from each IP address or charge per query while enforcing a top-$k$ interface [16,17,48,66]. Similarly, relying on data brokers may incur monetary costs [1-3, 5, 67]. For internal data sources, as explained in the data model, we may need to apply costly pre-processing steps and online join operations in order to discover a sample. Furthermore, such costs may vary from one source to another, depending on factors such as the length of join paths, their joinability, statistics of data sets, and matching cost. To generalize across different contexts, we use $C_i$ as the cost of sampling from source $D_i$. For the cases where each query returns more than one sample or even the whole source, we can amortize the cost across the number of samples.
Data distribution Tailoring (DT) Problem: Given a collection $L$ of data sources with the query model described above, our goal is to enable building a target data set with the group count distribution specified by the user. That is, given a count description $Q = \{Q_1, \ldots, Q_m\}$ on groups $\{G_1, \ldots, G_m\}$, we would like to query different data sources in $L$, in a sequential manner, in order to collect samples that fulfill the input count description, while the expected total query cost is minimized. Depending on our knowledge about the data source distributions, two problem versions can be defined for DT. The first problem assumes the availability of group distributions. That is, we know the data source size and the total number of tuples belonging to each group in each data source. Our task is to select a data source to query each time based upon the set of tuples we have already acquired. In many application settings, we may not know much about the data sources. In particular, we may not know the count aggregates for different groups. This gives rise to the second problem, with the same objective as the first problem, but now without any starting knowledge of data distributions in the sources being considered. Solving this problem requires us to learn group distributions for each data source as we go along.
Table 1: Notation (excerpt).
$D_{*j}$: the data source with minimum expected cost of collecting an item of $G_j$ at the current iteration
$p_j$: the overall frequency of $G_j$ in all data sources
$t$: total number of samples taken so far

KNOWN DISTRIBUTION MODEL
In this section, we consider the DT problem for cases where we know the group distributions in each data source.

Dynamic Programming
Given the count description $Q = \{Q_1, \cdots, Q_m\}$, our objective is to find the optimal strategy with the minimum expected cost $F(Q)$. The process of collecting the target data set is a sequence of iterative steps, where at every step, the algorithm chooses a data source, queries it, and if the obtained tuple contributes to one of the groups for which the count requirement is not yet fulfilled, it is kept; otherwise it is discarded. Our first attempt is to develop a dynamic programming (DP) solution.
An optimal source at each iteration minimizes the sum of its sampling cost plus the expected cost of collecting the remaining required groups ($F_j(Q)$), based on its sampling outcome. The dynamic programming analysis evaluates this cost recursively by considering all future sampling outcomes and selecting the optimal source in each iteration accordingly. Using the probabilities of discovering a fresh tuple from each group for every data source $D_i$, the optimal source is defined as follows.
$$D_{*} = \arg\min_{D_i \in L} \Big( C_i + \sum_{j=1, Q_j>0}^{m} P_i^j F_j(Q) + \Big(1 - \sum_{j=1, Q_j>0}^{m} P_i^j\Big) F(Q) \Big) \quad (1)$$
Let $P_i^j$ be the ratio of tuples from group $G_j$ in source $D_i$. To simplify the notation, we have introduced $F_j(Q) = F(Q_1, \ldots, Q_j - 1, \ldots, Q_m)$. If a sample of $G_j$ is added to the target (because it is fresh and belongs to a group whose count requirement is not fulfilled), the remaining cost for building the target is $F_j(Q)$. Therefore, the term $\sum_{j=1, Q_j>0}^{m} P_i^j F_j(Q)$ is the expected cost of the target if we add the current sample to the target. The probability of a sample being discarded is $\big(1 - \sum_{j=1, Q_j>0}^{m} P_i^j\big)$, and in this case we will have to pay the cost $F(Q)$.
In our DP algorithm, we assume data sets are big enough, that is, the probability of discovering a fresh tuple from a $D_i$ does not change over different iterations. We relax this assumption in subsequent sections for our practical algorithms. Following Equation 1 and solving for $F(Q)$, the recursive cost formula is computed as follows, with $F(0, \ldots, 0) = 0$.
$$F(Q) = \min_{D_i \in L} \frac{C_i + \sum_{j=1, Q_j>0}^{m} P_i^j F_j(Q)}{\sum_{j=1, Q_j>0}^{m} P_i^j} \quad (2)$$
The DP algorithm follows a cube-filling approach, where every cell $[q_1, q_2, \cdots, q_m]$ of the (hyper-)cube contains the value of $F(q_1, q_2, \cdots, q_m)$ and a direction that shows which data source to select next. Based on Equation 2, to compute a cell in the cube, we only need the values of cells with the same index in all dimensions except one dimension $j \in [1, m]$, for which the value is $q_j - 1$. This can be accomplished by sweeping a diagonal plane over the cube (starting from the origin), only maintaining the values on the plane. Following this strategy to fill the cube, the DP algorithm has a pseudo-polynomial time complexity (assuming that $m$ is a small constant), since the number of cells grows with the product of the count requirements. Similarly, the space complexity of the algorithm is pseudo-polynomial.
Example 2: Consider sources $D_1$ and $D_2$ and groups $G_1$ and $G_2$. Furthermore, consider the following statistics for the sources.
Source | Size | Cost | $P_i^1$ | $P_i^2$
$D_1$ | 1000 | … | 0.2 | 0.8
$D_2$ | 1000 | 3 | 0.4 | 0.6
We would like to collect one tuple from each group, i.e., $Q = \{1, 1\}$. Starting from $F(0, 0)$, the DP algorithm sweeps a diagonal line from top-left to bottom-right, in order to compute $F(1, 1)$.
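The recursion can be illustrated with a short memoized sketch. It assumes the recurrence $F(Q) = \min_i \big(C_i + \sum_{j: Q_j>0} P_i^j F_j(Q)\big) / \sum_{j: Q_j>0} P_i^j$ with $F(0, \ldots, 0) = 0$ and constant ratios (large sources). The names `COSTS`, `RATIOS`, and `expected_cost` are hypothetical, and the cost of the first source is an assumed value chosen only for illustration.

```python
from functools import lru_cache

# Hypothetical inputs: per-source query cost and group ratios RATIOS[i][j].
# The cost of the first source is an assumed value, not taken from the text.
COSTS = (1.0, 3.0)
RATIOS = ((0.2, 0.8), (0.4, 0.6))


@lru_cache(maxsize=None)
def expected_cost(remaining):
    """Return (minimum expected cost, index of the source to query next).

    `remaining` is a tuple with the number of tuples still needed per group.
    """
    if all(q == 0 for q in remaining):
        return 0.0, None
    best = (float("inf"), None)
    for i, (cost, ratios) in enumerate(zip(COSTS, RATIOS)):
        # Probability that one query returns a tuple of a still-needed group.
        useful = sum(p for p, q in zip(ratios, remaining) if q > 0)
        if useful == 0:
            continue
        total = cost
        for j, (p, q) in enumerate(zip(ratios, remaining)):
            if q > 0:
                nxt = remaining[:j] + (q - 1,) + remaining[j + 1:]
                total += p * expected_cost(nxt)[0]
        # Solve F = total + (1 - useful) * F for F.
        candidate = total / useful
        if candidate < best[0]:
            best = (candidate, i)
    return best
```

Calling `expected_cost((1, 1))` returns the expected cost of collecting one tuple per group together with the index of the source to query first. The memo table plays the role of the cube, so its size grows with the product of the count requirements.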

Equi-cost Binary DT
The dynamic programming algorithm proposed in the previous section has a computation and memory cost that is pseudo-polynomial. It quickly becomes intractable for cases where count requirements are not small. In this section, we devise a better solution for an important special case: equi-cost binary DT. Fairness issues often involve exactly two demographic groups (such as male/female, black/white, or minority/majority). As a result, much of the existing work on fairness focuses on such cases [29,38,76]. Furthermore, the cost of querying every data source is roughly the same in many scenarios. This motivates us to give special treatment to the design of an algorithm that guarantees minimum expected query cost for equi-cost binary DT.
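Similar to § 3.1, we view the process of collecting the target data as a sequence of iterations where, at every iteration $\ell$, we should select a data source to query. We use the notation $F(Q_1, Q_2)$ to refer to the optimal expected cost for collecting $Q_1$ tuples of group $G_1$ and $Q_2$ tuples of $G_2$. Suppose that, at iteration $\ell$, $O$ is the data collected so far, in which $O_{i,\ell}^j$ is the number of unique samples of $G_j$ from $D_i$, i.e., $O_{i,\ell}^j = |\{t \in D_i \mid t \in O \text{ and } t \in G_j\}|$. For every group $G_j$ ($j \in \{1, 2\}$), let $D_{*j,\ell}$ be the data source with the maximum ratio of undiscovered tuples for $G_j$. That is,
$$D_{*j,\ell} = \arg\max_{D_i \in L} \frac{N_i^j - O_{i,\ell}^j}{N_i}$$
where $N_i$ is the size of $D_i$ and $N_i^j$ is the number of tuples of $G_j$ in $D_i$. Suppose $D_{*1,\ell} = D_i$ and the maximum probability for obtaining a tuple from $G_1$ at iteration $\ell$ is $P_{*1,\ell} = (N_i^1 - O_{i,\ell}^1)/N_i$. Hence, writing $C$ for the common query cost, the optimal expected cost for collecting one tuple from $G_1$ is as follows ($F(0, 1)$ can be similarly computed).
$$F(1, 0) = \frac{C}{P_{*1,\ell}}$$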
Now, consider a non-marginal case where $Q_1 \neq 0$ and $Q_2 \neq 0$. To simplify the explanation, let us assume that at iteration $\ell$, $G_1$ is the minority and $G_2$ is the majority, i.e., $P_{*1,\ell} \leq P_{*2,\ell}$. The following theorem is the key for designing the optimal solution.
Theorem 1. Consider the DT problem under the availability of group distributions where there are two groups and the costs for querying data sources are equal. Let $G_1$ be the minority at iteration $\ell$, i.e., $P_{*1,\ell} \leq P_{*2,\ell}$. Selecting $D_{*1,\ell}$ to query at iteration $\ell$ is optimal.
Proof: We provide the proof by contradiction. Let $D_i = D_{*1,\ell}$. Suppose algorithm $A_1$, which selects $D_i$ at iteration $\ell$, is not optimal. Suppose the optimal algorithm, $A_2$, selects $D_{i'} \neq D_i$ at iteration $\ell$. We show that the expected cost of $A_1$ cannot exceed that of $A_2$, which contradicts the assumption that $A_1$ is not optimal. Subtracting the expected cost of one algorithm from that of the other establishes this inequality. Since the expected cost of $A_1$ cannot exceed that of $A_2$, selecting $D_i = D_{*1,\ell}$ to query at iteration $\ell$ is an optimal solution. □
Algorithm 1: equi-cost binary DT (recoverable steps: $t \leftarrow$ Query($D_i$); $j \leftarrow G(t)$, the group of $t$).
Algorithm 1 shows the pseudocode of our optimal algorithm for the equi-cost binary groups. At each iteration, the algorithm finds the corresponding data sources for $G_1$ and $G_2$. Then, depending on which group is in the minority, it queries the proper data source. The algorithm stops when the count requirements of both groups are satisfied and then returns the target data set $O$.
Example 2 (Part 2): To see a concrete run of Algorithm 1 on a toy example, let us continue with Example 2, while assuming that the costs to query the two data sources are both equal to one. Using the ratios provided in Example 2, $N_1^1 = 200$, $N_1^2 = 800$, $N_2^1 = 400$, and $N_2^2 = 600$. Note that since we consider the equi-cost assumption, the optimal solution is different from the one provided for DP. Given that $N_1^1/N_1 < N_2^1/N_2$, $D_i = D_2$ and $P_1 = 0.4$ (Lines 5 and 6), i.e., $D_2$ is the optimal data source for $G_1$. Similarly, $D_{i'} = D_1$ and $P_2 = 0.8$. Since $G_1$ is the minority, the algorithm queries $D_2$ in Line 11. Suppose the query returns the tuple $t_1$ from the group $G_2$. It is then added to the output $O$. We still need to collect one tuple from $G_1$. The algorithm, hence, queries $D_2$ again. Suppose the returned tuple $t_2$ also belongs to $G_2$. Since $Q_2 = 0$, tuple $t_2$ gets discarded and the algorithm queries $D_2$ again. Suppose $t_3$ belongs to $G_1$; the algorithm adds $t_3$ to $O$ and returns the result. □
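The selection rule of Theorem 1 could be simulated along the following lines. This is a minimal sketch under stated assumptions, not the paper's listing: sources are in-memory lists of `(tuple_id, group)` pairs, they are disjoint, every query returns one uniformly random tuple, and all queries cost the same. The function name `binary_dt` is hypothetical.

```python
import random


def binary_dt(sources, need, rng):
    """Collect need[g] distinct tuples of each group g in {0, 1}.

    sources: list of lists of (tuple_id, group); assumed disjoint.
    Returns (kept tuples as (source, tuple_id, group), number of queries).
    """
    need = list(need)
    totals = [[sum(1 for _, g in src if g == grp) for grp in (0, 1)]
              for src in sources]
    found = [[0, 0] for _ in sources]      # kept tuples per source and group
    kept, kept_ids, queries = [], set(), 0

    def fresh_ratio(i, g):
        # Probability that one query to source i returns a new tuple of g.
        return (totals[i][g] - found[i][g]) / len(sources[i])

    while any(q > 0 for q in need):
        # Best source for every group that still needs tuples.
        best = {g: max(range(len(sources)), key=lambda i: fresh_ratio(i, g))
                for g in (0, 1) if need[g] > 0}
        # The minority is the group whose best source has the smaller ratio.
        minority = min(best, key=lambda g: fresh_ratio(best[g], g))
        i = best[minority]
        if fresh_ratio(i, minority) == 0:
            raise ValueError("count requirement cannot be satisfied")
        tid, grp = rng.choice(sources[i])  # one uniform tuple per query
        queries += 1
        # Keep the tuple only if it is new and its group is still needed.
        if need[grp] > 0 and (i, tid) not in kept_ids:
            kept.append((i, tid, grp))
            kept_ids.add((i, tid))
            found[i][grp] += 1
            need[grp] -= 1
    return kept, queries
```

When only one group remains unfulfilled, the sketch falls back to the best source for that group, which corresponds to the marginal cases $F(1, 0)$ and $F(0, 1)$.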

General DT
As an alternative to the DP solution, in this section, we provide an approximation algorithm for the general non-binary case. In particular, we note that the optimal solution for the binary case decides the data source to query only based on one group (the minority group). This can be viewed as the algorithm focusing on collecting data for one group. We extend this strategy by modeling the problem as $m$ instances of the coupon collector's problem [49], where the $j$-th instance aims to collect samples from the group $G_j$. We also use the union bound [49] to come up with an upper bound on the expected cost of this algorithm.
For every group $G_j$, the algorithm first identifies the data source $D_{*j}$, the most cost-effective data source for $G_j$. That is,
$$D_{*j} = \arg\min_{D_i \in L} \frac{C_i \, N_i}{N_i^j} \quad (4)$$
The algorithm then starts collecting tuples of different groups by querying the data source $D_{*j}$ for each group $G_j$. While collecting tuples for each group, the algorithm also keeps the tuples of other groups. The algorithm queries the corresponding data sources for different groups until the count requirements of the target are satisfied. Theorem 2 provides an upper bound for the expected cost of this algorithm, which also serves as an upper bound for the expected cost of the problem.
Theorem 2. Assuming that each data source $D_{*j}$ in Equation 4 contains at least $Q_j$ samples from $G_j$, the expected cost of DT (under the availability of group distributions) modeled by $m$ coupon collector's instances, each targeting to collect one group, is at most
$$\sum_{j=1}^{m} C_{*j} \, N_{*j} \ln \frac{N_{*j}^j}{N_{*j}^j - Q_j} \quad (5)$$
where $C_{*j}$ and $N_{*j}$ are the query cost and the size of $D_{*j}$, and $N_{*j}^j$ is the number of tuples of $G_j$ in $D_{*j}$.
Proof: Let $X_j$ be the number of queries the algorithm would issue to collect $Q_j$ unique tuples from $G_j$. We note that the queries issued to discover the tuples from a group $G_j$ may also discover some tuples from other groups. As a result, the sets of queries for different groups may intersect. The union bound [49] indicates that the probability of the union of events is no more than the sum of their probabilities. In DT, the cost of collecting the required tuples of all groups is bounded by the sum of the costs of collecting the tuples of each group. This is because while sampling sources to collect the next tuple of a particular group, DT keeps the useful tuples of other groups. Using this principle, the expected cost of queries issued by the algorithm, $\Psi$, is bounded by
$$\Psi \leq \sum_{j=1}^{m} C_{*j} \, E[X_j]$$
For the group $G_j$, the algorithm queries the data source $D_{*j}$. Let $h_j[k]$ be the number of queries issued to collect the $k$-th tuple of group $G_j$. For example, $h_j[1]$ is the number of queries the algorithm issues until the first tuple from $G_j$ is discovered, so that $X_j = \sum_{k=1}^{Q_j} h_j[k]$. The number of queries issued at every epoch, $h_j[k]$, is computed as follows.
Consider a query that is issued for group $G_j$ to $D_{*j}$ during the $k$-th epoch. Let $P_{*j,k}$ be the probability that such a query is successful, i.e., it discovers a new tuple from $G_j$. The algorithm has so far discovered $(k - 1)$ tuples and there are $(N_{*j}^j - k + 1)$ undiscovered tuples from $G_j$ at $D_{*j}$. Therefore,
$$P_{*j,k} = \frac{N_{*j}^j - k + 1}{N_{*j}}$$
The geometric distribution models the number of trials needed to obtain the first success in a series of Bernoulli trials. When the probability of discovering a fresh tuple of group $G_j$ is $P_{*j,k}$, following the geometric distribution, we have $E[h_j[k]] = 1/P_{*j,k}$. As a result,
$$E[X_j] = \sum_{k=1}^{Q_j} \frac{N_{*j}}{N_{*j}^j - k + 1} \leq N_{*j} \ln \frac{N_{*j}^j}{N_{*j}^j - Q_j}$$
For example, consider collecting 100 samples from a group $G_j$ that forms 20% of its data source, with unit query cost. In case (a), where the data source contains 1K tuples, the bound is $1000 \ln(200/100) \simeq 693$, i.e., the expected cost to collect 100 samples from $G_j$ is bounded by 693 queries. This number drops to 500.1 queries for case (b), where the data source size is 1M. Note that, using the 20% ratio, 500 is the expected number of queries without considering the duplicates. In case (a), where the data source is small, the chance of discovering duplicate samples is higher, which results in around 693 - 500 = 193 more queries to collect the 100 samples needed. In case (b), however, the chance of finding duplicates is negligible. □

The Approximation Algorithm
So far, we did not consider any ordering of which group to target first. In our analysis, the $m$ instances of the coupon collector's problem are executed independently. One message from the optimal solution for the binary case is to first collect data from the minority groups. Note that the chance of collecting data from other groups while collecting data for minorities is higher than the chance of finding minorities while targeting other groups. Following this logic, for the sequential algorithm, we apply a practical improvement over the algorithm by collecting data for minorities first.
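The figures quoted in the worked example can be checked numerically. The sketch below assumes unit query cost and that each query returns one uniformly random tuple; it compares the exact sum of geometric expectations with the logarithmic closed form. The function names are hypothetical, and the source size of 1,000 for case (a) is the value implied by the quoted bound of 693.

```python
import math


def expected_queries(source_size, group_size, needed):
    """Exact expected number of queries to find `needed` distinct tuples of
    one group: the sum of geometric expectations N / (N_j - k + 1)."""
    return sum(source_size / (group_size - k + 1)
               for k in range(1, needed + 1))


def log_bound(source_size, group_size, needed):
    """Closed-form upper bound N * ln(N_j / (N_j - Q)); requires N_j > Q."""
    return source_size * math.log(group_size / (group_size - needed))


if __name__ == "__main__":
    # Case (a): small source, 20% of tuples in the group.
    print(round(log_bound(1_000, 200, 100), 1))
    # Case (b): large source, same ratio.
    print(round(log_bound(1_000_000, 200_000, 100), 1))
```

The exact sum is slightly below the closed form in both cases, which is consistent with the closed form being an upper bound.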

Algorithm 2 CoupColl
for $j = 1$ to $m$ do
  if $Q_j == 0$ then continue
  ...
The pseudocode of the algorithm is provided in Algorithm 2. This algorithm first identifies the minority group, i.e., the group for which the most cost-effective data source requires the maximum expected cost. Hence, the algorithm chooses the group that provides the maximum piggybacking opportunity per unit cost for other groups. This strategy reduces to the optimal strategy for equi-cost binary DT. At iteration $\ell$, the data source with minimum expected cost for collecting a sample from group $G_j$ is
$$D_{*j} = \arg\min_{D_i \in L} \frac{C_i \, N_i}{N_i^j - O_{i,\ell}^j}$$
After identifying the minority group, the algorithm queries its corresponding data source and updates the target data accordingly.
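The minority-first strategy described above might be sketched as follows. This is an illustrative simulation under stated assumptions rather than the paper's Algorithm 2: sources are disjoint in-memory lists of `(tuple_id, group)` pairs, each query returns one uniformly random tuple, and the per-source expected cost of a fresh tuple is taken to be $C_i N_i / (N_i^j - O_i^j)$. The name `coup_coll` and its signature are hypothetical.

```python
import random


def coup_coll(sources, costs, need, rng):
    """Collect need[j] distinct tuples per group j, minority group first.

    sources: list of lists of (tuple_id, group); assumed disjoint.
    Returns (kept tuples as (source, tuple_id, group), total cost spent).
    """
    need = list(need)
    m = len(need)
    totals = [[sum(1 for _, g in src if g == j) for j in range(m)]
              for src in sources]
    found = [[0] * m for _ in sources]     # kept tuples per source and group
    kept, kept_ids, spent = [], set(), 0.0

    def unit_cost(i, j):
        # Expected cost of one fresh tuple of group j from source i.
        fresh = totals[i][j] - found[i][j]
        return costs[i] * len(sources[i]) / fresh if fresh else float("inf")

    while any(q > 0 for q in need):
        # Most cost-effective source for each unfulfilled group.
        best = {j: min(range(len(sources)), key=lambda i: unit_cost(i, j))
                for j in range(m) if need[j] > 0}
        # Minority: the group whose best source is still the most expensive.
        minority = max(best, key=lambda j: unit_cost(best[j], j))
        i = best[minority]
        if unit_cost(i, minority) == float("inf"):
            raise ValueError("count requirement cannot be satisfied")
        tid, grp = rng.choice(sources[i])  # one uniform tuple per query
        spent += costs[i]
        # Tuples of other unfulfilled groups are kept as well (piggybacking).
        if need[grp] > 0 and (i, tid) not in kept_ids:
            kept.append((i, tid, grp))
            kept_ids.add((i, tid))
            found[i][grp] += 1
            need[grp] -= 1
    return kept, spent
```

With two groups and equal costs, the group with the largest unit cost is the one with the smallest best ratio, so the rule coincides with the minority-first choice of the equi-cost binary case.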

UNKNOWN DISTRIBUTION MODEL
In this section, we study the DT problem when we do not know the distributions of groups in each data source. A naive solution is to first issue "enough" random queries to each of the data sources and estimate the distributions. Then, knowing these distributions, we can use the techniques proposed in § 3. However, this solution can spend too much of the limited query budget on estimating the distributions, especially when there are many data sources or only a small result data set is desired. Therefore, we seek to collect data directly, without first discovering the distributions. To do so, we model the DT problem in the unknown distribution case as a (multi-armed) bandit problem [4,41].

Modeling as Multi-Armed Bandit
Multi-armed bandit refers to a general class of sequential problems with an exploration and exploitation trade-off. Formally, a stochastic bandit problem is defined as follows. Consider a set of $n$ resources (arms), where each arm $\Gamma_i$ is associated with an unknown probability distribution $\nu_i$ with mean $\mu_i$. In a sequential setting with $T$ iterations, an agent needs to take an action by selecting an arm at every iteration. Let $A = a_1, \cdots, a_T$ be the sequence of actions taken by the agent. Upon selecting an arm $\Gamma_i$ as action $a_t$, the agent receives a reward $r_t = R(a_t)$ drawn from the probability distribution $\nu_i$; therefore, $E[R(a_t = \Gamma_i)] = \mu_i$.
The objective of the agent is to maximize its expected cumulative reward $\sum_{t=1}^{T} E[R(a_t)]$. Let the optimal expected reward at every iteration $t$ be $\mu_t^* = \max_{i} \mu_i$. Then, the optimal strategy $A^* = a_1^*, \cdots, a_T^*$ would have the expected cumulative reward $\sum_{t=1}^{T} \mu_t^*$. Based on this, the notion of regret for not taking the optimal actions is computed as follows.
$$\text{Regret}(A) = \sum_{t=1}^{T} \mu_t^* - \sum_{t=1}^{T} E[R(a_t)]$$
One can see a straightforward mapping of the unknown DT problem to stochastic bandit problems, where every data source $D_i$ is an arm $\Gamma_i$. In a sequential manner, we would like to select arms in order to collect $Q_j$ tuples from every group $G_j$. Every arm (data source) has an unknown distribution of different groups, and a query to an arm $\Gamma_i$ costs $C_i$. We still need to design the reward function according to the outcome of a query and the cost for issuing the query, which we shall explain in § 4.4.

Exploration-only and Exploitation-only
We begin the section by developing the two extreme strategies: exploration-only and exploitation-only. Exploration-only assumes zero knowledge about the distributions of groups in data sources. Therefore, in each iteration, it randomly chooses a data source to query. However, since the costs to query each source may be different, it gives every source an equal share of the budget. That is, it gives every data source a chance inversely proportional to its cost. Hence, less expensive sources are explored more and, in the end, the expected cost spent on each source is equal.
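The inverse-cost selection rule can be made concrete with a small sketch. It assumes strictly positive per-query costs; the function names are hypothetical.

```python
import random


def exploration_weights(costs):
    """Selection probabilities inversely proportional to query cost."""
    inverse = [1.0 / c for c in costs]
    total = sum(inverse)
    return [w / total for w in inverse]


def pick_source(costs, rng):
    """Draw one source index according to the inverse-cost weights."""
    weights = exploration_weights(costs)
    return rng.choices(range(len(costs)), weights=weights)[0]
```

Multiplying each probability by the corresponding cost gives the same value for every source, which is the equal expected spending described above.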
The exploration-only strategy gives an equal chance to exploring every source, and does not use the knowledge it acquires during the process to adjust its strategy. This strategy works well when all sources have similar distributions. But if sources follow different distributions on groups, exploration-only misses the opportunity to focus on sources with higher rewards.
The other extreme is exploitation-only. This method first queries every data source once, then keeps querying the most promising source, without giving any chance for exploration [68]. As we shall verify in our experiments, this strategy is suitable for cases with a large number of data sources (in the order of the size of the target data set), and when group distributions vary greatly across sources. The reason is that in such cases, the source with the maximum reward value (higher than all other sources) probably has a better expected reward than the average expected reward of other sources (exploration-only), and significant exploration of sources is too expensive. However, it relies on its inaccurate estimates, so it fails to work in most general cases.

Upper Confidence Bound (UCB)
Different strategies have been proposed to balance exploration and exploitation. Probably the most widely accepted is Upper Confidence Bound (UCB) [68]. UCB considers the fact that the statistics are less accurate for less explored arms, and that as the number of explorations of an arm increases, there is less need to explore that arm. To increase the exploration chance for less-explored arms, UCB follows an optimistic strategy for arms with high uncertainty: an arm whose estimate has low confidence is scored by the upper end of its plausible reward range. In other words, UCB favors exploring the arms that have the potential of being optimal.
At every iteration, for every arm, UCB computes confidence intervals for the expected reward, and selects the arm with the maximum upper bound of reward to be explored next. That is, $a_t = \arg\max_{i=1}^{n} \bar{R}(i) + U_t(i)$, where $\bar{R}(i)$ is the average reward gained from the $i$-th arm and $U_t(i)$ is the upper confidence bound. The goal in deriving $U_t(i)$ is to make sure that with a high probability the expected reward of the $i$-th arm is less than $\bar{R}(i) + U_t(i)$. Let $O_i$ be the number of times arm $i$ has been explored (i.e., data source $D_i$ has been queried), and $R_\bot(i)$ and $R_\top(i)$ be the minimum and maximum reward values for $D_i$, respectively. Following Hoeffding's inequality [34], we have
$$P\big(E[R(i)] > \bar{R}(i) + U_t(i)\big) \leq \exp\!\left(-\frac{2\, O_i\, U_t(i)^2}{(R_\top(i) - R_\bot(i))^2}\right)$$
We would like the probability of the true reward not being in the interval to be a small value. Hence, setting the probability as $t^{-4}$, $U_t(i)$ is derived as:
$$U_t(i) = (R_\top(i) - R_\bot(i)) \sqrt{\frac{2 \ln t}{O_i}}$$
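As a small illustration of the selection rule, the sketch below assumes the confidence radius obtained by equating the Hoeffding tail $\exp(-2 O_i U^2 / (R_\top - R_\bot)^2)$ with $t^{-4}$, which gives $U = (R_\top - R_\bot)\sqrt{2 \ln t / O_i}$. The function and parameter names are hypothetical.

```python
import math


def ucb_radius(t, pulls, r_min, r_max):
    """Confidence radius (r_max - r_min) * sqrt(2 ln t / pulls)."""
    return (r_max - r_min) * math.sqrt(2.0 * math.log(t) / pulls)


def select_arm(t, mean_reward, pulls, r_min, r_max):
    """Index of the arm maximizing average reward plus confidence radius.

    Every arm is assumed to have been queried at least once (pulls[i] > 0).
    """
    scores = [mean_reward[i] + ucb_radius(t, pulls[i], r_min[i], r_max[i])
              for i in range(len(pulls))]
    return max(range(len(scores)), key=scores.__getitem__)
```

With equal average rewards, the arm that has been queried less often receives the larger radius and is therefore selected, which is the optimism toward less-explored arms described above.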

Reward Function
The critical missing part of the algorithm developed so far is the reward function. That is, if a query to a data source $D_i$ returns a tuple from the group $G_j$, what reward is obtained? In order to compute the reward of collecting a tuple from group $G_j$, we ask how "hard" it is to collect one tuple of a group. For example, if 90% of the tuples across different data sources belong to $G_j$, most queries will return a tuple from $G_j$. On the other hand, collecting a tuple from a group that is rare requires more effort, and so should be worth more in reward. As a result, one can argue that the reward of obtaining a tuple from $G_j$ is proportional to how "rare" this group is across different data sources, in other words, to the expected cost one needs to pay in order to collect a tuple from $G_j$. In order to compute the expected cost, we assume we know the overall distribution of groups. Such an assumption is reasonable since overall aggregates are often available in public forms such as Bureau reports. Even in the absence of such information, a preprocessing step that randomly selects data sources and samples them can be used for computing these aggregates. Note that acquiring such general statistics would not require extensive queries, and the tuples obtained as a result will be used in the target data set.
Let $0 \leq p_j \leq 1$ be the overall frequency of a group $G_j$. Following the principle of deferred decisions [49] (page 55), if we randomly select a source to query, the expected number of queries required to collect a tuple from $G_j$ is $1/p_j$. Since any source can be selected for sampling, the average cost is $\bar{C} = (\sum_{i=1}^{n} C_i)/n$. Therefore, the expected cost to collect a tuple from $G_j$ is $\bar{C}/p_j$. We would like to assign a high reward to sources that contain tuples of a rare group $G_j$ (small $p_j$). We also penalize the reward based on the cost of sampling from the source, $C_i$. Therefore, the reward of source $D_i$ with respect to $G_j$, namely $R(i, j)$, is $\bar{C}/(p_j \cdot C_i)$. Since $\bar{C}$ is constant across all sources and groups, we remove it from the reward function and write the reward function as follows.
$$R(i, j) = \frac{1}{p_j \cdot C_i}$$
In order to efficiently compute the average rewards of data sources at each iteration, for each data source $D_i$ we maintain a counter of the number of times $D_i$ has been queried; moreover, for each group $G_j$, we maintain the number of unique tuples from $G_j$ that have been discovered by querying $D_i$. Using these variables, Equation 11 gives the average reward of $D_i$.
With the average rewards computed by Equation 11, Algorithm 3 follows the UCB strategy for the DT problem when distributions are unknown. Algorithm 3 has the same space complexity and the same per-iteration time complexity as Algorithms 1 and 2. Nevertheless, the number of iterations depends on the (unknown) data distributions. Assuming that UCB on average requires a lower expected number of iterations than random exploration, we can use the expected number of iterations of the exploration-only strategy as an expected upper bound for the number of iterations in UCB. Using the principle of deferred decisions [49], this number can be computed from the overall distributions: the expected number of queries to collect a sample from $G_j$ is $1/p_j$. Hence the expected number of iterations is bounded by $\sum_{j=1}^{m} Q_j/p_j$, which in turn bounds the overall time complexity when each $Q_j$ is treated as a constant.
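The reward bookkeeping described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's Algorithm 3: the function names are hypothetical, the average reward is assumed to be the accumulated reward of the unique tuples found in a source divided by the number of queries to that source, and the exploration bonus is the standard UCB1 term, which may differ from the one used in the paper. Every source is assumed to have been queried at least once, matching the initial sampling round.

```python
import math

def reward(p_j, c_i):
    """Reward of one tuple of a group with overall frequency p_j,
    obtained from a source with query cost c_i: 1 / (p_j * c_i)."""
    return 1.0 / (p_j * c_i)

def average_reward(unique_counts, queries, p, c_i):
    """Assumed form of the average reward of one source.
    unique_counts[j]: unique tuples of group j discovered in this source.
    queries: number of times this source has been queried (>= 1)."""
    total = sum(k * reward(p[j], c_i) for j, k in enumerate(unique_counts))
    return total / queries

def ucb_score(avg_reward, queries, t):
    """Standard UCB1 index (an assumption): mean reward plus a bonus
    that shrinks as the source is queried more often."""
    return avg_reward + math.sqrt(2.0 * math.log(t) / queries)

def select_source(stats, p, costs, t):
    """stats[i] = (unique_counts, queries) for source i.
    Returns the index of the source with the highest UCB score."""
    scores = [
        ucb_score(average_reward(u, q, p, costs[i]), q, t)
        for i, (u, q) in enumerate(stats)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

def exploration_query_bound(Q, p):
    """Bound from the text on exploration-only queries: sum_j Q_j / p_j."""
    return sum(q_j / p_j for q_j, p_j in zip(Q, p))
```

With two equally priced sources that have each been queried once, the source that returned a tuple of the rarer group receives the higher score, which is the behavior the reward function is designed to produce.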

EXPERIMENTS
We have developed multiple algorithms in this paper: Known-Binary and CoupColl for the case of known distributions, and Exploit, Explore, and UCB for the case of unknown distributions.
We study all of these, and compare them against (i) a computed upper bound of the expected cost, which we calculate using Theorem 2 (Equation 5), and (ii) a random sampling-based algorithm, namely Baseline. The algorithms were implemented in Python. All reported empirical results are the average of 30 runs. A run is terminated when 50,000 samples are collected or the target distribution is fulfilled. Our experiments were conducted on a machine with an Intel® Xeon® Gold 5218 CPU @ 2.30GHz and 512 GB of DDR4 memory.

Data Sources
TexasTribune [9]: Compensation data for Texas state employees has been published by the Texas Tribune. We obtained four employee data sets, each comprising 21 attributes about employees' salary and compensation, employment status, and employer, as well as the employees' details. Among these are two sensitive attributes of interest: gender and ethnicity. Considering the domain of these attributes in the data sets, we have four groups: {female-nonwhite (FNW), female-white (FW), male-nonwhite (MNW), male-white (MW)}. These data sets consist of 5839, 5839, 5840, and 449 tuples. We consider each data set to be a data source and assume the cost of taking a random sample to be one unit for each source.
Flights [7]: The Airborne Flights database, published by the Bureau of Transportation Statistics, contains detailed flight statistics from 1987 to present. The carrier on-time performance of each flight is represented by OP_CARRIER_AIRLINE_ID, ORIGIN_STATE_NM, and ARR_DELAY, among other attributes. We downloaded the flight information of carrier airlines from 2018 to 2020 and obtained 18 data sets of flight data, each related to one airline. We consider the data set of each airline to be a data source and assume the cost of taking a random sample to be one unit for each source. The sizes of these sources vary from 2,014,380 to 410,674,398 tuples.
IMDB [8]: The publicly available database of IMDB contains information about movies and their casts. We used three data sets, title, cast_info, and name, which include 30,335,424 tuples of movies, 253,660,001 tuples of the movie casts, and 37,507,374 tuples of cast individual information, respectively. Any analysis on the casts' gender of movies requires joining these data sets. To evaluate our query and cost model, we obtained three data sets from title based on the year of movies, namely title_2014, title_2015, and title_2016, with 36924, 4812, and 384 tuples, respectively. We consider the join of each title data set with cast_info and name as a data source and assume the cost of taking a random sample from each data set to be one unit.
BenchDL: We synthesized a benchmark to evaluate DT on various cost and data distribution settings. To generate a source with $m$ groups, $N$ tuples, and group $G_j$ as a majority/minority group, BenchDL first assigns tuple ratios to groups according to a distribution model (minority or majority), then generates $N$ tuples according to the tuple ratios. In a majority source, one group has the majority tuple ratio (higher than $1/m$) while the other groups are minorities. For a majority source, BenchDL first initializes all tuple ratios to $1/m$. To make $G_j$ a majority, it iteratively subtracts a random value $\epsilon$ from the ratio of a minority group and adds the reduction to the ratio of $G_j$. Note that the first group selects $\epsilon_1$ from $(0, 1/m)$. The next group may select $\epsilon_2$ from the updated $(0, 1/m - \epsilon_1)$ range, and so on. This guarantees that a minority group has a ratio smaller than $1/m$ while $G_j$ gets a ratio higher than $1/m$. For a minority source, a similar process is followed: all majority groups are initialized with $1/m$ ratios, the minority group $G_j$ is assigned a ratio $\epsilon$ from $(0, 1/m)$, and the remaining $1/m - \epsilon$ ratio is distributed among all majority groups at random. Moreover, BenchDL synthesizes collections of sources with various overall distributions by varying the number of minority sources.
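The ratio assignment described above can be illustrated with a short Python sketch. It is a reading of the prose rather than the BenchDL implementation: the function names are hypothetical, the uniform draws are an assumption, and the "distributed at random" step for minority sources is realized here with random proportional weights.

```python
import random

def open_uniform(rng, hi):
    """Draw a value strictly inside the open interval (0, hi)."""
    while True:
        x = rng.uniform(0.0, hi)
        if 0.0 < x < hi:
            return x

def majority_ratios(m, majority, rng):
    """Start all ratios at 1/m, then move a random amount from each
    minority group to the majority group; the draw range shrinks by the
    amounts already moved, as in the text."""
    ratios = [1.0 / m] * m
    budget = 1.0 / m
    for g in range(m):
        if g == majority:
            continue
        eps = open_uniform(rng, budget)
        ratios[g] -= eps
        ratios[majority] += eps
        budget -= eps
    return ratios

def minority_ratios(m, minority, rng):
    """Give the minority group a ratio in (0, 1/m) and spread the
    remaining 1/m - eps over the other groups with random weights."""
    ratios = [1.0 / m] * m
    eps = open_uniform(rng, 1.0 / m)
    ratios[minority] = eps
    others = [g for g in range(m) if g != minority]
    weights = [rng.random() for _ in others]
    total = sum(weights)
    for g, w in zip(others, weights):
        ratios[g] += (1.0 / m - eps) * w / total
    return ratios

def generate_source(num_tuples, ratios, rng):
    """Generate group labels for num_tuples tuples according to ratios."""
    return rng.choices(range(len(ratios)), weights=ratios, k=num_tuples)
```

For example, `majority_ratios(4, 2, random.Random(0))` yields four ratios that sum to one, with group 2 above 1/4 and every other group below it.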
BenchDL implements three cost models: 1) equal-cost assigns one unit of cost to each sample, 2) random-cost assigns a randomly selected cost from (0,1], and 3) skewed-cost assigns costs following a Zipf distribution. We choose a Zipf parameter of 1.7 for one group of experiments and 30 for the other (each group varies a different experimental parameter), and normalize all costs to (0,1].
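A minimal Python sketch of the three cost models follows. The function names are hypothetical, and the skewed model is an assumption: it uses rank-based Zipf weights $1/k^s$ scaled so the largest cost is 1, rather than whatever sampling procedure BenchDL actually uses.

```python
import random

def equal_costs(n):
    """equal-cost: every source costs one unit per sample."""
    return [1.0] * n

def random_costs(n, rng):
    """random-cost: a cost drawn from (0, 1] for each source.
    rng.random() lies in [0, 1), so 1 - rng.random() lies in (0, 1]."""
    return [1.0 - rng.random() for _ in range(n)]

def skewed_costs(n, s, rng):
    """skewed-cost (assumed form): Zipf-like weights 1 / k**s for ranks
    k = 1..n, normalized so the largest cost is 1, then assigned to
    sources in random order. Most sources end up with costs near zero."""
    weights = [1.0 / (k ** s) for k in range(1, n + 1)]
    top = max(weights)
    costs = [w / top for w in weights]
    rng.shuffle(costs)
    return costs
```

Under this form, a larger parameter `s` makes the costs more skewed: one source keeps cost 1 while the remaining costs shrink toward zero faster.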

Proof of Concept
Use Case: Suppose the data scientist of Example 1 has access to the four TexasTribune data sources and aims to build a data set of size 200 with demographic parity. The data scientist first considers sampling each data source independently and merging the collected samples. With the total count in mind, for each data source, the data scientist chooses a sample size that is proportional to the size of the source. Figure 1a shows the ratio of each demographic in the final sample collected by random sampling. Note that a random sampling technique is agnostic to the target counts as well as the distribution of demographics in each source. This results in a data set with, on average, 39.3% white male employees and only 15.2% non-white female employees. Assuming that obtaining a sample from any source has unit cost, the cost of collecting this data set is on average 201.23. Alternatively, the data scientist can apply the DT algorithms to ensure that the final data has demographic parity. When the distributions of data sources are known, such a data set is collected with an average cost of 302.26; when distributions are unknown, a data set with demographic parity can be generated with average costs of 407.5, 333.7, and 351.7 using Explore, Exploit, and UCB, respectively. The observations from this experiment are as follows: (1) traditional data collection approaches fail to equally represent non-white and female minorities, and (2) at roughly 1.5 to 2 times the cost of random sampling, all of our algorithms could tailor the collected data to include equal counts from all demographic groups.
Query and Cost Model: Continuing with the movie cast example of § 2, we evaluate the expected cost of obtaining samples from the IMDB database. Recall that creating a source for the movie cast example involves performing a project-join query such as $\Pi_{\text{title},\text{gender},\cdots}(\text{title} \bowtie \text{cast\_info} \bowtie \text{name})$. Suppose the results of joining the title data sets title_2014, title_2015, and title_2016 with cast_info and name generate sources $D_1$, $D_2$, and $D_3$, respectively. To evaluate our query and cost model, we implemented a simple version of ripple join [47], an online algorithm for sampling from a join path. The algorithm starts by taking a sample from the first data set in a join path, then iteratively scans and samples the second data set until it finds a tuple matching the first sample. Sampling consecutive data sets in the join path continues until a sample with the target schema is obtained. The algorithm then starts over with a new sample of the first data set in the path. We remark that this algorithm yields random but correlated samples from the join path. Other algorithms for sampling from joins [79] can be used to get independent and uniform samples. The expected cost of obtaining a sample from a join path is the expected number of tuples the algorithm needs to scan and verify to obtain one result tuple. Of course, this cost depends on the distribution of tuples across data sets as well as the number and distribution of values in overlapping tuples. As required by ripple join, we ensure a random ordering of tuples in each data set. A simulation of the described algorithm with 30 runs confirms that obtaining a tuple of $D_1$, $D_2$, and $D_3$ incurs a wide range of costs: 17,793.8, 1,692.5, and 136.5, respectively.
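The simplified sampling procedure described above, together with its scan-count cost, can be sketched as follows. This is an illustrative reading of the prose, not the implementation used in the experiments: tables are in-memory lists of dictionaries, the join attributes are passed explicitly, and the function name, the column names in the usage note, and the `max_restarts` guard are all hypothetical.

```python
import random

def sample_join_path(tables, keys, rng, max_restarts=1000):
    """Draw one joined tuple from a join path and count scanned tuples.

    tables: list of tables, each a list of dict rows, in join-path order.
    keys[k]: attribute joining tables[k] with tables[k + 1].
    Returns (joined_row, cost), where cost is the number of tuples
    scanned, including those scanned in failed attempts.
    """
    cost = 0
    for _ in range(max_restarts):
        # Sample one tuple from the first data set in the path.
        current = dict(rng.choice(tables[0]))
        cost += 1
        complete = True
        for k in range(1, len(tables)):
            key = keys[k - 1]
            order = list(tables[k])
            rng.shuffle(order)  # random ordering of the scanned data set
            match = None
            for row in order:
                cost += 1
                if row[key] == current[key]:
                    match = row
                    break
            if match is None:
                complete = False  # no partner found: start over
                break
            current.update(match)
        if complete:
            return current, cost
    raise RuntimeError("no joinable tuple found within max_restarts")
```

A usage example with toy tables: passing a `title` table keyed by `movie_id`, a `cast_info` table with `movie_id` and `person_id`, and a `name` table with `person_id` and `gender`, with `keys=["movie_id", "person_id"]`, returns one row carrying both the title and the gender attributes, along with the number of tuples scanned to produce it.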
Cost-effectiveness: Having discussed two proofs of concept for DT, we turn our attention to its cost-effectiveness. We use BenchDL to generate repositories of 100 binary sources. In each repository, group $G_1$ is the majority group in a varying percentage of sources and the minority in the rest. Figure 1b shows the cost of collecting a binary target data set, consisting of 500 tuples of each group, using a random sampling algorithm and Known-Binary for repositories of various overall distributions. The random sampling algorithm iteratively selects sources at random and obtains samples until the target distribution is satisfied or a sample budget is exceeded. Consider the case when $G_1$ is the majority group in only one source, $D_1$, and a minority in the remaining 99% of sources. Given equal cost for all sources, $D_1$ is the most cost-effective source for collecting samples of $G_1$ and is selected by Known-Binary. However, random sampling selects $D_1$ with probability 1% and, 99% of the time, attempts to collect $G_1$ from less cost-effective sources, which incurs a higher overall cost. We observe that as the number of cost-effective sources for a group increases, random selection becomes as effective as the optimal Known-Binary. They become on par when a random selection returns a cost-effective source with a 50% chance. Since, in practice, one group is often the minority in most sources, we argue that an intelligent strategy for source selection, like DT, is crucial to cost-effective distribution tailoring.
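The gap between informed and random source selection can be made concrete with a back-of-the-envelope calculation. The sketch below is a simplification and not the Known-Binary algorithm: it assumes unit costs, sources large enough that duplicate tuples can be ignored, and hypothetical group fractions. Under these assumptions the number of queries needed to obtain one tuple of a group is geometric, so its expectation is the reciprocal of the per-query success probability.

```python
def expected_queries_best_source(fractions):
    """Always query the source with the largest fraction of the group."""
    return 1.0 / max(fractions)

def expected_queries_random_source(fractions):
    """Pick a source uniformly at random for every query; the success
    probability per query is the mean fraction across sources."""
    return 1.0 / (sum(fractions) / len(fractions))

# Hypothetical repository: the group is the majority in one source
# (90% of its tuples) and a minority (10%) in the other 99 sources.
fractions = [0.9] + [0.1] * 99
best = expected_queries_best_source(fractions)
rand = expected_queries_random_source(fractions)
```

In this hypothetical setting, random selection needs several times more queries per tuple of the group than always querying the single majority source, and the two values converge as more sources become majority sources for the group.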

Known Distributions
We now turn our attention to evaluating the performance of our proposed algorithms. In the plots of Figures 1d, 2, and 3, the bars (associated with the left y-axis) show the average cost while the dashed lines (associated with the right y-axis) show the average number of samples. In the following experiments, the target count distribution comprises 100 unique tuples of each group.
In Figures 2k and 2l, although the numbers of samples are close, the costs are smaller than in Figures 2c-2h. Note that the average cost of a data source in a random-cost model is higher than in a skewed-cost model, which explains the overall lower costs in Figures 2k and 2l than in Figures 2c-2h. Since CoupColl potentially needs more samples to fulfill the minority counts, on average, it pays more for each sample in the random-cost model. Moreover, overall, the costs and numbers of samples taken for the minority repositories are higher than for the majority repositories, because tuples of a minority group are rare in the repository, so more sampling iterations, and thus a higher cost, are required to achieve the target counts.

Comparison to Baseline
The baseline we consider is a random sampling algorithm. At each epoch, Baseline obtains a batch sample of size twice the largest remaining count requirement among groups and includes the fresh tuples of the batch in the target data set, if needed. The sample batch is collected by randomly selecting a source and obtaining batch samples. Note that in the first epoch Baseline takes the largest number of samples, and the sample size decreases until all count requirements are fulfilled.
In the first set of experiments, our goal is to build a target data set of 5K flights, using the Flights data set, with an equal number of flights from each state. As Figure 1c shows, CoupColl and UCB outperform Baseline, with the former having a drastically smaller data collection cost. Explore, which selects sources at random, is on par with Baseline. Exploit never terminated successfully and is not included in the plot.
CoupColl outperforms Baseline in cost and sample counts for all values of $m$ and $n$, across all cost models and distributions. In particular, CoupColl achieves better performance for the minority data distribution, because Baseline requires multiple sample batches to eventually collect tuples of a minority group. Moreover, since Baseline does not take costs into account, we observe more drastic performance deterioration for the skewed and random cost models.
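The Baseline procedure described above can be sketched in Python as follows. This is a hedged reading of the prose, not the code used in the experiments: sources are in-memory lists of `(tuple_id, group)` pairs with globally unique identifiers, samples are drawn with replacement, and the function and parameter names are hypothetical. The `max_samples` default mirrors the 50,000-sample termination rule stated in the experimental setup.

```python
import random

def baseline(sources, costs, Q, rng, max_samples=50_000):
    """Random-sampling baseline.

    sources: list of sources, each a list of (tuple_id, group) pairs.
    costs[i]: cost of one sample from source i.
    Q: dict mapping each group to its required count.
    Returns (target, total_cost, samples), where target maps the
    collected tuple ids to their groups.
    """
    remaining = dict(Q)
    target = {}
    total_cost = 0.0
    samples = 0
    while any(v > 0 for v in remaining.values()) and samples < max_samples:
        # Batch size: twice the largest remaining count requirement.
        batch_size = 2 * max(remaining.values())
        i = rng.randrange(len(sources))  # source chosen at random
        for _ in range(batch_size):
            if samples >= max_samples:
                break
            tuple_id, group = rng.choice(sources[i])
            total_cost += costs[i]
            samples += 1
            # Keep only fresh tuples of groups that are still needed.
            if tuple_id not in target and remaining.get(group, 0) > 0:
                target[tuple_id] = group
                remaining[group] -= 1
    return target, total_cost, samples
```

Because the batch size is tied to the largest remaining requirement, the first epoch draws the most samples and later epochs shrink, as noted in the text. The source choice ignores both costs and group distributions, which is why this procedure degrades under skewed or random cost models.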

Unknown Distributions
In the following experiments, we assume the distributions of the groups in the data sources are unknown, while the overall distribution is known a priori. UCB and Exploit start with one round of sampling from each data source to initialize the approximate distributions. If the total count of a target is small, an algorithm might achieve the target in the first round, especially when the number of data sources is large. To allow the algorithms to proceed to distribution updates, we consider targets with larger total group counts than in the known case. In the following experiments, the target requires 500 unique tuples of each group.
Number of Groups: We first study the behavior of unknown DT algorithms for different numbers of groups ($m = 2, \ldots, 10$) across all cost models. For each value of $m$ and data distribution, BenchDL generates a repository of 20 data sources with an average of 5K unique tuples. As shown in Figures 3a-3k, unlike Exploit, Explore and UCB
it works with" [19], data collection is considered as a way to address unfairness in predictive models [25]. Representativeness of data collection has been widely studied in the literature [28]. A notion of data representativeness has been proposed as data coverage [10,14,15,36,46], identifying the demographic subgroups that are not represented in data. The input target distribution to a DT problem can be inferred from the result of coverage analysis. Bias has also been studied in the context of approximate query answering [54], where a database is considered as a sample and the goal is to answer approximate queries as if the queries were issued on the true population.
Data Discovery and Data Pricing: Existing approaches for data set discovery [51,80], source selection [59,61], and schema mapping [44,57,65] can be necessary for the source generation step of DT, and their cost can be folded into the cost model. Data set discovery is often formulated as a search problem on repositories using keywords [22,56] or another data set [51,80], and the goal is to find relevant data sets based on the relevance to the keywords or integration-inspired measures. A complementary problem to DT is query-based data pricing [42], which decides the price of the data from the perspective of providers. The output of the data pricing problem can be plugged into the cost model of DT.
Data Distillation and Cleaning: DT is an instance of the data augmentation problem with some additional conditions on the group counts [26]. Moreover, data distillation [58] is particularly applicable in determining the group that a sampled tuple is associated with if such information is absent. In addition, data cleaning is included in the source preparation process and its cost can be folded into the cost model. Cleaning tasks such as entity resolution are necessary for determining the freshness of samples.

Example 3: To better understand Equation 5 with an example, let us consider a group $G_j$, suppose $Q_j = 100$, and consider two cases where (a)  *  =1K vs. (b)  *  =1M. In both cases let  *  =1 and suppose the ratio of $G_j$ is 20%. Following Equation 5, for case (a) where the data source contains 1000 tuples,  *   *  ln (︁   *  /(

Figure 1: (a) Demographic Distributions in Texas Tribune (b) DT vs. Random Sampling (c) DT vs. Baseline on Flights Data (d) Known Binary: Optimal vs. Coupon Collector.

Figure 2: Known DT for Minority and Majority Distributions and Equal, Random, and Skewed Cost Models.

Figures 2a-2l provide more detailed analyses of Baseline and DT.

Table 1: Table of Notations.

5.3.1 Equi-Cost Binary Case. For this set of experiments, we used BenchDL to generate 1K binary data sources with an average of 5K unique tuples. Figure 1d reports the cost and number of samples for Known-Binary and CoupColl when one group is consistently the minority group across all data sources and the cost model is equal-cost. CoupColl is the extension of Known-Binary to non-binary cases and should reduce to it for equi-cost binary cases. This is consistent with the experimental results, where CoupColl follows the same strategy as Known-Binary for binary groups and performs on par in practice. The cost and number of samples slightly decrease as the number of sources increases. This is because with more sources, there is a higher chance of finding better sources for the minorities, i.e., sources with a greater fraction of the minority tuples of interest.

5.3.2 General Case. We evaluate DT algorithms for sources with known distributions on data sets generated using BenchDL.
Number of Groups: We first study the behavior of DT algorithms for different numbers of groups ($m = 2, \ldots, 10$) across all cost models. For each value of $m$ and data distribution, BenchDL generates a repository of 20 data sources with an average of 5K unique tuples. As shown in Figures 2a-2j, the theoretical upper bound of CoupColl is not tight. At each iteration, CoupColl samples from the most cost-effective data source of the minority group. This strategy provides the opportunity for piggybacking, that is, the algorithm collects the non-minority groups while sampling for the minority. The experiments show CoupColl to be a practical algorithm for the DT problem. It is worth noting that the number of samples and the cost increase as the number of groups increases, which can be explained by the increase in target size (the sum of the group counts).
Number of Data Sources: Next, we evaluate the behavior of DT algorithms for different numbers of data sources ($n = 10, \ldots, 1000$) across all cost models. For a data distribution and $n$ of interest, BenchDL generates a repository of $n$ data sources, each with an average of 5K unique tuples that contain four groups. From Figures 2c-2l, the cost and number of samples decrease as the number of data sources increases, because having more sources to choose from increases the chance of finding ones that are more cost-effective, especially for the minorities. In particular, consistently across all experiments, the cost and number of samples drop significantly when there are more than 200 data sources. Notably, adding more sources beyond that point does not decrease the cost much. Still, increasing the number of data sources from 10 to 200 helps reduce the cost.
Cost Models: The skewed-cost model assigns costs in (0,1] to data sources following a Zipf distribution, that is, cheap data sources have costs closer to zero. This explains the lower costs in Figures 2k and 2l.