A Computational Framework for Preserving Privacy and Maintaining Utility of Geographically Aggregated Data: A Stochastic Spatial Optimization Approach

Geographically aggregated data are often considered to be safe because information can be published by group as population counts rather than by individual. Identifiable information about individuals can still be disclosed when using such data, however. Conventional methods for protecting privacy, such as data swapping, often lack transparency because they do not quantify the reduction in disclosure risk. Recent methods, such as those based on differential privacy, could significantly compromise data utility by introducing excessive error. We develop a methodological framework to address the issues of privacy protection for geographically aggregated data while preserving data utility. In this framework, individuals at high risk of disclosure are moved to other locations to protect their privacy. Two spatial optimization models are developed to optimize these moves by maximizing privacy protection while maintaining data utility. The first model relocates all at-risk individuals while minimizing the error (hence maximizing the utility). The second model assumes a budget that specifies the maximum error to be introduced and maximizes the number of at-risk individuals being relocated within the error budget. Computational experiments performed on a synthetic population data set of two counties of Ohio indicate that the proposed models are effective and efficient in balancing data utility and privacy protection for real-world applications.

Geographically aggregated data are frequently used in social, economic, demographic, and geographic applications because they provide useful and important statistics while protecting the privacy of individuals. For example, the U.S. Census Bureau aggregates individual-level census responses at various geographic levels, such as block, block group, and tract, after removing personal identifiers (e.g., names and addresses). In this way, demographics are published as population counts rather than by each person, and individual privacy is considered to be preserved. Information in geographically aggregated data, such as demographic attributes and geographic units, can still be used to reidentify data contributors, however (Sweeney 2000), which is prohibited by many privacy laws and regulations. For example, Title 13 of the United States Code states that the Census Bureau cannot "make any publication whereby the data furnished by any particular establishment or individual under this title can be identified." Similar clauses can be found in the Health Insurance Portability and Accountability Act of 1996 (HIPAA), the Family Educational Rights and Privacy Act (FERPA), and the European Union's General Data Protection Regulation (GDPR). Disclosure can happen in multiple ways using geographically aggregated data (discussed in detail in the next section). One way is to directly link the information in aggregated data to publicly available auxiliary data. Sweeney (2000), for example, showed that individuals with unique combinations of zip code, age, and sex can appear in census data with population counts of one, and these individuals can be reidentified by linking the zip code, age, and sex to publicly available voter registration data. We refer to this type of disclosure as direct disclosure. In addition, a substantial body of literature has demonstrated the feasibility of combining multiple sets of aggregated data (e.g., multiple census tables) to reconstruct the original individual-level data (Garfinkel, Abowd, and Martindale 2019), which can then be linked to auxiliary data to reidentify individuals; we refer to this type of disclosure as reconstruction-abetted disclosure.
Multiple approaches have been developed to avoid disclosure when using geographically aggregated data (see the detailed review in the next section). Conventional methods mostly aim to avoid direct disclosure by suppressing counts of one or two individuals in the aggregated data (Cox 1980; Kelly, Golden, and Assad 1992) or by swapping individuals included in these counts to other locations (Dalenius and Reiss 1982). These methods, although routinely applied in U.S. Census data releases from 1970 to 2010, lack appropriate measures to indicate how much privacy protection is provided (Shlomo 2007). It is also unclear how they can be used to prevent reconstruction-abetted disclosure (Garfinkel, Abowd, and Martindale 2019). In addition, the use of conventional methods often introduces error that affects data utility (the usefulness of the released data to users), and there is no well-defined measure to quantify the impact on utility. Another type of method, applied to the 2020 U.S. Census data, is based on a privacy definition known as differential privacy (Dwork and Roth 2014; Dwork et al. 2017). This type of method applies statistical noise during data aggregation to prevent reconstruction-abetted disclosure, and a privacy parameter is employed to quantify and adjust the privacy protection for the data. To satisfy differential privacy, however, the utility of aggregated data is often substantially compromised because significant noise is applied to all population counts instead of only to those that need protection (Santos-Lozada, Howard, and Verdery 2020; Winkler et al. 2021).
The purpose of this article is to develop a computational framework that can be used to (1) quantify the disclosure risks and utility of geographically aggregated data and (2) minimize the risks while preserving data utility. Specifically, individuals at high risk of disclosure are assigned to different locations to protect their privacy, and spatial optimization models are built to optimize these moves so that protection is maximized while the introduced error is minimized or controlled. In the next section, we detail the rationale of this research by introducing the disclosure problem and examining the limitations of existing disclosure avoidance methods. We then describe our problem formulation before introducing the quantitative measures as well as the proposed spatial optimization models. A set of computational experiments is then used to demonstrate the effectiveness of the spatial optimization models. We conclude the article with a discussion of the limitations and future directions of this research.

Background and Problem Statement
The production of geographically aggregated data starts with collecting data at the individual level and anonymizing them by removing personal identifiers (Figure 1A). Aggregation can then be performed to generate counts of individuals in each location who share certain characteristics, and these counts are also anonymized (Figure 1B). As discussed earlier, though, two types of disclosure can occur: direct and reconstruction-abetted disclosure. In our example, direct disclosure can happen when we link the aggregated data in Figure 1Bi with an auxiliary data set (Figure 1C) that includes names as personal identifiers. As shown in Figure 1Bi, only one individual lives in block 1000 and is married, and the linkage can reveal this individual to be Bob with a probability of one. The linkage has a probability of one in four of revealing an individual to be Alice, because there are four married individuals in block 1100 according to Figure 1Bi. We refer to the probability of revealing an individual's identity as the disclosure risk, which increases as the count of individuals in a location decreases. Reconstruction-abetted disclosure occurs if the original individual-level data can be reconstructed. In our example, if the three sets of aggregated data in Figure 1B are all published, they can be combined to reconstruct the exact individual-level data in Figure 1A using simple methods such as linear optimization (Garfinkel, Abowd, and Martindale 2019). Auxiliary data can then be linked to the reconstructed individual-level data in a similar manner, which can lead to disclosure.
Conventional disclosure avoidance methods primarily aim to prevent direct disclosure. One common method is data suppression (Cox 1980; Kelly, Golden, and Assad 1992), which considers all small population counts sensitive and replaces them with a prespecified symbol such as "-". This method was used in the U.S. Census from 1970 to 1990, but it has many problems. Because the suppressed counts can be uncovered by subtracting nonsensitive counts from the published population totals, data suppression often requires withholding a significant portion of nonsensitive counts as well, a process known as complementary suppression (Duke-Williams and Rees 1998; Matthews, Harel, and Aseltine 2016). Suppressing both sensitive and nonsensitive counts can result in a large amount of missing data and therefore significantly compromises data utility (Tiwari, Beyer, and Rushton 2014). Another limitation of data suppression is the lack of well-defined measures for assessing both the privacy protection provided and the impact on data utility. Specifically, we can quantify neither the risks of direct disclosure after suppression nor the effectiveness against reconstruction-abetted disclosure (Shlomo 2007; Garfinkel, Abowd, and Martindale 2019). The loss of utility caused by data suppression also cannot be quantitatively evaluated and is not transparent to data users (Shlomo 2007).
Since the 2000s, data swapping has become the prevalent conventional method applied to census data for disclosure avoidance (Dalenius and Reiss 1982; Shlomo, Tudor, and Groom 2010). Unlike data suppression, swapping is applied to individual-level data prior to aggregation. The process starts with small population counts in the aggregated data, because individuals included in these counts are considered at high risk of disclosure. The location of each high-risk individual is then exchanged with that of an individual not considered at risk from a different geographic unit. The high-risk and nonrisky individuals should share the same values for a set of control variables (e.g., age group, sex, and other simple demographics) that often have large counts and are unlikely to be used to reveal individual identities. In practice, though, data swapping tends to be nontransparent about its key components, such as the control variables, which makes it difficult to quantify both the privacy protection it provides and its effect on data utility (Shlomo and Young 2006).

Differential privacy (Dwork et al. 2006; Dwork and Roth 2014) is a more recent privacy definition on which various methods for preventing reconstruction-abetted disclosure have been built. These methods introduce statistical noise during data aggregation such that changes to any individual do not significantly affect the aggregated data. Looking at the noisy aggregated data, it is difficult to determine whether a particular individual's information was used to produce them. Accordingly, even when individual-level data can be reconstructed from multiple sets of noisy aggregated data, they will differ from the original data collected, which effectively protects against reconstruction-abetted disclosure (Dwork et al. 2017; Garfinkel, Abowd, and Martindale 2019). Differential privacy-based approaches also use a privacy parameter to quantify the privacy protection against reconstruction-abetted disclosure. Because differential privacy requires perturbing all aggregated data rather than only the small counts (Cohen and Nissim 2020), however, the aggregated data produced tend to be overly protected, with unnecessarily high reductions in data utility (Domingo-Ferrer, Sánchez, and Blanco-Justicia 2021). In addition, the noise injected by existing differential privacy-based methods is geographically invariant and can introduce relatively high error into data for small and rural areas (Santos-Lozada, Howard, and Verdery 2020; Winkler et al. 2021).
Protecting privacy while preserving the utility of geographically aggregated data remains a significant challenge in disclosure avoidance. Quantitative measures of data privacy and utility are needed, as are methods that optimize how disclosure is avoided when using the data. An ideal disclosure avoidance practice preserves as much data utility as possible while achieving maximum privacy protection. We should also note that certain requirements for data utility might exist in practice, such as a budget for the maximum error that can be introduced (Van Riper, Kugler, and Ruggles 2020). In this case, we might wish to ensure as much privacy protection as possible given the error budget. In this article, we aim to address these issues by developing a computational framework that reflects the workflow of how the original individual-level data are transformed into geographically aggregated data. This framework includes measures that can be used to assess disclosure risks and data utility, as well as optimization models for identifying optimal ways to avoid disclosure.

Problem Formulation
Assume that individual-level data are collected in n locations (or geographic units), where each individual has d attributes. Figure 1A illustrates such a data set, which includes three geographic units (blocks 1000, 1100, and 1200) and two attributes (marital status and age group). To summarize or count individuals using the d attributes, we define a predicate as a tuple of d elements $(s_1, s_2, \ldots, s_d)$, where each element is a criterion that corresponds to the attribute indexed by the subscript. Applying a predicate to the individual-level data yields the number of individuals who satisfy the criteria in the predicate. For example, the predicate (Married, 15-24) returns the number of individuals who are married and between fifteen and twenty-four years old; when it is applied to block 1000, we obtain a count of one because only person one satisfies those criteria. A feasible predicate should describe individuals who can possibly exist in the individual-level data. For the example in Figure 1A, (Married, <15) is not a feasible predicate because individuals under the age of fifteen should not be married. We use S to denote the set containing all feasible predicates and m to denote its size. In our example, the seven predicates listed in Figure 2 form S. Using the set S, we can transform the individual-level data into an $m \times n$ matrix $X = \{x_{ki}\}$, where $x_{ki}$ denotes the number of individuals in the ith location who satisfy the criteria in the kth predicate of S. Specifically, a $7 \times 3$ matrix X represents the individual-level data in Figure 1A; this matrix is shown in Figure 2. Although X comprises population counts and appears to be aggregated data, it is equivalent to the original individual-level data because the individual-level data can be reconstructed by simply listing all individuals counted in X. The individual-level data matrix X therefore should not be made public. Instead, it should be further aggregated so that population counts are published by predicates formed from a subset of the attributes (rather than by all d attributes).
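To make this construction concrete, the following sketch builds a toy version of X with NumPy. The records and the four-predicate set are hypothetical stand-ins, not the data in Figures 1 and 2.

```python
import numpy as np

# Hypothetical individual records: (location, marital status, age group).
individuals = [
    ("1000", "Married", "15-24"),
    ("1000", "Single", "15-24"),
    ("1100", "Single", "<15"),
    ("1200", "Married", "25-64"),
]

# S holds all feasible predicates (a toy subset of the seven in Figure 2).
S = [("Married", "15-24"), ("Married", "25-64"),
     ("Single", "<15"), ("Single", "15-24")]
locations = ["1000", "1100", "1200"]

# X[k, i] counts the individuals in location i who satisfy predicate k.
X = np.zeros((len(S), len(locations)), dtype=int)
for loc, marital, age in individuals:
    X[S.index((marital, age)), locations.index(loc)] += 1

print(X)
# [[1 0 0]
#  [0 0 1]
#  [0 1 0]
#  [1 0 0]]
```

Listing every individual counted in X reproduces the input records exactly, which is why X is equivalent to the individual-level data.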
Aggregation is performed using a collection of counting queries on X. A counting query obtains the count of individuals in each geographic unit who meet the criteria for a subset of the attributes. A counting query can be represented as an m-dimensional row vector $a^T \in \{0, 1\}^m$, where the kth element is set to one if individuals who satisfy the criteria in predicate k are included in the count. For example, [1, 1, 0, 1, 0, 1, 0] is a counting query that counts all singles in each block. Formally, the product $a^T X$ returns the number of individuals counted in X who meet the criteria specified by counting query $a^T$ in each geographic unit. For example, premultiplying the X in Figure 2 by [1, 1, 0, 1, 0, 1, 0] returns [7, 3, 5], the counts of singles in blocks 1000, 1100, and 1200.
In practice, multiple counting queries are used together to construct a large aggregated data set. Assume a collection of q counting queries $a_1^T, a_2^T, \ldots, a_q^T$ is used to produce a set of geographically aggregated data. These queries can be arranged by rows to form a $q \times m$ query matrix $A = \{a_{k'k}\}$, where $a_{k'k}$ is set to one if the k'th counting query ($1 \le k' \le q$) counts individuals who satisfy the criteria in predicate k, and zero otherwise. Figure 3A illustrates such a query matrix. An aggregated data matrix can then be derived as $Y = AX = \{y_{k'i}\}$, where $y_{k'i}$ denotes the number of individuals in location i counted using the k'th counting query in A (Figure 3B). Direct disclosure can occur when using the aggregated data Y, and reconstruction-abetted disclosure can happen when the individual-level data X are reconstructed. For any individual counted in Y, the risk of direct disclosure is one over the value of the matrix element that counts this person. In the example in Figure 3B, any individual counted in the first element of Y has a risk of one in seven. Assuming the collected individual-level data can be reconstructed, the risk of reconstruction-abetted disclosure for any individual is one over the value of the matrix element in X that counts this person. We refer to the disclosure risk for an individual as the individual disclosure risk. Individual disclosure risks can be averaged to obtain a global disclosure risk for an entire data matrix. Specifically, the global disclosure risk for Y indicates the risk of direct disclosure, and that for X indicates the risk of reconstruction-abetted disclosure.
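Continuing the toy example, the following sketch shows how a single counting query and a query matrix act on X. The concrete queries here are illustrative assumptions, not the ones in Figure 3.

```python
import numpy as np

# The toy X from the previous sketch (m = 4 predicates, n = 3 locations).
X = np.array([[1, 0, 0],
              [0, 0, 1],
              [0, 1, 0],
              [1, 0, 0]])

# One counting query: count individuals matching predicates 2 and 3
# (standing in for "all singles") in each location.
a = np.array([0, 0, 1, 1])
print(a @ X)        # [1 1 0]: per-location counts for this query

# A query matrix A stacks q queries row by row; Y = AX is the aggregated
# data, and 1 / Y[k', i] is the direct disclosure risk of any individual
# counted in a nonzero element Y[k', i].
A = np.array([[1, 1, 0, 0],     # e.g., all married
              [0, 0, 1, 1]])    # e.g., all single
Y = A @ X
print(Y)
# [[1 0 1]
#  [1 1 0]]
```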
Individuals at high risk of disclosure significantly affect the global disclosure risk. To determine the individuals at risk, we define a k-element as a nonzero matrix element in X or Y with a population count less than or equal to an integer k. Any individual counted in a k-element has an individual disclosure risk of at least one in k. In the example in Figure 2, with k = 3, all nonzero elements with values no greater than three are three-elements, and the individual disclosure risk for a person counted in these elements is at least one in three. When a small k value is set, individuals who contribute to k-elements have high risks of disclosure, and reducing the number of k-elements can therefore decrease the global disclosure risk.
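A minimal helper for locating k-elements might look as follows; it is a sketch that simply applies the thresholding in the definition above.

```python
import numpy as np

def k_elements(X, k):
    """Indices of the k-elements of X: nonzero counts no greater than k.
    Individuals counted in these elements have an individual disclosure
    risk of at least 1/k."""
    rows, cols = np.nonzero((X > 0) & (X <= k))
    return list(zip(rows.tolist(), cols.tolist()))
```

For example, `k_elements(X, 1)` returns the positions of all one-elements, whose occupants are uniquely identifiable if X is reconstructed.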
The global disclosure risk can be reduced if we transform the individual-level data X before aggregation so that, given a small value of k, the presence of k-elements in X decreases. The rationale is that if the number of k-elements in the individual-level data can be reduced, no matter how the data are aggregated, no new k-elements will appear in the aggregated data, thereby lowering the risk of direct disclosure. In addition, even if the individual-level data can be reconstructed, it would be the transformed data with a decreased number of k-elements. The risk of reconstruction-abetted disclosure is therefore also reduced.
The approach we devise to transform X is that, given a small value of k, each at-risk individual counted in the k-elements of X is assigned to a different geographic location with a transition probability, or the probability of assigning an individual from one location to another (Figure 4). There are numerous advantages to using transition probabilities rather than a binary policy of whether or not an at-risk individual is assigned to a specific location. One advantage is that a range of optimal assignments can be provided instead of one prescribed by the model, which allows us to fully examine the potential outcomes of assigning individuals in geographic space. In addition, because actual assignments are based on probabilities, randomness is introduced into the assignment process, which helps to preserve the privacy of at-risk individuals.

Figure 3. Aggregation using a query matrix. (A) A $2 \times 7$ query matrix A consisting of two counting queries, which are used to obtain population counts by marital status in each block. (B) A $2 \times 3$ matrix Y that represents the aggregated data in Figure 1Bi, obtained by multiplying the query matrix A with the individual-level data matrix X in Figure 2B.
Assigning individuals to different locations converts the counts in the k-elements of X to zero, at the cost of increasing the values of other elements in the same rows of the matrix. To ensure data utility, at-risk individuals should ideally have high transition probabilities toward locations with a large population satisfying the criteria in the same predicate (i.e., they are moved to large matrix elements in the same row of X). In Figure 4A, for example, the person in block 1100 should have a higher probability of being assigned to block 1000 than to block 1200, because the latter would result in more error relative to the original matrix element. Our goal is to determine the optimal transition probabilities such that we can assign all at-risk individuals to new locations while minimizing the introduced error. When assigning all at-risk individuals would impose excessive error, an error budget can be used to specify the maximum error to be introduced, and we can determine the optimal transition probabilities for assigning as many at-risk individuals to new locations as possible within the error budget. Relocating individuals for privacy protection could alter the demographic compositions of different geographic areas; we examine the impacts on demographic diversity later, where the computational experiments are discussed.

Methodology
The proposed computational framework is illustrated in Figure 5. An individual-level data set (X) is first queried to produce aggregated data (Y), and their global disclosure risks are evaluated. If the individual-level or aggregated data have a high disclosure risk, spatial optimization models are used to determine optimal transition probabilities for assigning at-risk individuals, creating the transformed individual-level data ($\tilde{X}$), which can then be queried to generate the transformed aggregated data ($\tilde{Y}$). Measures of disclosure risks and data utility are then used to evaluate the transformed data.
We formally define the transition probability, $h_{kij}$, as the probability of assigning an individual counted in X who satisfies the criteria in predicate k from location i to location j. After individuals are assigned based on the transition probabilities, the original individual-level matrix is transformed into a new matrix. We denote the expectation of the transformed individual-level data as $\tilde{X} = \{\tilde{x}_{ki}\}$, where $\tilde{x}_{ki}$ is the expected number of individuals in location i who satisfy the criteria in predicate k. The value of $\tilde{x}_{ki}$ is calculated as the sum of the expected number of nonassigned individuals in i ($h_{kii} x_{ki}$) and the expected numbers of individuals assigned from all other locations ($\sum_{j=1,\, j \ne i}^{n} h_{kji} x_{kj}$):

$$\tilde{x}_{ki} = h_{kii}\, x_{ki} + \sum_{j=1,\, j \ne i}^{n} h_{kji}\, x_{kj}. \qquad (1)$$

$\tilde{X}$ can be queried to generate the expectation of the transformed aggregated data as

$$\tilde{Y} = A\tilde{X} = \{\tilde{y}_{k'i}\}, \qquad (2)$$

where $\tilde{y}_{k'i}$ is the expected number of individuals in location i counted using counting query k'. In practice, the actual transformed data can be obtained by drawing a realization from the transition probabilities. When we refer to transformed data in this article, however, we mean the expectation ($\tilde{X}$ or $\tilde{Y}$) rather than a specific realization, because the former provides an anticipated average of all possible transformed data given certain transition probabilities.

Figure 4. Assigning at-risk individuals to candidate locations. Using the data in Figure 2, consider two one-elements in block 1100: (A) person 14, the only individual satisfying predicate (Single, <15), and (B) person 13, who satisfies predicate (Single, >64). These two individuals need to be protected and assigned to candidate blocks because they have a high disclosure risk of one. Three possible assignments can be made for each individual, as illustrated by the three arrows in each diagram. We use transition probabilities to describe the likelihood of each assignment.
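Under these definitions, the expectation in Equation 1 and a single realization can be sketched as follows. The function names are ours, and the transition probabilities are assumed to be given as an $m \times n \times n$ array.

```python
import numpy as np

def expected_transform(X, H):
    """Expectation of the transformed matrix (Equation 1). H[k, i, j] is
    the transition probability of moving an individual counted in X[k, i]
    to location j; each row H[k, i, :] sums to one, with H[k, i, i] the
    probability of staying."""
    m, n = X.shape
    Xt = np.zeros((m, n))
    for k in range(m):
        Xt[k] = X[k] @ H[k]   # x~[k, j] = sum_i x[k, i] * h[k, i, j]
    return Xt

def draw_realization(X, H, seed=0):
    """One concrete assignment drawn from the transition probabilities,
    as opposed to the expectation used throughout the article."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    Xr = np.zeros((m, n), dtype=int)
    for k in range(m):
        for i in range(n):
            for _ in range(int(X[k, i])):   # move each person once
                Xr[k, rng.choice(n, p=H[k, i])] += 1
    return Xr
```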

Disclosure Risks
Assuming the transformed individual-level data $\tilde{X}$ can be reconstructed, the individual disclosure risk for each person counted in the transformed matrix element $\tilde{x}_{ki}$ can be estimated as the ratio between the expected number of this individual not being assigned ($h_{kii}$) and the population count represented by $\tilde{x}_{ki}$, as adapted from Shlomo (2014):

$$\tilde{p}_{ki} = \frac{h_{kii}}{\tilde{x}_{ki}}. \qquad (3)$$

The global disclosure risk for the transformed individual-level data, which indicates the risk of reconstruction-abetted disclosure after assignment, can be estimated by averaging the individual disclosure risk ($\tilde{p}_{ki}$) over all N elements of $\tilde{X}$:

$$\tilde{s} = \frac{1}{N} \sum_{k=1}^{m} \sum_{i=1}^{n} \tilde{p}_{ki}. \qquad (4)$$

When no individuals are assigned to other locations, we have $h_{kii} = 1$, and $\tilde{X}$ is identical to the original data X ($\tilde{x}_{ki} = x_{ki}$). The individual disclosure risk for each person counted in element $x_{ki}$ in the original data can then be estimated as

$$p^*_{ki} = \frac{1}{x_{ki}}. \qquad (5)$$

The global disclosure risk for the original individual-level data can be formally defined as

$$s^* = \frac{1}{N} \sum_{k=1}^{m} \sum_{i=1}^{n} p^*_{ki}. \qquad (6)$$

Because we always have $p^*_{ki} \ge \tilde{p}_{ki}$ and $s^* \ge \tilde{s}$, we call $p^*_{ki}$ and $s^*$ the worst-case individual and global disclosure risks, respectively. They provide a baseline against which to analyze the disclosure risks.
Not all at-risk individuals are necessarily assigned to other locations, depending on whether an error budget is used and its value. In the case where not all at-risk individuals are assigned, a true one-element might exist: a population count of one both before and after transformation. The presence of true one-elements is a particular privacy concern for geographically aggregated data (Bethlehem, Keller, and Pannekoek 1990; Skinner and Shlomo 2008; Young, Martin, and Skinner 2009). A true one-element exists when $h_{kii} = 1$ and $\tilde{x}_{ki} = 1$, and the individual disclosure risk for the person counted in this element is at its maximum value of one. To measure the disclosure risks with respect to the true one-elements, we define a second risk measure, called the population uniqueness rate, as the probability of finding a true one-element in the transformed individual-level data $\tilde{X}$:

$$\tilde{\phi} = \frac{1}{N} \sum_{k=1}^{m} \sum_{i=1}^{n} I(h_{kii} = 1 \text{ and } \tilde{x}_{ki} = 1), \qquad (7)$$

where $I(\cdot)$ is an indicator function that returns one if its input condition is true and zero otherwise. When no individuals are assigned to other locations, all one-elements in the original data X are maintained, and we have a worst-case population uniqueness rate:

$$\phi^* = \frac{1}{N} \sum_{k=1}^{m} \sum_{i=1}^{n} I(x_{ki} = 1). \qquad (8)$$

These measures indicate the risks of reconstruction-abetted disclosure by examining the individual-level data ($\tilde{X}$ and X). To assess the risks of direct disclosure, we can develop similar risk measures for the aggregated data ($\tilde{Y}$ and Y; see Appendix A).
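The risk measures above can be sketched as follows, under the stated reading of Equations 3 through 8 and the assumption that zero elements contribute no risk.

```python
import numpy as np

def disclosure_risks(X, Xt, H):
    """Global disclosure risks and population uniqueness rates.
    X: original matrix; Xt: expected transformed matrix;
    H[k, i, j]: transition probabilities (H[k, i, i] = stay probability)."""
    N = X.size
    stay = np.einsum("kii->ki", H)          # h_kii for every element
    # Individual risk on nonzero elements (Equation 3), zero elsewhere.
    p = np.where(Xt > 0, stay / np.where(Xt > 0, Xt, 1), 0.0)
    s = p.sum() / N                          # global risk (Equation 4)
    # Worst-case risks with no assignment (Equations 5 and 6).
    p_star = np.where(X > 0, 1.0 / np.where(X > 0, X, 1), 0.0)
    s_star = p_star.sum() / N
    # Population uniqueness rates (Equations 7 and 8).
    phi = ((stay == 1) & (Xt == 1)).sum() / N
    phi_star = (X == 1).sum() / N
    return s, s_star, phi, phi_star
```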

Data Error and Utility
Data utility is often measured through its inverse concept, data error, the discrepancy between the transformed and original data (Domingo-Ferrer and Torra 2001; Shlomo and Young 2006). The utility of data decreases as the amount of error introduced increases. When individuals are assigned to new locations, error is introduced to the matrix elements involved in the assignment. We commonly measure the error using the rate of absolute discrepancy, but this rate becomes infinite and undefined when the original element is zero. A way of resolving this is to employ a user-defined large value (e.g., 50) as a penalty in place of the infinite error (Kim and Kim 2016). For the individual-level data, the penalized error of a matrix element is computed as

$$d_{ki} = \begin{cases} \dfrac{|\tilde{x}_{ki} - x_{ki}|}{x_{ki}} & \text{if } x_{ki} > 0 \\[6pt] P\,|\tilde{x}_{ki} - x_{ki}| & \text{if } x_{ki} = 0, \end{cases} \qquad (9)$$

where P is the penalty value. The penalized error ($d_{ki}$) might seem cumbersome because of the penalty value, but it offers irreplaceable convenience in formulating and solving the optimization models (more details in the next subsection).
The utility of the transformed data can be evaluated by averaging the error over all matrix elements. To avoid the impact of penalty values, the error here uses the sum of the original and transformed matrix elements as the denominator rather than the original element alone:

$$d'_{ki} = \frac{|\tilde{x}_{ki} - x_{ki}|}{\tilde{x}_{ki} + x_{ki}}. \qquad (10)$$

We refer to the average of this error as the symmetric mean error (SME). For the transformed individual-level data $\tilde{X}$, the SME is

$$\text{SME} = \frac{1}{N} \sum_{k=1}^{m} \sum_{i=1}^{n} d'_{ki}. \qquad (11)$$

A similar measure can be computed for the transformed aggregated data $\tilde{Y}$ (Appendix A). Values of the SME range from zero to one, with a higher SME indicating more error introduced (Makridakis 1993).
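A sketch of the penalized error and the SME, assuming the piecewise reading of Equation 9 given above and treating elements that are zero in both matrices as contributing no error:

```python
import numpy as np

def penalized_error(X, Xt, penalty=50.0):
    """Elementwise penalized error (Equation 9): relative discrepancy,
    with a user-defined penalty replacing division by zero."""
    with np.errstate(divide="ignore", invalid="ignore"):
        d = np.abs(Xt - X) / X
    return np.where(X > 0, d, penalty * np.abs(Xt - X))

def sme(X, Xt):
    """Symmetric mean error (Equations 10 and 11)."""
    denom = X + Xt
    d = np.where(denom > 0,
                 np.abs(Xt - X) / np.where(denom > 0, denom, 1),
                 0.0)
    return d.mean()
```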

Spatial Optimization Modeling
Spatial optimization models are developed to determine the optimal transition probabilities ($h_{kij}$) for assigning at-risk individuals. To do so, it is critical to determine the feasible candidate locations (i.e., locations indexed by j) to which at-risk individuals (at location i) can be assigned to protect their privacy. We say that a candidate location covers an at-risk individual if it is a feasible location to accommodate that individual. Feasibility is determined by two important factors. First, to avoid overcrowding, each candidate location has a user-specified capacity, which is the maximum number of individuals that can be assigned to it. More important, we consider two principles to model where the feasible locations can be.
The first principle, called free assignment, provides the most relaxed way of assigning at-risk individuals in geographic space. This principle only prevents persons from being assigned to their original locations or to candidate locations that already have at-risk individuals satisfying the same predicates. Under this principle, an individual can be assigned to any other candidate location, including those distant from his or her original location. The use of the free assignment principle could therefore disrupt the spatial pattern of the data, because individuals might be moved too far from their original locations (Wieland et al. 2008). To preserve the spatial pattern of the data after assignment, we propose a second coverage principle, called spatial proximity, which takes into account the spatial relations between at-risk individuals and their candidate locations. A candidate location is considered feasible if it is a neighboring location of the at-risk individual and meets the conditions specified by the free assignment principle. An effective way to determine neighboring locations is to create a buffer centered at the original location of each individual using a prespecified distance and then consider candidate locations within the buffer as neighboring locations. The k-nearest neighbors (k-NN) approach is another way to account for spatial proximity, where the k nearest candidate locations of each individual are deemed neighboring locations (note that this k is distinct from the k used to define k-elements).
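As an illustration, a k-NN coverage matrix could be computed roughly as follows. This is a sketch using SciPy on location centroids; the additional per-predicate screening of conflicting at-risk individuals required by the free assignment principle is left out.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_coverage(centroids, num_neighbors=9):
    """Spatial-proximity coverage: location j covers individuals at
    location i if j is among i's nearest neighbors. Returns a boolean
    n-by-n matrix over locations; predicate-specific screening would
    be applied on top of this to obtain t_kij."""
    n = len(centroids)
    tree = cKDTree(centroids)
    # Query k + 1 neighbors because each point is its own nearest one.
    _, idx = tree.query(centroids, k=num_neighbors + 1)
    cover = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in idx[i]:
            if j != i:
                cover[i, j] = True
    return cover
```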
To formulate the spatial optimization models, users must also determine the k-elements in the individual-level data by specifying the value of k. Individuals counted in k-elements are deemed at risk and need to be assigned to candidate locations. An error budget u is also needed if one wants to limit the maximum error to be introduced into the data. The optimization models use the following parameters and decision variables (all notation used in this article is summarized in Appendix B):

- $x_{ki}$: the number of individuals in location i who satisfy the criteria in predicate k (a constant taken from the original data X);
- $t_{kij}$: one if candidate location j covers the at-risk individuals counted in $x_{ki}$, and zero otherwise;
- $r_j$: the capacity of candidate location j;
- $w_{kij}$: the penalized error introduced by assigning one individual counted in $x_{ki}$ to location j;
- $h_{kij}$: the decision variables, that is, the transition probabilities.

Specifically, $w_{kij}$ can be computed using the penalized error in Equation 9 as

$$w_{kij} = \frac{1}{x_{ki}} + \begin{cases} \dfrac{1}{x_{kj}} & \text{if } x_{kj} > 0 \\[6pt] P & \text{if } x_{kj} = 0, \end{cases} \qquad (12)$$

which accounts for the error of removing one individual from element $x_{ki}$ and adding one to element $x_{kj}$. Calculating $w_{kij}$ in this manner allows us to compute the total error introduced when formulating the optimization problems. Because the denominators in Equation 12 are constants, the formulated optimization problems are linear and can be solved efficiently using existing mathematical optimization solvers.

Minimum Error with Full Protection
In the first model, we assign all at-risk individuals to candidate locations that cover them while minimizing the total error introduced, assuming no upper limit on the amount of introduced error. We refer to this model as the minimum error model (MEM). The basic form of the MEM extends two well-known spatial optimization problems: the set covering problem (Murray 2005; Murray, O'Kelly, and Church 2008) and the assignment problem (Munkres 1957; Lawler 2001). Instead of optimizing binary decision variables that determine whether or not an individual is assigned to a candidate location, however, we optimize the transition probabilities so that the probability of each possible assignment can be decided. Letting K denote the set of index pairs (k, i) of the k-elements in X, we formulate the MEM as

$$\text{minimize} \sum_{(k,i) \in K} \sum_{j=1}^{n} w_{kij}\, x_{ki}\, h_{kij} \qquad (13)$$

subject to

$$\sum_{j=1}^{n} h_{kij} = 1 \quad \forall (k,i) \in K \qquad (14)$$

$$h_{kij} \le t_{kij} \quad \forall (k,i) \in K,\ \forall j \qquad (15)$$

$$\sum_{(k,i) \in K} x_{ki}\, h_{kij} \le r_j \quad \forall j \qquad (16)$$

$$0 \le h_{kij} \le 1 \quad \forall (k,i) \in K,\ \forall j. \qquad (17)$$

Equation 13 minimizes the total penalized error introduced to the individual-level data. The constraints in Equation 14 state that, for each at-risk individual, the probabilities of all possible assignments sum to one; this ensures that each at-risk individual will be assigned to one and only one candidate location. The constraints in Equation 15 ensure that at-risk individuals will only be assigned to locations that cover them. The constraints in Equation 16 specify that the expected number of individuals assigned to a candidate location does not exceed its capacity. The constraints in Equation 17 specify the range of the decision variables. For the purposes of demonstration, we apply the MEM to determine the transition probabilities in Figure 4; the results are presented in Appendix C.
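Because the models are solved with Gurobi in the experiments, a gurobipy sketch of the MEM under the reconstructed formulation might look as follows. The function and parameter names are ours, not the authors'.

```python
import gurobipy as gp
from gurobipy import GRB

def solve_mem(X, w, t, r, k_risk):
    """Minimum error model (Equations 13-17), a sketch. X: m-by-n counts;
    w[k][i][j]: penalized error weights; t[k][i][j]: coverage (0/1);
    r[j]: capacities; k_risk: the k defining k-elements."""
    m, n = len(X), len(X[0])
    K = [(k, i) for k in range(m) for i in range(n)
         if 0 < X[k][i] <= k_risk]

    model = gp.Model("MEM")
    h = model.addVars([(k, i, j) for (k, i) in K for j in range(n)],
                      lb=0.0, ub=1.0, name="h")
    # (13) minimize total penalized error introduced by the assignments.
    model.setObjective(gp.quicksum(w[k][i][j] * X[k][i] * h[k, i, j]
                                   for (k, i) in K for j in range(n)),
                       GRB.MINIMIZE)
    # (14) each at-risk individual is assigned to exactly one location.
    model.addConstrs(gp.quicksum(h[k, i, j] for j in range(n)) == 1
                     for (k, i) in K)
    # (15) assignments allowed only to covering locations.
    model.addConstrs(h[k, i, j] <= t[k][i][j]
                     for (k, i) in K for j in range(n))
    # (16) expected arrivals at each location stay within its capacity.
    model.addConstrs(gp.quicksum(X[k][i] * h[k, i, j]
                                 for (k, i) in K) <= r[j]
                     for j in range(n))
    model.optimize()
    return {key: var.X for key, var in h.items()}   # optimal h_kij
```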

Maximum Protection with an Error Budget
Assigning all at-risk individuals to feasible candidate locations as specified in the MEM might introduce excessive error. As a result, a prespecified budget for the maximum error allowed is often needed in actual applications (Van Riper, Kugler, and Ruggles 2020). We therefore develop a second optimization model that maximizes the number of at-risk individuals being assigned given the maximum error allowed (denoted as u); this model is referred to as the maximum protection model (MPM). The MPM extends both the maximum coverage problem (Church and ReVelle 1974) and the assignment problem (Munkres 1957; Lawler 2001), again using probabilities as decision variables instead of binary ones. The MPM is formulated as

$$\text{maximize} \sum_{(k,i) \in K} \sum_{j=1,\, j \ne i}^{n} x_{ki}\, h_{kij} \qquad (18)$$

subject to

$$\sum_{j=1}^{n} h_{kij} = 1 \quad \forall (k,i) \in K \qquad (19)$$

$$h_{kij} \le t_{kij} \quad \forall (k,i) \in K,\ \forall j \ne i \qquad (20)$$

$$\sum_{(k,i) \in K,\, i \ne j} x_{ki}\, h_{kij} \le r_j \quad \forall j \qquad (21)$$

$$\sum_{(k,i) \in K} \sum_{j=1,\, j \ne i}^{n} w_{kij}\, x_{ki}\, h_{kij} \le u \qquad (22)$$

$$0 \le h_{kij} \le 1 \quad \forall (k,i) \in K,\ \forall j. \qquad (23)$$

Equation 18 maximizes the expected number of at-risk individuals assigned to new locations. The constraints in Equation 19 are equivalent to those in Equation 14, ensuring that the probabilities of all possible assignments for each at-risk individual sum to one. The constraints in Equation 20 specify that at-risk individuals must either stay at their original locations or be assigned to candidate locations that cover them. The constraints in Equation 21 are equivalent to those in Equation 16 in the MEM, specifying that the expected number of individuals within coverage assigned to a candidate location does not exceed its capacity. The constraint in Equation 22 ensures that the total penalized error introduced stays below the budget. The constraints in Equation 23 specify the range of the decision variables. The transition probabilities in Figure 4 determined by the MPM are presented in Appendix C.
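A corresponding sketch of the MPM, again under the reconstructed formulation and with hypothetical names; compared with `solve_mem`, the stay probability $h_{kii}$ becomes a legal choice, the objective flips to maximizing the expected number of individuals moved, and the total error enters as a budget constraint.

```python
import gurobipy as gp
from gurobipy import GRB

def solve_mpm(X, w, t, r, k_risk, u):
    """Maximum protection model (Equations 18-23), a sketch;
    u is the error budget."""
    m, n = len(X), len(X[0])
    K = [(k, i) for k in range(m) for i in range(n)
         if 0 < X[k][i] <= k_risk]

    model = gp.Model("MPM")
    h = model.addVars([(k, i, j) for (k, i) in K for j in range(n)],
                      lb=0.0, ub=1.0, name="h")
    # (18) maximize the expected number of at-risk individuals moved.
    model.setObjective(gp.quicksum(X[k][i] * h[k, i, j] for (k, i) in K
                                   for j in range(n) if j != i),
                       GRB.MAXIMIZE)
    # (19) probabilities for each at-risk individual sum to one.
    model.addConstrs(gp.quicksum(h[k, i, j] for j in range(n)) == 1
                     for (k, i) in K)
    # (20) move only to covering locations; staying is always allowed.
    model.addConstrs(h[k, i, j] <= t[k][i][j]
                     for (k, i) in K for j in range(n) if j != i)
    # (21) capacity on expected arrivals from other locations.
    model.addConstrs(gp.quicksum(X[k][i] * h[k, i, j]
                                 for (k, i) in K if i != j) <= r[j]
                     for j in range(n))
    # (22) total penalized error stays within the budget u.
    model.addConstr(gp.quicksum(w[k][i][j] * X[k][i] * h[k, i, j]
                                for (k, i) in K
                                for j in range(n) if j != i) <= u)
    model.optimize()
    return {key: var.X for key, var in h.items()}
```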

Computational Experiments
To test the proposed computational framework in practical applications, individual-level data sets with varying population sizes and geographic units are required. We selected Franklin County and Guernsey County in Ohio as the study areas for our experiments. Franklin County is the most populous county in Ohio, with a total of 284 census tracts and a population of 1,163,414 as of the 2010 U.S. Census. Guernsey County, in comparison, is one of the least populous counties in Ohio. As of the 2010 U.S. Census, Guernsey has a population of 40,087 in its ten census tracts.
The U.S. Census Bureau collects individual-level data that consist of demographic attributes and geographic locations of all residents in Franklin and Guernsey Counties. Such data, however, are not accessible to the general public due to confidentiality constraints. Therefore, synthetic individual-level data for both counties are employed in this study (Lin and Xiao 2022). The synthetic data are generated based on the 2010 U.S. Census Summary File 1 (SF1), an enumeration of all U.S. residents. The data include all 1,163,414 individuals in the 284 census tracts of Franklin County, as well as all 40,087 individuals in the ten census tracts of Guernsey County. Each individual has three attributes: voting age (V), ethnicity (E), and race (R). A breakdown of these attributes is shown in Appendix D. To examine the validity of the synthetic data, we compare them to the original SF1 data and a data set known as the American Community Survey Public Use Microdata Sample (ACS PUMS) that comprises a 5 percent random sample of the true individual-level data. Our results show that the synthetic individual-level data have a correlation coefficient of one with the 2010 U.S. Census data and a correlation coefficient of 0.99 with the ACS PUMS data. These findings imply that the synthetic data well represent the population for both counties.
For each county, the synthetic data are converted into an individual-level data matrix (named VER) using twenty-eight feasible predicates (m = 28; more details in Appendix D, Table D.1). VER uses the census tracts as geographic units (n = 284 for Franklin County and n = 10 for Guernsey County) and all three attributes (voting age, ethnicity, and race) in the synthetic data. The number of elements in the individual-level data matrix (N) is the product of m and n, which equals 7,952 for Franklin County. This individual-level matrix can be queried to generate multiple aggregated data matrices, and we select two: an aggregated data matrix involving only ethnicity and race (named ER) and one involving only race (named R). ER and R are generated using the counting queries in Appendix D, Tables D.2 and D.3, respectively.
We compute the worst-case global disclosure risks ($s^*$) and population uniqueness rates ($\phi^*$) for the individual-level (VER) and aggregated (ER and R) data; the results are shown in Table 1. For Franklin County, VER has an $s^*$ value of 0.167. This means that if the original individual-level data can be reconstructed, given that Franklin County has a population of over 1 million, approximately 200,000 individuals are at risk of being reidentified. In addition, ER has an $s^*$ value of 0.157, which suggests that more than 100,000 individuals could be reidentified through direct disclosure even after the original data are aggregated. The value of $\phi^*$ for VER is 0.086. Because VER contains N = 7,952 elements, roughly 700 of them are one-elements, and the individuals who contribute to these one-elements can be uniquely reidentified if the original individual-level data are reconstructed. For Guernsey County, because of its small population in each geographic unit, the values of $s^*$ and $\phi^*$ are higher than those for Franklin County, indicating a higher possibility of reidentification through direct or reconstruction-abetted disclosure. These results suggest that if the original data are released without protection, individual privacy will be jeopardized, whether the data are aggregated (ER or R) or not (VER). The spatial optimization models are therefore used to reduce the risks of disclosure for the synthetic data.

Impacts of Input Parameters
The spatial optimization models require several input parameters to be specified by users, including the capacity $r_j$, the penalty value used in $w_{kij}$, the coverage $t_{kij}$, and the integer k defining k-elements. In addition, the error budget u is a user-specified parameter for the MPM. We examine how these parameters affect the outcomes of the models in terms of privacy protection and data utility. In our experiments, the capacity is fixed at ten for all candidate locations, and the penalty value for $w_{kij}$ is set to fifty. The coverage $t_{kij}$ is determined using the two principles, free assignment and spatial proximity, where a k-NN approach with k = 9 is used for the spatial proximity principle. Table 2 summarizes the parameters and their values for the computational experiments. The experiments were carried out on a computer with an Intel Core i7-6500U processor (2.5 GHz) and 8 GB RAM. The mathematical optimization solver Gurobi (Gurobi Optimization, LLC 2021) was used to find optimal solutions to the two spatial optimization models given different values of the parameters. For our test setups, on average, the MEM can be solved in forty seconds of processing time and the MPM in ninety seconds.

Figure 6 shows the global disclosure risks ($\tilde{s}$) of data transformed using the MEM. Raising the k value increases the number of individuals who are considered at risk, which reduces the risk of disclosure for each individual and leads to a decrease in $\tilde{s}$. In general, $\tilde{s}$ also decreases as the number of attributes decreases (e.g., from VER to ER to R). This is because as the number of attributes decreases, the value of each matrix element tends to increase, which reduces the probability of any individual being disclosed and thus reduces the global disclosure risk. In addition, Guernsey County has higher $\tilde{s}$ values than Franklin County because of its higher worst-case global disclosure risks ($s^*$) for the original data (Table 1). It is also observed that the choice between the two coverage principles, free assignment and spatial proximity, has no discernible effect on the resulting value of $\tilde{s}$. This is particularly the case for Guernsey County, which has a small number of geographic units (ten census tracts), so spatial proximity and free assignment tend to yield the same set of feasible locations for each individual. For both counties, the reductions in global disclosure risk exceed 50 percent compared to the worst case, indicating the effectiveness of the MEM. The other risk measure, the population uniqueness rate ($\tilde{\phi}$), is zero for all data transformed using the MEM because all individuals contributing to the original k-elements (including all one-elements) are assigned to different locations.

Figures 7 and 8 present the disclosure risks of data transformed using the MPM. As we increase the error budget (u), the number of at-risk individuals assigned to different locations generally increases, which reduces the values of both $\tilde{s}$ and $\tilde{\phi}$. We also find that the disclosure risks for Guernsey are generally higher than those for Franklin, consistent with the results in Figure 6. In addition, for Franklin County, both $\tilde{s}$ and $\tilde{\phi}$ decrease when the coverage principle changes from spatial proximity to free assignment. This is because when the free assignment principle is used, more candidate locations are considered feasible for each at-risk individual, making it likely that at-risk individuals are assigned to locations that incur less error.
Given a fixed error budget, changing from spatial proximity to free assignment therefore leads to an increase in the number of individuals assigned to different locations and thus to decreasing values of $\tilde{s}$ and $\tilde{\phi}$.

Table 3 presents the minimum penalized error required to assign all at-risk individuals (Equation 13). The minimum penalized error is derived from the MEM and equals the value of its objective function. For Guernsey County, the minimum penalized error does not differ between the two coverage principles due to its small number of geographic units. For Franklin County, the minimum penalized error under the free assignment principle is lower than that under the spatial proximity principle. This is because free assignment allows more feasible locations for each individual, making it likely that each individual is assigned to a location that incurs a smaller error. It is also noted that, for a fixed coverage principle, the minimum penalized error decreases as the k value decreases. This is because reducing the k value decreases the number of individuals considered at risk, which reduces the error introduced by assigning them.

Figure 9 shows the SME of data transformed using the MEM. For the majority of our test setups, the SME is below 0.25, which is considered satisfactory in many applications (Chang et al. 2007; Wang and Wang 2017; Gatabazi, Mba, and Pindza 2019). Specifically, the SME is generally small when using a low k value, consistent with our findings for the minimum penalized error. The SME for Guernsey County is generally higher than that for Franklin County because Guernsey County has fewer individuals in each geographic unit, resulting in larger error when assigning individuals to different locations. It is also found that the SME of the aggregated data R is the smallest among the three sets of attributes examined. This is because such data do not contain as many small population counts as the other data, so the absolute difference relative to the original data, as measured by the SME, is generally small.

Figure 10 presents the SME of data transformed using the MPM. When the error budget (u) is below the minimum penalized error required to assign all at-risk individuals (Table 3), the SME for the MPM is lower than that for the MEM; otherwise, the SME is likely to be higher than that for the MEM. We therefore recommend applying the MEM first, with its objective value (the minimum penalized error required to assign all at-risk individuals) serving as a guide to deciding which model to use and how to set the budget u if using the MPM. It is also observed that the SME follows the opposite trend to the disclosure risks in Figure 7: with a low k value and a small error budget u, the SME is generally small, but at the cost of high risks of disclosure.

Impacts of Coverage Principles on Spatial Autocorrelation
We now examine how the two coverage principles, free assignment and spatial proximity, affect the spatial pattern of the transformed data. One way of assessing this effect is to study changes in spatial autocorrelation, because spatial autocorrelation captures the relationships between a geographic variable and its surroundings (Getis 2010). Local indicators of spatial association (LISA; Anselin 1995) are used as the measure of spatial autocorrelation. Statistically significant LISA values can be classified into four patterns: clusters of high values (HH), clusters of low values (LL), outliers with high values surrounded primarily by low values (HL), and outliers with low values surrounded primarily by high values (LH). We use the percentage of the non-Hispanic American Indian and Alaska Native (AIAN) population in each census tract of Franklin County to illustrate how the four LISA patterns (HH, LL, HL, and LH) change between the original data and the two transformed data sets (one per coverage principle). Figure 11 shows the impact of the different coverage principles on changes in LISA patterns for the percentage of the non-Hispanic AIAN population. When spatial proximity is used as the coverage principle, a total of eight tracts with a significant LISA pattern become insignificant. When free assignment is used, however, ten tracts change from significant to insignificant, and eight change from insignificant to significant or change the type of pattern. Other attributes not examined here demonstrate similar trends. These findings suggest that the spatial proximity principle better preserves the spatial autocorrelation of the data after individuals are assigned to new locations.
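A sketch of the LISA classification using the PySAL stack is shown below; it assumes tract geometries and the attribute of interest are held in a GeoDataFrame, and the original analysis may differ in its choice of weights and significance settings.

```python
import libpysal
from esda.moran import Moran_Local

def lisa_patterns(gdf, column, alpha=0.05):
    """Classify each tract as HH, LH, LL, HL, or not significant (ns)
    for the given attribute (e.g., percentage non-Hispanic AIAN)."""
    w = libpysal.weights.Queen.from_dataframe(gdf)
    w.transform = "r"                        # row-standardized weights
    lisa = Moran_Local(gdf[column].values, w, permutations=999)
    labels = {1: "HH", 2: "LH", 3: "LL", 4: "HL"}
    # Report only patterns significant at the chosen level.
    return [labels[q] if p <= alpha else "ns"
            for q, p in zip(lisa.q, lisa.p_sim)]
```

Running this on the original and transformed attributes and counting label changes gives the kind of comparison summarized in Figure 11.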

Comparative Analysis with the Census TopDown Algorithm
We compare the results from the spatial optimization models with those from the TopDown algorithm used for disclosure avoidance in the 2020 U.S. Census. The TopDown algorithm uses a privacy definition known as concentrated differential privacy (Bun and Steinke 2016), a practical variant of differential privacy, to introduce statistical noise into census data so that altering a specific individual's record has no discernible effect on the aggregated census data. The National Historical Geographic Information System provides the 2010 Census Privacy-Protected Demonstration Data (PPDD), which are processed through the TopDown algorithm with production parameters for the 2020 Census Redistricting Data (Van Riper, Kugler, and Schroeder 2020). We use the PPDD in this subsection to support our comparison. We evaluate the privacy and utility of data transformed using the spatial optimization models and the TopDown algorithm. We use the population uniqueness rate ($\tilde{\phi}$) as the privacy measure, as in many other studies comparing disclosure avoidance methods (Almasi et al. 2016). The global disclosure risk ($\tilde{s}$) for the PPDD cannot be directly computed because the transition probabilities required to calculate $\tilde{s}$ (see Equation 4) are not available. We assess data utility using the error ($d'_{k'i}$; see Equation A.6 in Appendix A) introduced for each geographic unit i due to privacy protection. We use the census tract-level count of the non-Hispanic AIAN population (answers to counting query 5 in Table D.2; k' = 5) to illustrate the differences in privacy and utility between the two types of methods.
We examine the MEM with k = 1, which assigns all individuals counted in one-elements to different locations. The $\tilde{\phi}$ value for data transformed using this model is zero because all individuals counted in one-elements in the original data are assigned to different locations. The PPDD in Franklin County contains twenty-one one-elements, and the PPDD in Guernsey County contains none. Among the twenty-one one-elements for Franklin County, four are true one-elements, as determined by comparing the PPDD to the count of non-Hispanic AIANs in the 2010 U.S. Census SF1, which contains a 100 percent sample of the population. Therefore, with respect to non-Hispanic AIANs, the $\tilde{\phi}$ value for the PPDD in Franklin County is 4/21 = 0.19, and the $\tilde{\phi}$ value for the PPDD in Guernsey County is 0. These results suggest that, in terms of the population uniqueness of non-Hispanic AIANs, the MEM provides privacy protection that is better than or at least equivalent to that of the differential privacy-based TopDown algorithm. Other populations not presented in this subsection show similar results because the $\tilde{\phi}$ value associated with the MEM is always zero, whereas that associated with the TopDown algorithm can be above zero. Relatively large error ($d'_{k'i}$) is introduced to tracts with small numbers of non-Hispanic AIANs. For the TopDown algorithm, this is due to the spatially invariant noise injected to ensure differential privacy. For the MEM, this is because non-Hispanic AIANs in tracts with small counts tend to be considered at risk and are relocated to homogeneous areas (i.e., areas with a high concentration of non-Hispanic AIANs) to protect their privacy. The error introduced by the MEM, however, is generally lower than that introduced by TopDown in most census tracts of Franklin and Guernsey Counties, as illustrated by the maps in the middle of each row of Figure 12, and especially by Figures 12D and 12H, where the red symbols (error introduced by the MEM) tend to sit lower than the blue symbols (error introduced by TopDown). Therefore, compared to the differential privacy-based TopDown algorithm, the MEM improves privacy protection in terms of population uniqueness while providing better data utility.

Effects on Demographic Diversity
To achieve privacy protection, it is often necessary to relocate individuals, either directly or indirectly. Our methods, for instance, move individuals from their original geographic units to units where their identities are unlikely to be disclosed. Methods based on differential privacy produce the same outcome as relocating individuals by adding random noise that increases or decreases the population count in each geographic unit while maintaining the total population across all geographic units. Compared with other demographic groups, members of minority populations are more likely to be at risk of disclosure and to be relocated for privacy protection. One could therefore ask how such relocation affects the demographic diversity in the data. Here we investigate how the use of the spatial optimization models affects demographic diversity. We apply a score adapted from Shannon's entropy (Massey, White, and Phua 1996) to measure the racial and ethnic diversity in the original and transformed data (ER). Let q be the number of racial and ethnic groups in the data (in our case, q = 14; see Table D.2) and $W_r$ be the percentage of racial and ethnic group r (e.g., non-Hispanic AIANs) in a geographic unit. The entropy score of a unit can be calculated as

$$E = -\sum_{r=1}^{q} W_r \log(W_r).$$

The entropy score ranges from 0 to $\log(q)$. A value of $\log(q)$ indicates that all racial and ethnic groups are equally common, denoting the highest level of diversity. A value of 0 indicates that the entire population is dominated by one racial and ethnic group and all other groups have zero counts, denoting the lowest level of diversity. To compare the demographic diversity between the original and transformed ERs, we first calculate their entropy scores and then compute their relative difference by dividing the absolute difference of the entropy scores by their sum. The literature that employs the relative difference categorizes results into four types: excellent (<0.1), good (0.1-0.2), acceptable (0.2-0.5), and unacceptable (>0.5; Chang et al. 2007; Wang and Wang 2017; Gatabazi, Mba, and Pindza 2019). We use these categories to evaluate our results. When the relative difference falls in the excellent and good categories, the two entropy scores differ by no more than 20 percent of their sum, and we typically consider the impact on demographic diversity to be relatively small.

We use the MEM with k = 1 to obtain the transformed ER. As illustrated in Figure 13, relatively large differences in demographic diversity typically occur in areas where the original entropy scores are low. This is potentially because areas with low demographic diversity tend to have more individuals who are considered minorities at risk and thus need to be protected (relocated). Importantly, the majority of the tracts in Franklin County have relative differences lower than 0.1 and the rest fall between 0.1 and 0.2, and all tracts in Guernsey County are below 0.1. Therefore, according to the existing literature on relative differences, our methods do not have a significant impact on the distribution of racial and ethnic groups in each geographic unit. Because population data by ethnicity and race are not available from the census, we cannot directly compare the impacts on demographic diversity between our methods and methods based on differential privacy. Findings from the previous subsection, however, suggest that our methods introduce less error to minority populations (e.g., non-Hispanic AIANs) than methods based on differential privacy. Our methods hence should impose a smaller impact on demographic diversity than differential privacy-based methods.

Figure 13. Effects of relocating individuals on demographic diversity. (A) and (D) Entropy score calculated using the original ER in Franklin and Guernsey Counties, respectively. (B) and (E) Relative difference between entropy scores of the original and transformed ERs in Franklin and Guernsey Counties, respectively. (C) and (F) Relationship between the original entropy score and the relative difference in Franklin and Guernsey Counties, respectively.
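The entropy score and the relative difference can be sketched as follows; the convention that zero-count groups contribute nothing (0 log 0 = 0) is an assumption consistent with the stated range of the score.

```python
import numpy as np

def entropy_score(counts):
    """Shannon-entropy diversity score of one geographic unit, given
    counts for the q racial and ethnic groups."""
    counts = np.asarray(counts, dtype=float)
    W = counts / counts.sum()
    W = W[W > 0]                         # 0 * log(0) treated as 0
    return float(-(W * np.log(W)).sum())

def relative_difference(e_original, e_transformed):
    """Relative difference used to compare diversity before and after
    transformation: absolute difference divided by the sum."""
    return abs(e_original - e_transformed) / (e_original + e_transformed)
```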

Discussion and Conclusions
The past decade has seen tremendous progress in location-based technologies, leading to the production of large-scale geospatial data sets with individual location information. Many of these data sets are aggregated for public use, but the aggregated data continue to raise serious privacy concerns (Curtis et al. 2011). In this article we develop a computational framework to protect the privacy of geographically aggregated data while retaining data utility. Different from conventional disclosure avoidance methods such as data suppression, the proposed framework provides quantitative measures to evaluate both the disclosure risks and the utility of the data released. Making both risks and utility transparent is important for establishing trust among data contributors, producers, and users: it helps reduce the risks of disclosure subjectively perceived by individual data contributors and, at the same time, helps users understand how the data might deviate from what was collected due to privacy protection. The computational framework also includes spatial optimization models for controlling both the risks and the utility of geographically aggregated data, ensuring maximum protection or minimum error under different data utility requirements. This avoids introducing excessive error into the data for privacy protection, as occurs in other methods such as differential privacy. As shown in the experimental results, the disclosure risk of the data can be reduced by 82 percent of its original value while maintaining data utility (SME lower than 0.25). The models can also be used to maintain a higher level of utility by setting an error budget while maximizing protection. In a broader context, providing quantitative measures of risks and utility, together with spatial optimization models to control both, aids in communicating and addressing the trade-off between privacy and utility in geographic data releases, which promotes ethical geographic data gathering, use, and sharing.
Utility is a key concept in this article, but it is also ill defined, with varying interpretations across cultural, historical, and social contexts. We quantify utility by calculating the error introduced. Additional definitions of utility are possible, depending on the nature of the desired data. For example, in cases where spatial autocorrelation or demographic diversity is critical, it can be quantified and used as one of the utility metrics in the optimization models. It should be noted that the various definitions of data utility might conflict with each other and cannot all be maximized simultaneously. One example demonstrated in this article is that, although spatial proximity as a coverage principle helps preserve the spatial autocorrelation of the transformed data, it can increase the SME. In such cases, no single solution simultaneously optimizes both spatial autocorrelation and the SME. A potential future direction is to develop multiobjective optimization models that account for multiple conflicting objectives (Deb 2001; Xiao, Bennett, and Armstrong 2007). In this way, we could include all necessary definitions of data utility as objectives and find a set of solutions in which no objective can be improved without degrading another. These solutions could then all be considered and evaluated in data production.
The computational complexity of the spatial optimization models is determined by the size of the optimization problems, specifically the scales of the geographic units within and between which individuals are to be assigned. In this study, individuals are assigned within a county and between different tracts. If we fix the geographic scales within and between which individuals are to be assigned, the models can be generalized to different geographic areas (beyond Franklin and Guernsey Counties) using parallel computing without having to increase the size of the optimization problems. As we expand the extent from a county to a state, or assign individuals to different blocks rather than tracts, however, the problem size tends to increase, raising computational demands that the exact methods used in this study cannot meet efficiently. Heuristic methods, such as evolutionary algorithms (Xiao, Bennett, and Armstrong 2007; Tong, Murray, and Xiao 2009; Xiao and Murray 2019), are promising candidates for finding optimal or near-optimal solutions to large optimization problems and should be investigated in the future.
Disclosure avoidance has started to be applied to aggregated data in both the public and private sectors, but several problems still need to be resolved. For example, the U.S. Census Bureau employs methods ranging from data suppression to differential privacy to avoid disclosure, but data users are concerned that they are not fully informed about how these methods compromise data utility (Mervis 2019b). In addition, private companies such as Meta (formerly Facebook) are hesitant to implement disclosure avoidance because doing so will introduce error and affect subsequent data analytics (Mervis 2019a). This research provides a way to communicate utility through quantitative measures, as well as a modeling framework for disclosure avoidance that minimizes or controls the introduced error. These will help address concerns about how disclosure avoidance affects data utility and have the potential to promote the practical application of disclosure avoidance. There has been promising progress in the literature on ensuring differential privacy while achieving high data utility through improved anonymized data-collection schemes (Sei and Ohsuga 2017; Murakami and Kawamoto 2019), microaggregation (Soria-Comas et al. 2014), and relaxations of differential privacy (Soria-Comas et al. 2017). Methods have also been developed to quantify the risks of reidentification for differentially private data (Sei, Okumura, and Ohsuga 2022). A combination of our methods and these research outcomes has the potential to further improve data utility while maintaining high levels of privacy protection.