Imposing Minimax and Quantile Constraints on Optimal Matching in Observational Studies

ABSTRACT Modern methods construct a matched sample by minimizing the total cost of a flow in a network, finding a pairing of treated and control individuals that minimizes the sum of within-pair covariate distances subject to constraints that ensure distributions of covariates are balanced. In aggregate, these methods work well; however, they can exhibit a lack of interest in a small number of pairs with large covariate distances. Here, a new method is proposed for imposing a minimax constraint on a minimum total distance matching. Such a match minimizes the total within-pair distance subject to various constraints including the constraint that the maximum pair difference is as small as possible. In an example with 1391 matched pairs, this constraint eliminates dozens of pairs with moderately large differences in age, but otherwise exhibits the same excellent covariate balance found without this additional constraint. A minimax constraint eliminates edges in the network, and can improve the worst-case time bound for the performance of the minimum cost flow algorithm, that is, a better match from a practical perspective may take less time to construct. The technique adapts ideas for a different problem, the bottleneck assignment problem, whose sole objective is to minimize the maximum within-pair difference; however, here, that objective becomes a constraint on the minimum cost flow problem. The method generalizes. Rather than constrain the maximum distance, it can constrain an order statistic. Alternatively, the method can minimize the maximum difference in propensity scores, and subject to doing that, minimize the maximum robust Mahalanobis distance. An example from labor economics is used to illustrate. Supplementary materials for this article are available online.


Introduction: Avoiding Poor Matches When Optimizing the Typical Match

Background
In observational studies, matching is used to construct a control group that resembles the treated group in terms of measured covariates. Modern methods solve a discrete optimization problem subject to constraints, minimizing the sum of within-pair covariate distances subject to constraints that ensure covariate balance in marginal distributions; see Section 2 for an example. Minimizing a sum or mean of within-pair differences subject to constraints often produces good matched samples. Nonetheless, a mean over thousands of pairs does little to constrain the worst pairs, so minimizing the mean pair difference is consistent with having dozens of poorly matched pairs. The current article discusses a general algorithm for appending one or more minimax constraints, or constraints on high order statistics, thereby avoiding poorly matched pairs to the extent that this is feasible given the data available. The new general algorithm in Section 4 permits many useful variations on the theme of constraining the worst to be tolerable before optimizing the mean. For discussion of optimal matching in observational studies, see Rosenbaum (1989), Hansen and Klopfer (2006), Hansen (2007), Stuart (2010), Lu et al. (2011), Yang et al. (2012), Zubizarreta (2012), and Pimentel et al. (2015), and see the R packages optmatch, nbpMatching, and rcbalance at CRAN, finebalance (archived but working at CRAN), and mipmatch (available from Zubizarreta 2012). For applications of optimal matching, see Silber et al. (2013, 2014) and Neuman et al. (2014).

A Toy Illustration: Three Treated Subjects and Three Controls
Figure 1 is a toy example that will be used to discuss several issues. Figure 1 has three treated subjects, τ1, τ2, τ3, and three controls, γ1, γ2, γ3, and the task is to select one of the 3! = 6 possible pairings of treated and control subjects. There is a distance between each τt and each γc reflecting how similar τt and γc are with respect to measured covariates. A modern distance is the sum of a robust Mahalanobis distance plus a penalty function that prevents poor matches for a propensity score, and this distance is used in the empirical example in Section 2; see Rosenbaum (2010, chap. 8). However, in the toy example in Figure 1, the form of the distance is not a consideration. The dashed lines in Figure 1 are distances of 100, that is, very poor matches. The heavy, solid, horizontal lines are distances of 1. The light solid lines are marked with their distances of 2, 0, and 0.

Figure 1. Toy example with three treated subjects, τ1, τ2, τ3, and three controls, γ1, γ2, γ3, who may be matched in 3! = 6 possible ways. The lines or edges connecting treated and control subjects have costs, which measure how similar individuals are in terms of measured covariates. Dashed lines indicate a large distance of 100, heavy solid lines indicate a distance of 1, and light solid lines have costs indicated by their numeric labels of 2, 0, and 0.

A greedy or best-first or nearest-available matching algorithm would start with the smallest available distance, perhaps
forming the matched pair (τ3, γ2) for a cost of 0. That would be a fatal mistake. If a greedy algorithm played chess, it would capture a piece whenever it could, with no thought to what comes next, and it would play terrible chess. In Figure 1, starting with matched pair (τ3, γ2) at cost 0, the greedy algorithm would pair (τ1, γ1) for a cost of 1, and would be forced to accept (τ2, γ3) for a cost of 100, so the match has total distance 0 + 1 + 100 = 101 and maximum distance 100. If Figure 1 were changed by increasing the cost of the dashed lines, the greedy solution could be made to have both a total and a maximum distance that is arbitrarily large, that is, extremely poor, when compared with optimal matching.
In Figure 1, there are two pairings that minimize the total cost over the 3! = 6 possible pairings. One optimal solution pairs (τ1, γ3) for a cost of 2, (τ2, γ2) for a cost of 1, and (τ3, γ1) for a cost of 0, making a total distance of 2 + 1 + 0 = 3 and a maximum distance of 2. The other optimal solution pairs (τ1, γ1) for a cost of 1, (τ2, γ2) for a cost of 1, and (τ3, γ3) for a cost of 1, making a total distance of 1 + 1 + 1 = 3 and a maximum distance of 1. Typical matching algorithms minimize the total distance, perhaps with additional balance constraints, and these typical algorithms would be indifferent between the two optimal solutions just described, because they have the same optimum, that is, the user would simply receive one of the optimal solutions picked arbitrarily. In contrast, the most straightforward application of the algorithm in the current article would have picked the solution (τ1, γ1), (τ2, γ2), and (τ3, γ3), thereby having a maximum distance of 1. Figure 1 is a toy example with three pairs. In problems of realistic size, minimizing the total distance is consistent with having some pairs that have very large distances.
The proposed new algorithm combines elements of two problems: in its simplest form, it minimizes the maximum distance and, subject to the constraint that it does so, it minimizes the total distance. In Figure 1, a first step of the algorithm would discover that, among the six possible pairings, the pairing with the smallest maximum distance has a maximum distance of 1. In Figure 1, only one of the six possible pairings has a maximum distance of 1, namely, (τ1, γ1), (τ2, γ2), and (τ3, γ3), that is, there is only one pairing in Figure 1 that minimizes the maximum distance. Typically, in problems of realistic size, there will be a vast number of matchings that minimize the maximum distance, and the algorithm in the current article would pick the best of these, the one that minimizes the total distance subject to the constraint of minimizing the maximum distance.
The algorithm in the current article first discovers that it is possible to match everyone in Figure 1 with a distance of at most 1; then it removes all edges with a distance greater than 1, and minimizes the total distance in the sparser network that remains. In Figure 1, it was immediately obvious that we should avoid the dashed lines if possible, and the new algorithm discovers that it is possible and removes them very early in its activity, also removing the edge with a distance of 2. In a large problem, if one removed all edges with a distance greater than some number d, then one might pick the wrong value of d, either removing so many edges that no pairing exists, or else tolerating distances that are much larger than needed. The algorithm in the current article finds the smallest d that permits a match to be constructed. In general, the algorithm in the current article may increase the total distance by constraining the maximum, but the increase is zero in Figure 1 and small in the example in Section 2; moreover, the user may construct both matches and compare them, as in Figure 1 and the example in Section 2.
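The toy example is small enough that every pairing can be enumerated directly. The following Python sketch reconstructs the distance matrix described above (the assignment of the labels 2, 0, 0 to particular light edges follows the pairings listed in the text) and verifies that two pairings attain the minimum total distance of 3, only one of which also attains the minimax distance of 1.

```python
from itertools import permutations

# Distance matrix from Figure 1: rows are tau_1..tau_3, columns gamma_1..gamma_3.
# Dashed edges have distance 100, heavy solid edges have distance 1, and the
# light edges carry the labels 2, 0, 0 as described in the text.
D = [
    [1,   100, 2],    # tau_1: 1 to gamma_1, dashed to gamma_2, 2 to gamma_3
    [100, 1,   100],  # tau_2: only gamma_2 is close
    [0,   0,   1],    # tau_3: 0 to gamma_1 and gamma_2, 1 to gamma_3
]

pairings = []
for perm in permutations(range(3)):          # all 3! = 6 pairings
    dists = [D[t][perm[t]] for t in range(3)]
    pairings.append((sum(dists), max(dists), perm))

best_total = min(p[0] for p in pairings)     # minimum total distance
# Among total-distance-optimal pairings, pick the one minimizing the maximum.
minimax = min((p for p in pairings if p[0] == best_total), key=lambda p: p[1])
print(best_total, minimax)   # → 3 (3, 1, (0, 1, 2))
```

The pairing (0, 1, 2) is the identity pairing (τ1, γ1), (τ2, γ2), (τ3, γ3), agreeing with the discussion above.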
In Figure 1, the minimax constraint and the total cost refer to the same distance; however, this is not essential. For instance, the minimax constraint might refer to a distance that involves a few key covariates or scores derived from them. Subject to the minimax constraint, one minimizes a total distance that may include many more covariates.
It can happen that even with a minimax constraint, the optimal match must tolerate a few large distances, so the minimax constraint may be insufficiently constraining on its own. In this case, it may be useful to eliminate edges larger than the minimax cut, then additionally impose another constraint on a high quantile, say the 99% quantile; see Section 4.5.
Investigators sometimes match "with replacement," allowing matched sets to overlap, and Figure 1 illustrates how inefficient that can be. For illustration, suppose that, unknown to the investigator, the matching variables are irrelevant, responses in the treated group are N(θ, 1) and in the control group are N(0, 1), and all responses are independent. Then, for all 3! = 6 pair matchings, the treated-minus-control difference in means is Normal with expectation θ and variance 1/3 + 1/3 = 2/3. Matching-with-replacement to all nearest controls in Figure 1 would: (i) form matched sets (τ1, γ1), (τ2, γ2), and (τ3, γ1, γ2), discarding γ3; (ii) compute the average over the three sets of the treated-minus-average-control response in each set; (iii) yield an estimate that is Normal with expectation θ and variance 1/3 + 1/2 = 5/6, which is (5/6)/(2/3) = 1.25 times the variance for pair matching, that is, inflated by 25%. The nine distances in Figure 1 have an average value of 305/9 = 33.89. The greedy match above has an average within matched-pair distance of 101/3 = 33.67, or 99.3% of the overall average. The optimal pair match has an average within-pair distance of 3/3 = 1, or 3.0% of the overall average. Matching-with-replacement has a weighted average within-pair distance of (1 + 1 + 0)/3 = 2/3, or 2.0% of the overall average. In other words, matching-with-replacement reduced the within-pair distance from 3.0% of the average to 2.0% of the average and increased the variance of the estimator by 25%.
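The variance comparison above can be checked by writing each estimator as a linear combination of the six independent responses and summing the squared coefficients; a small sketch using Python's fractions module:

```python
from fractions import Fraction as F

# Each estimator is a linear combination of six independent responses with
# variance 1, so its variance is the sum of its squared coefficients.
def variance(coefs):
    return sum(c * c for c in coefs)

# Pair matching: mean of 3 treated responses minus mean of 3 distinct controls.
pair_match = [F(1, 3)] * 3 + [F(-1, 3)] * 3

# Matching with replacement as in the text: sets (t1,g1), (t2,g2), (t3,g1,g2);
# gamma_1 and gamma_2 each receive total weight (1 + 1/2)/3 = 1/2, and gamma_3
# is discarded.
with_replacement = [F(1, 3)] * 3 + [F(-1, 2), F(-1, 2)]

v1, v2 = variance(pair_match), variance(with_replacement)
print(v1, v2, v2 / v1)   # → 2/3 5/6 5/4
```

The ratio 5/4 = 1.25 is the 25% variance inflation described in the text.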

Computational Advantages of Minimax Constraints
An optimal matching can be found using a minimum cost flow algorithm whose worst case time bound involves the number of edges in a network, as reviewed in Section 3.1. The minimax constraint reduces the number of edges, without discarding attractive edges. As shown by Pimentel et al. (2015), use of a sparse network, with fewer edges, makes possible much more effective use of fine balance constraints.
A recent innovation in the algorithmic theory of matching solves the optimal matching problem in a sequence of stages, by peeling away the least attractive edges, thereby simplifying the problem; see Kao et al. (2001). The algorithm in the current article is related in spirit: it too peels away the least attractive edges. However, where Kao et al. solve the original optimal matching problem, the algorithm in the current article changes the objective: it minimizes the total distance subject to the constraint that the maximum distance is small.

Statistical Advantages: Matching and the Investigation of Unmeasured Biases
Matching adjusts for measured covariates, but the central problem in observational studies is potential bias from unmeasured covariates. The homogeneity of matched pairs affects, in three ways, the investigation of potential biases from unmeasured covariates.
Reducing heterogeneity of responses reduces sensitivity to unmeasured biases; see Rosenbaum (2005). For instance, let Y_i be the treated-minus-control difference in outcomes in the ith matched pair. If a closer match on key covariates reduced the standard deviation of a Gaussian Y_i by 20% without altering its expectation, the closer match would make the study insensitive to substantially larger unmeasured biases; more precisely, it would increase the design sensitivity. This is illustrated in an observational study in education by Zubizarreta, Paredes, and Rosenbaum (2014), where matching balanced many covariates, but pairs were made homogeneous in the variables most predictive of Y_i.
Effect modification refers to an interaction between the magnitude of a treatment effect and a measured covariate. To discover effect modification is to discover that an average treatment effect combines a larger and a smaller effect for different subpopulations, and the larger effect is likely to be insensitive to larger unmeasured biases; see Hsu, Small, and Rosenbaum (2013) and Hsu et al. (2015). In looking for effect modifying covariates, it is helpful to have individual pairs that are closely matched for these covariates, so main effects are not confused with interactions.
The most direct way to address unmeasured biases is to measure them. This is sometimes possible by examining a small number of individuals who are closely matched for observed covariates. For instance, electronic medical records might be supplemented by reading the far more detailed paper hospital chart for a dozen pairs, as was done by Rosenbaum and Silber (2001) in a study of mortality after surgery; see also Weller and Barnes (2014, sec. 6). To be useful as a check for unmeasured biases, the small number of examined pairs need to be very closely matched for important observed covariates; see Section 5.1.

Outline: A Practical Illustration; a New Algorithm; Practical Extensions
The article is organized as follows. Section 2 uses an example from labor economics to illustrate the practical issues without reference to network optimization. A conventional match is presented in Section 2.1: it is quite good in aggregate, but contains some poorly matched pairs. This conventional match is improved by adding one minimax constraint in Section 2.2, that is, a mean within-pair distance is minimized among all matched samples that minimize a maximum absolute pair difference. The idea of the algorithm in Section 4 is sketched informally in Section 2.3, explaining its relationship to threshold algorithms for the bottleneck assignment problem. Sections 2.4-2.6 discuss three variations on the general theme. For simplicity, the example in Section 2.1 constructs matched pairs, but the discussion in later sections is general, and applies with multiple controls matched to each treated subject, a variable number of controls, the full matching of Rosenbaum (1991) and Hansen and Klopfer (2006), and so on; see Section 3.2 for details. Background material and notation about network optimization and statistical matching are reviewed in Section 3. The new algorithm is discussed in Section 4. The algorithm includes an optional, slightly abstract feature that can be used to form a subset of matched pairs that are matched very closely, and Section 5 discusses practical applications of this feature.

A Good Conventional Match With Dozens of Poorly Matched Pairs
In 1989, for older workers with low incomes and long job tenure, Austria substantially lengthened the duration of unemployment benefits and increased the corresponding unemployment payment. In an interesting study, Lalive, Van Ours, and Zweimüller (2006) asked whether these changes in incentives caused workers who lost their jobs to stay out of work longer. In particular, they looked at workers who met the age, income, and tenure criteria for these twice-expanded benefits, comparing workers who lost their jobs before the change in benefits (controls) and workers who lost their jobs after the change (treated), where only the treated group received the newly increased benefits. Lalive, Van Ours, and Zweimüller (2006) sensibly did a number of related comparisons, but the following discussion will use just one matched comparison as an illustration of certain issues in multivariate matching. In particular, Lalive, Van Ours, and Zweimüller (2006) also looked at other groups of workers for whom the benefits did not change, or changed in more limited ways, and these comparisons would require a few additional matches. The data are available at the web-page for Cahuc, Carcillo, and Zylberberg (2014) and also in the aamatch package that serves as a software supplement to the current article and recreates certain examples. There were 1391 men in the treated group (i.e., after the benefit change) who were out of work but not temporarily laid off, and each one was matched to one of the 3193 possible control men (i.e., before the benefit change). The matching controlled for age, wage in the prior job, an indicator of at least 3 years of work in the past 5 years, whether the job was an apprenticeship, married or not, divorced or not, education in three levels, and whether or not the previous job was a blue collar job, a seasonal job (construction or tourism), or a manufacturing job.
(That is, in terms of the variable names used by Lalive et al., the comparison is of type == "PBD and RR" and type == "control" for men, sfrau==0, who were not temporarily laid off, ein_zus==0. The match controlled for age, nwage_pj, e3_5, lehre, married, divorced, bc, seasonal, manuf, hi_educ, med_educ.) The match, called "match 1," used several common devices; see Rosenbaum (2010, Part II). The match was the solution to an optimization problem subject to two constraints. First, the match was constrained to be exact for education and seasonal employment, meaning that a seasonal worker with a primary education was always matched to a seasonal worker with a primary education; see match 1 in Table 1. Second, the match was constrained to exhibit near-fine balance for the 16-level nominal variable in Table 2. A nominal covariate exhibits fine balance if it has the same marginal distribution in treated and control groups, ignoring who is matched to whom; see Rosenbaum (1989, sec. 3.2; 2010) and Zubizarreta (2012). A variable that is exactly matched, like "seasonal" and "education" in Table 1, is necessarily finely balanced, but a variable can be finely balanced without being exactly matched, because fine balance constrains marginal distributions, not who is paired with whom. Fine balance is not quite achievable with these data: in level 7 of Table 2, there are only 15 controls among all 3193 controls, whereas the treated group has 17 men in level 7. Match 1 exhibits near-fine balance in the sense that it comes as close as possible to fine balance: it uses all 15 controls from level 7, and has one extra control in each of levels 4 and 11; see Yang et al. (2012) for detailed discussion of matching with near-fine balance. Subject to these two constraints, match 1 minimized the sum of the 1391 distances within the 1391 matched pairs.
The distances were the sum of two parts: a caliper on an estimated propensity score implemented by a penalty function plus a robust Mahalanobis distance based on all of the covariates in Table 3. For the penalty function and robust Mahalanobis distance, see Rosenbaum (2010, chap. 8).
There are a large number of possible pair matchings, (3193 choose 1391) × 1391! to be precise, but match 1 is the best of these in the specific sense of satisfying the two constraints and minimizing the total of the 1391 within-pair distances. Match 1 was obtained by finding the minimum cost flow in a certain network; see Section 3. Constraints are distinguished from costs: the constraints are expressed as part of the node-edge structure of the network, while the costs are the within-pair penalized distances described above. In many respects and by several conventional standards, match 1 was quite good; see "match 1" in Tables 1-3. In Table 3, prior to matching, the full group of all controls had fewer workers with secondary or tertiary education, more blue collar workers, more seasonal workers, more controls in level 9 of Table 2, and other differences, but match 1 has eliminated most of these differences and many others. After matching, the mean treated-minus-control pair difference in age was 0.23 years, and the lower and upper quartiles of the differences in age were −0.39 and 0.63 years.

Table 2. Distribution of four binary covariates (BC = prior blue collar job, M = prior manufacturing job, 3/5 = worked ≥ 3 of the last 5 years, A = apprenticeship) in the treated group, two matched control groups (M-1 or M-2), and all controls. Both match M-1 and M-2 came as close as possible to balancing these distributions, the counts being off by 2 in total for both M-1 and M-2, but unlike Table 1 these matches did not pair people with identical values of these covariates.
At the same time, a handful of the 1391 pairs were poorly matched for age; see Figure 2. The group of men in this comparison was defined by eligibility for certain benefits, and in particular ages in this group ranged from 40 to 55. Two of the 1391 pairs had an absolute difference in age of slightly more than 10 years, two others differed by more than 8 years, and so on. This occurs because the match minimized a total distance over 1391 pairs, and that minimum occurred with a few comparatively large differences in age. Could one make the maximum difference in age smaller, perhaps much smaller, without ruining the match? If so, how much smaller?

Adding a Minimax Constraint to Prevent Poorly Matched Pairs
Each of the (3193 choose 1391) × 1391! possible matches has a maximum absolute difference in age, so that, in principle, we could compute every match and look at its maximum absolute difference in age. Some of these matches also satisfy the exact and near-fine balance constraints. In principle, if computing time were not an issue, one could: (i) examine each of the (3193 choose 1391) × 1391! possible matches, retaining those that satisfy the constraints; (ii) among the retained matches, determine the smallest possible maximum absolute difference in age; (iii) from the subset of matches that both satisfy the constraints and exhibit the smallest possible maximum difference in age, pick the best one, that is, a match that minimizes the total of all of the within-pair penalized robust Mahalanobis distances. Of course, steps (i)-(iii) are not computationally feasible if implemented by the exhaustive search just described; however, the algorithm in Section 4 finds the same matched sample in polynomial time. Indeed, if implemented carefully, the algorithm in Section 4 can be faster than the minimum cost flow algorithm without a minimax constraint, because the constraint eliminates many edges in the network, a key aspect of the worst case time bound for many algorithms.

Figure 2. Differences in age within matched pairs. Every difference in match 2 is inside the dashed lines, and there exists no feasible match that makes all of the differences fit within a narrower band centered at 0. That is, match 2 satisfies the minimax constraint: it minimizes the maximum absolute difference in age achievable by any match that exactly matches "education" and "seasonal" in Table 1 and exhibits near-fine balance for the 16 categories in Table 2.
It turns out that there exists a match that respects the constraints in Tables 1 and 2 with a maximum absolute difference in age within pairs that is less than 4.261 years, but there is no match among the (3193 choose 1391) × 1391! possible matches that satisfies the constraints and has a maximum absolute difference in age strictly less than 4.26 years. Showing this takes a tolerable amount of computation. Once one knows this, an additional constraint is imposed on the match, demanding that the maximum absolute difference in age be less than 4.261 years, and this produces match 2 in Tables 1-3. Using the same definition of distance as match 1, match 2 minimizes the total of the 1391 within-pair distances subject to the two constraints required of match 1, namely, the constraints in Tables 1 and 2, plus one additional constraint, namely, that it minimizes the maximum absolute difference in age. Match 2 looks similar to match 1, exhibiting an exact match for "education" and "seasonal" in Table 1, a minimal deviation from fine balance of the same size in Table 2, albeit involving levels 11 and 12 rather than levels 4 and 11, and similar covariate balance in Table 3. However, as seen in Figure 2, the maximum absolute difference in age is 4.261 years in match 2, the smallest possible maximum absolute difference. Match 1 and match 2 each include 1391 controls, but they differ in 43 controls and they pair individuals differently. Indeed, match 1 has 62 pairs with an absolute difference in age of more than 4.261 years, but match 2 has no such pairs.

Relationship Between Minimax Constraints and the Bottleneck Assignment Problem
Match 2 is the simplest example of imposing a minimax constraint on a matching algorithm that minimizes a total within-pair distance subject to additional constraints. The new algorithm in Section 4 is built from existing tools used to solve the bottleneck assignment problem, a matching problem with a single objective, namely, finding a match to minimize the maximum cost or distance; see, for instance, Garfinkel (1971) and Burkard, Dell'Amico, and Martello (2009). The bottleneck assignment problem is relevant to some problems in operations research where the cost of a subtask is the time to complete that subtask, and the whole task is complete when all subtasks are complete: then minimizing the maximum time to complete the subtasks also minimizes the time to complete the whole task.
In statistical matching problems, minimizing the maximum distance is not a reasonable sole objective, because we want the typical distance to be very small, but minimizing the maximum distance is a reasonable additional constraint, for example, in avoiding the pointlessly large differences in age in a few dozen pairs in match 1. Garfinkel (1971) solved the unconstrained bottleneck assignment problem with a so-called threshold algorithm: (a) one picks a threshold ϑ and tries to match so that all absolute age differences are at most ϑ; (b) if (a) is possible, one decreases ϑ and repeats (a); if (a) is impossible, one increases ϑ and repeats (a). A binary search quickly narrows the possible values of ϑ. Subproblem (a) is a simpler matching problem with a better computational complexity than the general minimum total cost problem, so several iterations of (a) are often not very costly. When used to set a constraint in the minimum total cost algorithm in Section 4, knowledge of Garfinkel's threshold ϑ permits many edges to be removed from the network, namely those with age differences larger than ϑ, and this improves the worst case time bound for the revised minimum cost flow problem. In other words, if we only consider candidate pairs that have an absolute age difference less than 4.261 years, then we drastically reduce the number of candidate pairings that need to be considered, and a threshold algorithm has found 4.261 years to be the minimum possible value.
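As an illustration of the threshold idea, the following Python sketch performs a binary search over the distinct distances, checking feasibility at each candidate threshold with a simple augmenting-path test for a perfect matching. The function names are hypothetical, the feasibility check is a generic stand-in for the faster specialized checks used in practice, and this is not the implementation in the aamatch package.

```python
def perfect_matching_exists(n_treated, n_control, allowed):
    """Check, via augmenting paths (Kuhn's algorithm), whether every treated
    unit can be paired with a distinct control using only allowed edges."""
    match = [-1] * n_control          # match[c] = treated unit paired to control c

    def augment(t, seen):
        for c in allowed[t]:
            if c not in seen:
                seen.add(c)
                if match[c] == -1 or augment(match[c], seen):
                    match[c] = t
                    return True
        return False

    return all(augment(t, set()) for t in range(n_treated))

def bottleneck_threshold(dist):
    """Smallest threshold such that a pair match exists whose maximum
    within-pair distance does not exceed it (binary search over the sorted
    distinct distances). Assumes a full match is feasible, i.e., at least
    as many controls as treated units."""
    n_t, n_c = len(dist), len(dist[0])
    values = sorted({d for row in dist for d in row})
    lo, hi = 0, len(values) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        allowed = [[c for c in range(n_c) if dist[t][c] <= values[mid]]
                   for t in range(n_t)]
        if perfect_matching_exists(n_t, n_c, allowed):
            hi = mid                   # feasible: try a smaller threshold
        else:
            lo = mid + 1               # infeasible: threshold must grow
    return values[lo]

# Toy distances from Figure 1: the bottleneck threshold is 1.
D = [[1, 100, 2], [100, 1, 100], [0, 0, 1]]
print(bottleneck_threshold(D))   # → 1
```

Once the threshold is known, all edges above it are deleted, and the minimum total distance problem is solved on the sparser network that remains, as described in Section 4.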

Constraining Several Order Statistics in Sequence, Not Just the Maximum
Match 3, not shown in the tables, also satisfied the constraints in Tables 1 and 2, but it imposed two constraints on the absolute difference in age: (i) it forced the largest absolute difference in age to be less than 4.261 years, and (ii) it forced the second largest absolute difference in age to be less than 3.072 years, the smallest possible value. That is, in match 3, the constraints in Tables 1 and 2 are satisfied, all of the absolute differences in age are less than 4.261 years, and only one difference is larger than 3.072 years; moreover, no match that satisfies the constraints yields smaller values for these two largest order statistics. These constraints are understood sequentially: 4.261 is the smallest maximum difference among matches that satisfy the constraints in Tables 1 and 2, whereas 3.072 is the smallest second largest difference among matches that both satisfy the constraints in Tables 1 and 2 and minimize the maximum absolute difference. Arguably, match 3 is not much better than match 2, because 3.072 years is not a big improvement over 4.261 years. Match 3 is mentioned solely to illustrate a feature of the algorithm in Section 4. One could constrain another order statistic, say the 95% quantile, in a similar manner.
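On a problem small enough to enumerate, constraining successive order statistics amounts to a lexicographic comparison of the sorted distance vectors: minimize the maximum first, then the second largest among maximum-minimizers, and so on. A brute-force Python sketch on the Figure 1 toy example (the algorithm of Section 4 achieves the same result in polynomial time):

```python
from itertools import permutations

# Distance matrix from Figure 1.
D = [[1, 100, 2], [100, 1, 100], [0, 0, 1]]

def sorted_dists(perm):
    # Within-pair distances, largest first: position 0 is the maximum,
    # position 1 the second largest order statistic, and so on.
    return tuple(sorted((D[t][perm[t]] for t in range(3)), reverse=True))

# The lexicographic minimum over these vectors minimizes the maximum first,
# then the second largest subject to that, and so on.
best = min(permutations(range(3)), key=sorted_dists)
print(best, sorted_dists(best))   # → (0, 1, 2) (1, 1, 1)
```

The winning pairing is again the identity pairing, whose sorted distance vector (1, 1, 1) is lexicographically smallest.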

Minimizing the Maximum Mahalanobis Distance or Propensity Score Difference
Although match 2 minimized the maximum absolute within-pair difference in a single covariate, here age, the minimax constraint need not refer to a single covariate. Match 4, described in Table 4, resembles matches 1-3 in being exact for "education" and "seasonal," and in exhibiting near-fine balance for the four variables in Table 2; however, match 4 imposes a constraint of minimizing the maximum of the robust Mahalanobis distances based on 10 covariates without penalties, before going on to minimize the sum of penalized distances. The frequencies for match 4 corresponding to Table 2 are identical to the frequencies for match 1, and the same is true for Table 1. The maximum distance in Table 4 is the smallest possible maximum. Additionally, the distribution of matched distances in Table 4 is much smaller than the distribution before matching. In parallel with Section 2.4, one can impose a sequence of minimax constraints with different objectives. For instance, replacing age by the propensity score, one could constrain the match to minimize the maximum absolute difference in propensity scores, then add the further constraint that, among matches that minimize the maximum absolute difference in propensity scores, the match must minimize the maximum robust Mahalanobis distance.

Minimax Constraints Within Exact Match Strata
Some of the restrictions in Table 1 are more restrictive than others. Consider the first row, workers not in seasonal employment with a primary education. Ignoring the fine balance constraints in Table 2, the workers in the first row can be made into a matched sample in (2024 choose 904) × 904! ways, so there are many choices of whom to match and how, thereby reducing within-pair robust Mahalanobis distances. In contrast, in the last row of Table 1, seasonal workers with tertiary education, there are only (5 choose 5) × 5! = 120 ways to pair the workers. Counting the number of matched samples that also respect the fine balance constraints in Table 2 is much more difficult because these constraints aggregate the exact match categories in Table 1. Not surprisingly, the largest robust Mahalanobis distance in match 4 in Table 4, namely 14.49, is between two seasonal workers with tertiary education, a small stratum with few choices. Perhaps we should impose a tighter constraint on the robust Mahalanobis distance in the first row of Table 1 than the constraint of 14.49 required for the last row of Table 1. We cannot do better than 14.49 overall, because we cannot do better than 14.49 in the last row of Table 1, but we could have different requirements for different strata.
Specifically, the method in Section 4 may be used to minimize the maximum robust Mahalanobis distance yielding, as in Section 2.5, 14.49 in the sixth stratum in Table 1. With that constraint in place, it can minimize the maximum robust Mahalanobis distance over the union of the five remaining strata. Optionally, this could continue with 4 strata, then 3, and so on. As in the case of the second largest order statistic in Section 2.4, once the maximum is tolerably small, the later steps in this process may not be worth pursuing.

Review: Minimum Cost Flow in a Network
A directed graph is a finite set of nodes, N = {ν_1, . . . , ν_N}, together with a finite set E of directed edges, where an edge e = (ν, ν′) ∈ E is an ordered pair of distinct nodes. A directed graph is drawn on paper by representing each node ν ∈ N by a point and each edge (ν, ν′) ∈ E by an arrow with tail at ν and head at ν′. A directed graph is acyclic if there are no cycles, that is, no sequences (ν, ν′), (ν′, ν″), . . . , (ν*, ν**) of edges in E such that ν = ν**. A network is an acyclic directed graph (N, E) such that each node ν ∈ N has an integer demand, d(ν) ∈ {. . . , −2, −1, 0, 1, 2, . . .}, and each edge e ∈ E has a nonnegative real cost, p(e) ≥ 0, and a nonnegative integer capacity, q(e) ∈ {0, 1, 2, . . .}. A flow f(·) is a function assigning a nonnegative integer f(e) to each edge e ∈ E, that is, f : E → {0, 1, 2, . . .}. A flow respects the capacity constraints if

0 ≤ f(e) ≤ q(e) for all e ∈ E.  (1)

A flow meets the demand d(ν) if the total flow into node ν from edges (ν′, ν) ∈ E minus the total flow out from ν on edges (ν, ν′) ∈ E equals d(ν) for every ν ∈ N.  (2)

A flow that satisfies both (1) and (2) is feasible. Let F be the set containing all the feasible flows f, noting that F is empty if no feasible flow exists. The cost p_f of a feasible flow is p_f = Σ_{e∈E} f(e) p(e). If F ≠ ∅, then a feasible flow f ∈ F has minimum cost if p_f ≤ p_f* for all feasible flows f* ∈ F. As discussed in Section 3.2, many statistical matching problems are solved as minimum cost flow problems in a network; however, other approaches are possible, as discussed by Zubizarreta (2012).

[Table: Quantiles of all 1391 × 3193 = 4,441,463 robust Mahalanobis distances and the 1391 such distances in Match 4 that (i) was exact for "education" and "seasonal," (ii) was near-fine balanced for "blue collar job," "manufacturing job," "worked  of the last  years," and "apprenticeship," (iii) minimized the maximum robust Mahalanobis distance, and (iv) subject to these constraints, minimized the sum of the within-pair robust Mahalanobis distances penalized by a propensity score caliper.]
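Conditions (1) and (2) can be checked directly once a network and a candidate flow are represented in code. The sketch below uses dictionaries and a deliberately tiny network (two treated nodes, two controls, and one sink absorbing the flow; node names and the simplified structure are invented for illustration and omit the fine balance nodes used in Section 2):

```python
def is_feasible(demand, edges, flow):
    """Check capacity constraint (1) and demand constraint (2)."""
    # (1): 0 <= f(e) <= q(e) for every edge e
    if any(not 0 <= flow.get(e, 0) <= edges[e]["cap"] for e in edges):
        return False
    # (2): inflow minus outflow equals d(v) at every node v
    for v, d in demand.items():
        net = (sum(flow.get(e, 0) for e in edges if e[1] == v)
               - sum(flow.get(e, 0) for e in edges if e[0] == v))
        if net != d:
            return False
    return True

def flow_cost(edges, flow):
    """p_f = sum of f(e) * p(e) over all edges e."""
    return sum(flow.get(e, 0) * edges[e]["cost"] for e in edges)

# Two treated (t1, t2), two controls (c1, c2), one sink absorbing the flow.
edges = {("t1", "c1"): {"cost": 1.0, "cap": 1},
         ("t1", "c2"): {"cost": 5.0, "cap": 1},
         ("t2", "c1"): {"cost": 4.0, "cap": 1},
         ("t2", "c2"): {"cost": 2.0, "cap": 1},
         ("c1", "sink"): {"cost": 0.0, "cap": 1},
         ("c2", "sink"): {"cost": 0.0, "cap": 1}}
demand = {"t1": -1, "t2": -1, "c1": 0, "c2": 0, "sink": 2}
flow = {("t1", "c1"): 1, ("t2", "c2"): 1, ("c1", "sink"): 1, ("c2", "sink"): 1}
print(is_feasible(demand, edges, flow), flow_cost(edges, flow))  # True 3.0
```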
The minimum cost flow problem has several algorithmic solutions with attractive worst-case polynomial time bounds. In practice, this means that large problems can be solved reasonably quickly. Some algorithms have worst-case time bounds that involve only the size of the network, |N| and |E|, whereas others additionally involve the costs p(·), capacities q(·), or demands d(·); see Bang-Jensen and Gutin (2009, Section 4.10) for discussion of several algorithms and their worst-case bounds. One good solution is made conveniently available inside R by loading the optmatch package: the Fortran code for Relax IV of Bertsekas and Tseng (1988). The calculations in the current article used Relax IV.
A flow vector, {f(e), e ∈ E}, is a vector of dimension |E|, and the minimum cost flow problem is a search for an optimal flow vector. The worst-case time bounds for good algorithms for the minimum cost flow problem invariably involve |E|, with a much better, much lower time guarantee if |E| is smaller. A single minimax constraint of the type in Section 2.2 can be implemented by removing high-cost edges (equivalently, large-distance pairings) from E, often making |E| much smaller. In other words, in Section 2.2, changing the problem to disallow large pair differences in age produces a new problem that can be solved more quickly. A sequence of minimax constraints of the type in Section 2.6 can be implemented as a sequence of removals of edges from E, so each step in this sequence has a lower |E| than the previous step. In practical terms, if we do not adhere to the single-minded pursuit of the smallest total within-pair distance, p_f = Σ_{e∈E} f(e) p(e), but rather pursue that objective with the added constraint that no single matched pair has an exceptionally large distance, then we are likely to receive the additional benefit of an improved worst-case time bound by virtue of making |E| smaller.
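In code, removing the high-cost edges is a one-line filter. Here h maps each treated-control edge to its pair difference; the edge names and the threshold 4.261 (the minimax value for age from Section 2.2) are used only for illustration:

```python
def impose_minimax(edges, h, h_min):
    """Drop every treated-control edge whose pair difference exceeds
    the minimax value h_min; all other edges are kept unchanged."""
    return {e: a for e, a in edges.items() if h.get(e, 0) <= h_min}

edges = {("t1", "c1"): 1, ("t1", "c2"): 1, ("t2", "c1"): 1, ("t2", "c2"): 1}
h = {("t1", "c1"): 0.3, ("t1", "c2"): 9.7, ("t2", "c1"): 5.2, ("t2", "c2"): 1.1}
reduced = impose_minimax(edges, h, h_min=4.261)
print(len(edges), len(reduced))  # 4 2
```

The reduced edge set is what improves the worst-case time bound: the minimum cost flow solver is run on `reduced` rather than `edges`.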

Networks Used in Multivariate Matching
The networks used for matching in observational studies have a particular structure. There is a set T = {τ_1, . . . , τ_T} of treated nodes, a set C = {γ_1, . . . , γ_C} of potential control nodes with C ≥ T, and a set A = {α_1, . . . , α_A} of additional nodes, where T, C, and A are disjoint, and N = T ∪ C ∪ A. In Section 2, T contains the T = 1391 treated individuals, C contains the C = 3193 potential controls, and A = {α_1, . . . , α_16, α_17} contains 16 additional nodes, α_1, . . . , α_16, for the 16 fine balance strata in Table 2, plus one more node, α_17, that controls the deviation from fine balance. If fine balance were achievable in this example, then node α_17 would not be needed. In Section 2, the additional nodes force the near-fine balance in Table 2 without pairing people with identical strata. By varying the structure of the additional nodes A, a variety of balancing or matching properties may be produced. See figures 1 and 2 in Rosenbaum (1989), figure 2 in Rosenbaum (1991), and figure 1 in Pimentel et al. (2015) for depictions of some of the many possible matching networks with additional nodes A. The new techniques described in the current article will work with any structure for A, so various structures are mentioned but no one structure is assumed.
There is an edge e = (τ_t, γ_c) ∈ E if τ_t can be matched to γ_c. For example, in Table 1, the 5/1391 treated men with prior seasonal employment and tertiary education can only be matched to the 5/3193 control men with prior seasonal employment and tertiary education, so each such treated man τ_t has only five edges in E. In contrast, in Table 1, the 406 treated men with seasonal employment and primary education each have 1063 edges in E, making 406 × 1063 = 431,578 in total for seasonal workers with primary education. The cost p(e) for e = (τ_t, γ_c) ∈ E is the covariate distance between τ_t and γ_c. In match 1 in Section 2, the covariate distance between τ_t and γ_c was the sum of the robust Mahalanobis distance between the covariates for τ_t and γ_c plus a penalty if τ_t and γ_c differed on the propensity score by more than the caliper allowed. For each treated-control edge e = (τ_t, γ_c) ∈ E, the capacity is q(e) = 1. In the example in Section 2, there is an edge, e = (γ_c, α_a) ∈ E, connecting each control γ_c to the one fine balance level, α_1, . . . , α_16, that contains γ_c; see Table 2. In the example in Section 2, but not in all applications, the cost p(e) for e = (γ_c, α_a) ∈ E is p(e) = 0 and the capacity is q(e) = 1. Also, in this example, each of these 16 fine balance levels, α_1, . . . , α_16, is connected by an edge to α_17 with cost p(α_a, α_17) = 0 and capacity q(α_a, α_17) = 2. The node α_17 will allow at most a treated-minus-control difference in frequencies of 2 in any of the 16 levels in Table 2.
In the example in Section 2, but not in all applications, the demands have the following structure. Every treated unit τ_t ∈ T has demand d(τ_t) = −1, meaning that it issues one unit of flow, and hence τ_t must be matched to some control γ_c. The demand for every control γ_c ∈ C is d(γ_c) = 0, meaning that γ_c passes on all the flow it receives from treated units τ_t. Because each control γ_c is the tail of just one edge e = (γ_c, α_a) ∈ E with capacity q(e) = 1, it follows that γ_c can receive only one unit of flow from treated units τ_t, so γ_c can be matched to only one treated unit. The demands d(α_a) for the 17 additional nodes are determined by examining two columns in Table 2, specifically the counts for treated subjects and all controls. As noted in Section 2, the number of controls is inadequate in level 7 in Table 2 by a count of 2, so we set d(α_7) = 15 and d(α_17) = 2, saying that all 15 available controls from level 7 are used and two replacements from other levels are used also, with their two units of flow absorbed by α_17. In the 15 other levels in Table 2, there are enough controls available, so d(α_a) is set to the count in the treated group for level a, for a = 1, . . . , 6, 8, . . . , 16; for example, d(α_1) = 10, d(α_2) = 6, and so on. In this network, any feasible flow matches exactly for education and seasonal employment and balances the 16 strata in Table 2 with a deviation of exactly 2. A feasible flow f of minimum cost p_f additionally minimizes the total of the 1391 penalized within-pair covariate distances; see Yang et al. (2012) for more about "near-fine balance." Small changes may be used to produce matched samples with other desired properties; however, in all cases, the examples show that the algorithms in Section 4 work with varied structures for the additional nodes A, not just the paired, near-fine balance structure in Section 2.
For instance, setting d(τ_t) = −2 would request matching of 2 controls to each treated individual, or 2 × 1391 = 2782 controls in total. No match with d(τ_t) = −2 is feasible in this example unless the constraints are relaxed: for instance, there are only five potential controls with seasonal employment and tertiary education in Table 1, and the deviation from fine balance would have to be larger than d(α_17) = 2 in Table 2. Minor adjustments permit a match with variable numbers of controls. Keele, Pimentel, and Yoon (2015) set the required number −d(τ_t) of controls for τ_t based on their "entire number," a quantity calculated from the propensity score value for individual τ_t. In the hierarchical structure used by Pimentel et al. (2015), some α_a ∈ A are connected to other α_a′ ∈ A, where (α_a, α_a′) ∈ E means that the stratum represented by α_a subdivides the coarser stratum represented by α_a′.

Each Possible Match Yields Different Order Statistics
Let I ⊆ T be an interesting subset of the treated subjects, typically all of them, I = T. See Section 5 for discussion of I ≠ T. For each edge e = (τ, γ) ∈ E with τ ∈ I and γ ∈ C, there is a nonnegative number h_e ≥ 0. In Section 2.2, h_e was the absolute difference in age between τ and γ, whereas in Section 2.5 it was a robust Mahalanobis distance. For any feasible flow f, let L be the number of edges e = (τ, γ) ∈ E with τ ∈ I and f{(τ, γ)} = 1; that is, there are L controls γ matched to the treated subjects τ ∈ I. For pair matching with I = T, there are L = T such edges, one for each matched pair; whereas, with I = T and matching with two controls, there are L = 2T such edges. For a given feasible flow, sort the L quantities h_(τ,γ) with τ ∈ I and f{(τ, γ)} = 1 into nondecreasing order, as h_f,1 ≤ h_f,2 ≤ · · · ≤ h_f,L, thereby yielding the order statistics for h in the match f. To emphasize, h_f,ℓ is different for each match f. First, consider the maximum, h_f,L. In Section 2.2, h_f,L is the maximum absolute difference in age among the T = L = 1391 matched pairs. In match 1, h_f,L was more than 10 years, but h_f,L achieved its minimum possible value of 4.261 for match 2.
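For a given match, the order statistics h_f,1 ≤ · · · ≤ h_f,L are obtained by collecting h over the matched edges with τ ∈ I and sorting. A minimal sketch with invented data, showing that two different matches of the same data yield different order statistics and, in particular, different maxima h_f,L:

```python
def match_order_statistics(flow, h, I):
    """Sort the h-values of matched treated-control edges with tau in I."""
    return sorted(h[e] for e, f in flow.items() if f == 1 and e[0] in I)

h = {("t1", "c1"): 0.3, ("t1", "c2"): 9.7,
     ("t2", "c1"): 5.2, ("t2", "c2"): 1.1}
I = {"t1", "t2"}
flow_a = {("t1", "c2"): 1, ("t2", "c1"): 1}   # one possible match
flow_b = {("t1", "c1"): 1, ("t2", "c2"): 1}   # a different match
print(match_order_statistics(flow_a, h, I))   # [5.2, 9.7]
print(match_order_statistics(flow_b, h, I))   # [0.3, 1.1]
# The maximum h_{f,L} is 9.7 for the first match but only 1.1 for the second.
```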
In general, if F ≠ ∅, define h_min,ℓ = min_{f∈F} h_f,ℓ, so h_min,ℓ is the smallest achievable value of the ℓth-order statistic; otherwise, if F = ∅, then define h_min,ℓ = ∞. In Section 2.2, h_min,L = 4.261 years. Define F_min,ℓ ⊆ F to be the subset of feasible flows that achieve the minimum value of h_f,ℓ, that is, F_min,ℓ = {f ∈ F : h_f,ℓ ≤ h_min,ℓ}. In Section 2.2, many feasible flows f minimize the maximum absolute difference in age, having h_f,L = h_min,L = 4.261 years, and match 2 is the best of these in the sense of minimizing the total p_f of the T = 1391 robust Mahalanobis covariate distances within matched pairs.
It is important to notice that if F ≠ ∅, then F_min,ℓ ≠ ∅. In words, whenever there is a feasible match without constraining the order statistic, there is also a feasible match that satisfies the constraint h_f,ℓ ≤ h_min,ℓ. The reason is that the constraining value, h_min,ℓ, was defined in such a way that it is achieved by at least one flow f ∈ F. An attraction of constraints on order statistics is that they never create infeasibility. In contrast, in Section 2.2, had one added the constraint that every pair differ in age by at most k years, then the problem would be feasible for some k, infeasible for other k, and it would not be obvious by inspecting the distribution of age which values of k yield feasibility given the other constraints in Tables 1 and 2. By solving the minimax problem, we know that the match is feasible for k ≥ 4.261 years and infeasible for k < 4.261 years.

Imposing a Minimax Constraint on a Minimum Total Distance Match
Problem A imposes a single minimax constraint on the minimum cost flow problem. Problem A seeks a feasible flow f such that the total cost p_f of this flow is minimal among all feasible flows f* with h_f*,L = h_min,L, that is, among all feasible f* that minimize the maximum h_(τ,γ) over matched pairs with τ ∈ I.
Problem A (Adding a minimax constraint): Let F be the set of feasible flows for some matching network with the structure (T ∪ C ∪ A, E) in Section 3.2, let h_min,L = min_{f∈F} h_f,L, and let F_min,L = {f ∈ F : h_f,L ≤ h_min,L}. Find an f ∈ F_min,L such that p_f ≤ p_f* for all f* ∈ F_min,L, or determine that the problem is infeasible in the sense that F = F_min,L = ∅.
Problem A was solved in Section 2.2 with h (τ,γ ) equal to the absolute difference in age.

Iterating Minimax Constraints
Problem A can be iterated to impose a sequence of minimax constraints, each new constraint further constraining the previous problem. This is done as follows. Suppose we have solved a problem with the form of Problem A. If this problem is infeasible, that is, if F = ∅, then we are done: additional constraints will not produce feasibility. Otherwise, if the problem is feasible with F ≠ ∅, then we have determined the number h_min,L. Define a new network (T ∪ C ∪ A, E*) where E* consists of all edges of E except those (τ, γ) ∈ E with τ ∈ I, γ ∈ C, and h_(τ,γ) > h_min,L. By the definition of h_min,L, the set, say F*, of feasible flows for this new network is not empty, because F ≠ ∅ and a flow f ∈ F achieved h_f,L = h_min,L. Continuing, if the original problem is feasible with F ≠ ∅, and if we have another measure h*_e ≥ 0, then we may solve a new problem with the form of Problem A but with network (T ∪ C ∪ A, E*) and constraining variable h*; this new problem is also feasible, and its solution f minimizes h*_f,L among flows that minimize h_f,L. Because |E*| ≤ |E|, iterates of Problem A have improved worst-case time bounds; see Section 3.1.
For example, one can minimize the maximum absolute difference h_e in propensity scores and, subject to that constraint, minimize the maximum robust Mahalanobis distance h*_e, as mentioned in Section 2.5. Alternatively, with exact match strata as in Section 2.6, one can minimize the maximum robust Mahalanobis distance h_e, discover that the maximum was achieved in, say, stratum 6, then set h*_e = 0 for treated τ and control γ subjects in stratum 6 and set h*_e = h_e otherwise. The next iterate will minimize the maximum robust Mahalanobis distance in all strata other than stratum 6 while forcing stratum 6 to continue to minimize its maximum.

Threshold Approach to Imposing a Minimax Constraint
Corollary 1 in Section 4.5 shows that Algorithm A below solves Problem A.
Step 3a resembles Garfinkel's (1971) threshold algorithm for the bottleneck assignment problem: it is a binary search to determine h_min,L, whereupon a minimum cost flow problem is solved in Step 4a with a reduced set of edges E. In Step 3a, the edge set E gradually shrinks to include fewer and fewer edges. In Step 4a, the edge set E has been reduced as much as possible. For example, in Section 2.2, had one started with a guess of H_mid = 5 years, then all edges with an absolute difference in age of more than 5 years would have been removed at the end of the first iteration of Step 3a.
The algorithm maintains the inequality H_low ≤ h_min,L ≤ H_high and terminates when H_low ≤ h_min,L ≤ H_high < H_low + δ. With the value of δ from Step 1a, there is only one value h_e in the interval [H_low, H_low + δ], so that value is h_min,L. If one uses a larger value of δ in Step 1a, then one is not strictly minimizing the maximum h_f,L in Step 4a, but no pair e = (τ, γ) will be produced by Step 4a in which h_e > h_min,L + δ. A larger value of δ means fewer iterates of Step 3a, and perhaps extreme precision in this step is of little practical importance. For example, in Section 2.2, taking δ = 1/2 would mean that the maximum absolute difference in age could exceed the smallest possible maximum by at most half a year.
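The threshold idea can be sketched in a few dozen lines. The code below is an illustrative toy, not the article's implementation: feasibility at each threshold is tested with a simple augmenting-path bipartite matching (standing in for a feasibility check of the full matching network), the binary search runs over the distinct values of h so no tolerance δ is needed, and Step 4a's minimum cost match is found by brute force rather than by a minimum cost flow solver. All names and data are invented, and a perfect matching is assumed to exist when all edges are allowed:

```python
from itertools import permutations

def perfect_matching_exists(treated, controls, allowed):
    """Augmenting-path test: can every treated subject be matched
    using only the edges in `allowed`?"""
    match = {}                                   # control -> treated

    def augment(t, seen):
        for c in controls:
            if (t, c) in allowed and c not in seen:
                seen.add(c)
                if c not in match or augment(match[c], seen):
                    match[c] = t
                    return True
        return False

    return all(augment(t, set()) for t in treated)

def algorithm_a(treated, controls, h, cost):
    """Sketch of Algorithm A: binary-search the smallest threshold H such
    that a perfect matching exists using only edges with h <= H (Step 3a),
    then minimize total cost over the reduced edge set (Step 4a)."""
    levels = sorted(set(h.values()))
    lo, hi = 0, len(levels) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        ok = {e for e in h if h[e] <= levels[mid]}
        if perfect_matching_exists(treated, controls, ok):
            hi = mid                             # feasible: tighten H_high
        else:
            lo = mid + 1                         # infeasible: raise H_low
    h_min = levels[lo]
    ok = {e for e in h if h[e] <= h_min}         # reduced edge set E*
    best = None                                  # Step 4a, brute force here
    for chosen in permutations(controls, len(treated)):
        pairs = tuple(zip(treated, chosen))
        if all(e in ok for e in pairs):
            total = sum(cost[e] for e in pairs)
            if best is None or total < best[0]:
                best = (total, pairs)
    return h_min, best

treated, controls = ["t1", "t2"], ["c1", "c2", "c3"]
h = {("t1", "c1"): 1, ("t1", "c2"): 6, ("t1", "c3"): 3,
     ("t2", "c1"): 2, ("t2", "c2"): 4, ("t2", "c3"): 8}
cost = h                                         # reuse h as the cost
print(algorithm_a(treated, controls, h, cost))
# (3, (5, (('t1', 'c3'), ('t2', 'c1'))))
```

Here the pairing (t1, c1), (t2, c2) has a smaller pair for t1 but a maximum h of 4; the minimax solution accepts h = 3 on both edges, illustrating how the constraint trades a slightly worse typical pair for a better worst pair.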

Threshold Approach to Imposing a Quantile Constraint
Minimizing a quantile other than the maximum can be approached in a similar way, but does not permit further elimination of edges from E. In practice, one might first eliminate all edges with costs greater than the minimax constraint, thereby obtaining a sparse network and minimizing the maximum distance, then impose an additional quantile constraint to further reduce the number of large distances.

Problem B (Adding a quantile constraint):
For a fixed ℓ ∈ {1, . . . , L}, let F be the set of feasible flows for some matching network with the structure in Section 3.2, let h_min,ℓ = min_{f∈F} h_f,ℓ, and let F_min,ℓ = {f ∈ F : h_f,ℓ ≤ h_min,ℓ}. Find an f ∈ F_min,ℓ such that p_f ≤ p_f* for all f* ∈ F_min,ℓ, or determine that the problem is infeasible in the sense that F = F_min,ℓ = ∅.
Algorithm B differs from Algorithm A in Steps 3b and 4b. In minimizing an order statistic below the maximum, ℓ < L, perhaps ℓ = L − 1 in Section 2.4 or ℓ = ⌈0.95L⌉, one tolerates at most L − ℓ + 1 matched pairs e = (τ, γ) with h_e > h_min,ℓ. This has two consequences in Step 3b. First, Step 3b finds h_min,ℓ by counting the number of matched pairs with h_e > h_min,ℓ, rather than determining whether there are any, as in Step 3a of Algorithm A. More importantly, Step 3b cannot remove edges with h_e > h_min,ℓ, because L − ℓ + 1 units of flow may pass along these edges, so |E| does not shrink. In parallel, in Step 4b, the match is forced to have at most L − ℓ + 1 pairs with h_e > h_min,ℓ using the large penalty β rather than by removing all edges with h_e > h_min,ℓ. A minimax constraint in Problem A forbidding h_e > h_min,L can eliminate edges with h_e > h_min,L, but a quantile constraint limiting the number of pairs with h_e > h_min,ℓ to at most L − ℓ + 1 cannot remove edges with h_e > h_min,ℓ. Combining minimax and quantile constraints removes edges and further reduces large distances.

Algorithm B: Minimizing a Specific Quantile
Steps 1b-3b parallel Steps 1a-3a of Algorithm A, with Step 3b modified as described above. Step 4b: In the network (T ∪ C ∪ A, E) in Section 3.2, retain the demands and capacities in Section 3.2, but revise the costs to p′(e) = p(e) + β if h_e > H_high and e = (τ, γ) ∈ E with τ ∈ I, and p′(e) = p(e) in all other cases. Return a minimum cost flow f in this network.
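Step 4b's cost revision is a simple transformation of the cost dictionary. A minimal sketch (β, the edge names, and the threshold are invented for illustration):

```python
def penalize_costs(cost, h, H_high, I, beta=10**6):
    """Step 4b cost revision: p'(e) = p(e) + beta when e = (tau, gamma)
    has tau in I and h_e > H_high; p'(e) = p(e) otherwise."""
    return {e: c + (beta if e[0] in I and h[e] > H_high else 0)
            for e, c in cost.items()}

cost = {("t1", "c1"): 2.0, ("t1", "c2"): 3.0}
h = {("t1", "c1"): 1.5, ("t1", "c2"): 7.0}
print(penalize_costs(cost, h, H_high=4.0, I={"t1"}))
# {('t1', 'c1'): 2.0, ('t1', 'c2'): 1000003.0}
```

Because β dwarfs every ordinary cost, a minimum cost flow uses penalized edges only when forced to, which is what limits the number of pairs exceeding H_high without removing any edges.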
Proposition 1. For sufficiently large β, Algorithm B solves Problem B.
Proof. If the problem is infeasible, this is determined in the first iteration of Step 3b and, as noted in Section 4.1, this means the problem was infeasible before adding the constraint on the ℓth-order statistic. So suppose Problem B is feasible, with F_min,ℓ ≠ ∅; then, for sufficiently large β, a minimum cost flow under the revised costs of Step 4b places flow on at most L − ℓ + 1 edges e = (τ, γ) with τ ∈ I and h_e > h_min,ℓ, and so solves Problem B.

Corollary 1. Algorithm A solves Problem A.

Proof. In the proof of Proposition 1, if ℓ = L, then the penalty β forces no flow to pass along edges e = (τ, γ) ∈ E with τ ∈ I such that h_e > h_min,L, so these edges could be removed from E without changing the optimal flow f produced by Algorithm B. However, if these edges are removed, Algorithm B becomes Algorithm A.

Using I to Obtain a Subset of Controls Who Are Very Closely Matched
The methods in Section 4 constrained a maximum or an order statistic of h for pairs (τ, γ ) ∈ E with τ ∈ I ⊆ T , but all of the examples in Section 2 concerned the case of I = T . To let I be a proper subset of T is to ask the algorithm to produce a particularly close individual pairing (τ, γ ) for a specific subset of treated subjects τ ∈ I. When might this be useful?

Case Conferences in Medicine and Other Forms of Thick Description
In an effort to audit the cost and quality of surgery at 217 U.S. hospitals, Silber et al. (2014) sampled a representative template of 300 surgical patients, then matched surgical patients at each hospital to the template, producing a 300 × 217 array with rows defined by the template patient and columns defined by the hospital. In this way, each hospital could compare its own performance with its own patients to the performance of other hospitals when performing the same surgical procedure on similar patients. A comparison of 300 patients is useful. However, a standard part of medical quality control is the case conference, in which the care of one patient is reviewed in detail. The New England Journal of Medicine often publishes its "Case Records of the Massachusetts General Hospital," which are essentially case conferences in essay form. Template matching offers to enhance a case conference by permitting a comparison of a case in one's own hospital and very similar cases in other hospitals.
To be useful in such a comparative case-conference, the pair of patients in different hospitals would need to be very close as individuals, not a small part of a large counterbalancing scheme for 300 patients involving propensity scores and fine balance constraints. By letting I ⊂ T be a subset of, say, |I| = 10 interesting template patients, one could ensure exceptionally close individual pairs for all hospitals for these 10 cases, while using propensity scores and fine balance to produce comparable groups of 300 patients.
More generally, thick, narrative, or ethnographic description of a small number of closely matched pairs is often an enlightening counterpoint to quantitative analysis of many pairs. For discussion see Rosenbaum and Silber (2001) and Weller and Barnes (2014).

Close Pairs When Matching With Multiple Controls
When matching each treated individual to several controls, say three controls, it can be helpful if each matched set contains one control who is very close to the treated subject in terms of specific covariates. For instance, this can be advantageous when studying effect modification involving a few covariates (Hsu, Small, and Rosenbaum 2013; Hsu et al. 2015) or when seeking reduced sensitivity to unmeasured biases by reducing within-pair heterogeneity of response (Rosenbaum 2005). For the other two controls in the same matched set, the covariates would be balanced as a group, as in Table 2, but might not be close as individual pairs. In this way, one has, in a single study, both the larger sample size of a 3-to-1 match and the less heterogeneous pairing of a 1-to-1 match.
To create such a match, replace T = {τ_1, . . . , τ_T} by T′ = {τ_1, . . . , τ_T, τ′_1, . . . , τ′_T}, where τ′_t is a second copy of treated subject τ_t, and let I = {τ_1, . . . , τ_T} contain the first copies. The network optimization would operate as if τ_t and τ′_t were distinct individuals, and after matching, their controls would be joined to form matched set t. For the 3-to-1 match described above, τ_t would have demand d(τ_t) = −1 and τ′_t would have demand d(τ′_t) = −2. Use of Algorithm A with I = {τ_1, . . . , τ_T} and h_e equal to the robust Mahalanobis distance would minimize the maximum distance for the τ_t ∈ I, trying to ensure one very close control for every treated individual in every matched triple. When matching with variable numbers of controls, the duplicates' demands d(τ′_t) could be determined by the entire number of Keele, Pimentel, and Yoon (2015).
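The duplication step can be sketched as follows. The function below is illustrative only (its name, the string labels for the second copies, and the demand convention simply follow the description above, not the interface of any package):

```python
def duplicate_treated(treated, n_group_controls=2):
    """Build the doubled treated set for a (1 + n_group_controls)-to-1
    match: the first copy of each tau_t (the set I) demands one very
    close control; the second copy demands the remaining controls."""
    first_copies = list(treated)
    second_copies = [t + "'" for t in treated]        # tau'_t, second copies
    I = set(first_copies)
    demand = {t: -1 for t in first_copies}            # one very close control
    demand.update({t: -n_group_controls for t in second_copies})
    return first_copies + second_copies, I, demand

nodes, I, demand = duplicate_treated(["t1", "t2"])
print(nodes)                        # ['t1', 't2', "t1'", "t2'"]
print(demand["t1"], demand["t1'"])  # -1 -2
```

After the flow is found, the controls matched to τ_t and τ′_t are merged into the single matched set t.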

Discussion: Searching to Find the Best Feasible Constraint
Optimal matching in observational studies minimizes the total of within-pair covariate distances subject to various constraints that enforce covariate balance or exact matches. This is done by finding a minimum cost flow in a network, a standard problem in operations research for which good polynomial-time algorithms exist. A match that is optimal in this sense may include a moderate number of pairs with large covariate distances. Here, a method has been proposed and illustrated for adding one or more new constraints that prevent any pairs from exhibiting larger covariate distances than are strictly needed to satisfy the constraints. Stated more precisely, the new constraint forces the match to achieve the minimum possible value of the maximum within-pair covariate distance among all matches that satisfy the constraints; then, with that additional constraint in place, the match minimizes the total within-pair covariate distance. To set this additional constraint, the needed minimax value is determined by a binary search involving repeatedly constructing a matched sample of a simple form; then, with this value determined, edges are removed from the matching network forbidding matches that exceed the minimax value. Removing edges from a matching network improves the worst-case time bound for computing the minimum cost flow. Here, removing edges means forbidding consideration of many matched pairs that we did not want anyway. As discussed at the end of Section 4.4, the algorithm may be accelerated by forcing the maximum distance in the match to exceed the minimax value by at most a small error δ. In the example with 1391 matched pairs, the minimax constraint avoided dozens of large covariate distances and exhibited excellent covariate balance. Various generalizations were discussed and illustrated, including quantile constraints and iterated minimax constraints.

Supplementary Materials
Online supplementary material contains an R package, aamatch, that includes: (i) a limited implementation of the method, (ii) the data used in the example, (iii) an example that reproduces some of the matched comparisons for the example.