Matching One Sample According to Two Criteria in Observational Studies

Abstract Multivariate matching has two goals: (i) to construct treated and control groups that have similar distributions of observed covariates, and (ii) to produce matched pairs or sets that are homogeneous in a few key covariates. When there are only a few binary covariates, both goals may be achieved by matching exactly for these few covariates. Commonly, however, there are many covariates, so goals (i) and (ii) come apart and must be achieved by different means. As is also true in a randomized experiment, similar distributions can be achieved for a high-dimensional covariate, but close pairs can be achieved for only a few covariates. We introduce a new polynomial-time method for achieving both goals that substantially generalizes several existing methods; in particular, it can minimize the earthmover distance between two marginal distributions. The method involves minimum cost flow optimization in a network built around a tripartite graph, unlike the usual network built around a bipartite graph. In the tripartite graph, treated subjects appear twice, on the far left and the far right, with controls sandwiched between them; efforts to balance covariates are represented on the right, while efforts to find close individual pairs are represented on the left. In this way, the two efforts may be pursued simultaneously without conflict. The method is applied to our on-going study in the Medicare population of the relationship between superior nursing and sepsis mortality. The match2C package in R implements the method. Supplementary materials for this article are available online.


Application: Nursing Resources and Health Outcomes
The presence or absence of advanced technology is conspicuous when comparing hospitals, but the training and experience of staff, and the administrative environment in which they work, may be as important or more so for health outcomes. In recent work, we have found higher rates of 30-day survival following surgery at hospitals with superior nursing resources (Silber et al. 2016; Lee et al. 2018; Lasater et al. 2021a), the difference in mortality being most pronounced among the 20% of patients at highest risk of death. Moreover, the difference in survival was accompanied by quite modest differences in the cost of the services provided. In particular, Silber et al. (2016) examined 25,076 matched pairs of Medicare patients undergoing the same surgical procedure, one at a hospital with superior nursing, the other at a hospital without superior nursing, where the pairs were matched exactly for 130 ICD-9 principal surgical procedures (i.e., a nominal covariate with 130 categories) and balanced patient age, year of admission, sex, race, emergency department admission status, transfer-in status, 31 comorbidities such as diabetes and congestive heart failure, a propensity score, a risk-of-death score, and other covariates. The weights used in the baseline risk-of-death score were estimated from another dataset, and they reflected the surgical procedure and comorbid conditions, among other factors. The difference in mortality was 1% (superior nursing 4.8%, other 5.8%), but this rose to 2.6% (superior nursing 17.3%, other 19.9%) in the 20% of pairs with the highest estimated probabilities of death; yet, the mean difference in cost was estimated to be only $74, rising to $941 in the highest risk quintile (Silber et al. 2016, Table 5). In the lowest risk quintile, the risk was extremely low regardless of nursing resources. The added expense of superior nursing was partially offset by reduced use of the intensive care unit and slightly shorter length of stay.
A variety of possible causal questions attend such a comparison. Most speculative are questions that imagine restructuring existing hospitals to employ more nurses with better training in improved administrative environments. That might work, but the actual matched comparison does not involve restructuring hospitals. Least speculative are questions that imagine channeling or assigning higher risk patients to existing hospitals with better nursing resources, because the matched comparison did observe similar patients at existing hospitals with different nursing resources. These less speculative questions concern effect modification: the possibility that more capable hospitals do more to benefit the highest risk patients. It is simpler to study effect modifiers if individuals are closely paired for effect modifiers.
So, it helps to have a matched sample that is closely paired for a few key covariates and that balances many other covariates. Indeed, matching in this way tends to exhibit reduced sensitivity to unmeasured biases (Rosenbaum 2005; Zubizarreta et al. 2014). The current paper proposes a new methodology for matching closely for key covariates while balancing many other covariates.

Covariate Balance Versus Closely Matched Individual Pairs
Covariate balance refers to similar distributions of observed covariates in treated and control groups, not to who is matched to whom. A pair is closely matched for a covariate if the covariate has nearly the same value for the two paired individuals. It is impossible to match closely for many covariates, a fact sometimes described with melodrama as the "curse of dimensionality." Ignoring the 130 principal surgical procedures and many other covariates, and focusing on just the 31 binary comorbidities in Silber et al. (2016), there are 2^31 ≈ 2.1 × 10^9, or two billion, comorbidity categories, far more categories than there are patients in all of Medicare. In truth, this is not a curse, merely an idée fixe, an insistence that a mathematical problem (adjusting for observed covariates) have a certain form (conditioning on nearly identical values of covariates). Randomized experimentation performs causal inference with no curse of dimensionality, not by comparing identical people, but by making sure that the treatment a person receives is statistically independent of everything that makes people different. In a completely randomized experiment, randomization inference allows for the small imbalances in outcomes that may occur by chance in the absence of a treatment effect, and does this without conditioning on any covariates, observed or unobserved. Matching does not address unobserved covariates, unlike randomization, but as in experiments, it suffices to balance high dimensional covariates rather than condition upon them (Rosenbaum and Rubin 1983, Theorem 3). Several strategies attempt to balance covariates without pairing individuals who have similar values of those covariates. A few follow. First, matching closely for the propensity score tends to balance observed covariates used to define the score (Rosenbaum and Rubin 1983, Theorem 1).
Second, fine balance is a constraint requiring a matched sample to maximally balance a nominal covariate, perhaps one with many levels (Rosenbaum 1989, sec. 3.2). Refined balance represents a nominal covariate in a hierarchical tree structure, and attempts to balance to the most refined extent possible; one example balanced more than 2.8 million categories. Third, directional penalties alter covariate distances to penalize an objective function for contributing to covariate imbalances, resembling Lagrangians in integer programming (Yu and Rosenbaum 2019).

Outline: A New Matching Algorithm; Simulation; Application
A new approach to multivariate matching is described in Section 2: it is a combinatorial optimization algorithm that separates covariate balancing from close pairing, yet can be implemented to run in polynomial time. When pairing is used to balance covariates, the two goals of close pairs and covariate balance can work against one another; however, the new algorithm separates these tasks, so balancing many covariates need not interfere with pairing closely for a few covariates. How can the two tasks of covariate balancing and close pairing work against one another? If a treated woman is close to no control woman on key covariates, but is close to a control man on these covariates, then we could match her to the man, yet perfectly balance gender by matching some treated man to a close control woman. We might prefer this match, which balances gender, even if the treated man had available a close control man, because use of that control would leave gender out of balance. In brief, counter-balancing an imbalance in one pair may mean picking someone further away in another pair.
Section 3 clarifies the new method using simple cases, comparing them with conventional methods. One simple case of the new method that is not an instance of existing methods uses two calipers on the propensity score, a loose caliper for pairing and a tight caliper for balancing. We want pairs to be exactly matched for several nominal variables, say cancer stage and grade, and we want paired individuals to differ in their propensity scores by at most, say, ±0.2 × s, where s is the standard deviation of the propensity score; but we also want the entire distribution of propensity scores to be much closer, so a balancing caliper of ±0.05 × s is also used.
More complex but more important versions of the new method are described in Section 4. A simulation in Section 5 compares one implementation of the new method to a standard method. The new method is applied in Section 6 to our on-going study of sepsis mortality in hospitals with better and worse nurse resources.

Vertices and Edges
Matching has long been represented by a so-called bipartite or two-part graph (e.g., Bertsekas 1991), the nodes in the first part being the T treated individuals, the nodes in the second part being the C ≥ T potential controls. Edges connect treated individuals to potential controls, and each edge has a cost. An optimal match finds a control for each treated individual while minimizing the sum of the T within pair costs. Taking some liberties with precise expression, we refer to the new matching structure as tripartite, because the treated individuals appear twice, on the left and on the right, with the controls sandwiched between them; however, more exactly, the structure is a network with these features and some additional needed features. For general discussion of network optimization, see Ahuja, Magnanti, and Orlin (1988) or Bertsekas (1991). Once the problem is shown to be equivalent to finding the minimum cost flow in a network, it is solved by general algorithms for that problem. As discussed later, in R, we apply the Relax IV software of Bertsekas and Tseng (1988) to find the minimum cost flow.
Two versions of the network structure are depicted in Figure 1. There are T treated individuals, T = {τ_1, ..., τ_T}, and each treated individual is to be matched to κ ≥ 1 controls from among the C ≥ Tκ potential controls available, C = {γ_1, ..., γ_C}, where each control is used in at most one κ-to-1 matched set. Each treated individual, t = 1, ..., T, appears twice in the network, as τ_t and as τ̄_t, and each potential control, c = 1, ..., C, appears twice, as γ_c and γ̄_c.

[Figure 1. Two versions of the network. Each treated individual, t = 1, ..., T, appears twice, as τ_t and τ̄_t, and each potential control, c = 1, ..., C, appears twice, as γ_c and γ̄_c. There is a source ς and a sink ς̄. When matching each treated individual to κ ≥ 1 controls, each edge that connects to the source ς or to the sink ς̄ has capacity κ and cost zero. All other edges have capacity 1. The source ς emits Tκ units of flow, the sink absorbs Tκ units of flow, and flow is conserved at all other nodes.]

The informal intuition underlying Figure 1 is as follows. Controls γ_c and their alter egos γ̄_c are sandwiched between the treated subjects τ_t and their alter egos τ̄_t. The alter egos are a technical tool, explained below. Ultimately, we have only treated-control pairs, (τ_t, γ_c), but their alter egos γ̄_c, τ̄_t force the Tκ pairs to have certain properties as a population, for instance covariate balance, that they do not have as individual pairs. This is familiar: a single water molecule cannot be ice or liquid or steam, but a collection of water molecules can have these properties. Covariate balance is a property of a population of pairs that is not evident in any single pair. A treated subject τ_t will be paired with a control γ_c on the left side of Figure 1 by one criterion. The control γ_c is always paired with its alter ego, γ̄_c; that technical step ensures that the same controls are selected for both sides of Figure 1.
Also, γ̄_c is paired by a second criterion with the alter ego τ̄_{t'} of a typically different treated individual, τ_{t'}, on the right of Figure 1. So, implicitly a treatment-control "pair" (τ_t, γ_c) is really a structure (τ_t, γ_c, γ̄_c, τ̄_{t'}) where typically t ≠ t'. The pairing (τ_t, γ_c) decides who is paired with whom; however, (γ̄_c, τ̄_{t'}) will force the collection of pairs to have some desired property. For example, (τ_t, γ_c) might differ by one year in age, but (γ̄_c, τ̄_{t'}) might force the proportion of women to be the same in the treated and control groups as a whole, without necessarily pairing women to women. That is, (γ̄_c, τ̄_{t'}) may force sex to be balanced without pairing for sex. In the technical argument that follows, a unit of flow will pass along the path τ_t → γ_c → γ̄_c → τ̄_{t'}, thereby identifying who is paired with whom. An optimal match is a minimum cost flow.
There is also a source ς and a sink ς̄. As usual, write |S| for the number of elements in a finite set S, and write S − S' = {s : s ∈ S and s ∉ S'} for two sets S and S'. The network has

|V| = 2 + 2T + 2C vertices, (1)

or |V| = O(C) vertices, using the fact that C ≥ Tκ.
An edge e is an ordered pair of vertices. The network in Figure 1 has edges of the forms

E = {(ς, τ_t)} ∪ {(τ_t, γ_c)} ∪ {(γ_c, γ̄_c)} ∪ {(γ̄_c, τ̄_t)} ∪ {(τ̄_t, ς̄)}, (2)

so, taking every t and c, there are |E| = 2T + C + 2TC = O(C^2) edges, because C ≥ Tκ. Edges (τ_t, γ_c) are called the "left side" of the network, and edges (γ̄_c, τ̄_t) are called the "right side." The left side will be used to pair closely for key covariates, while the right side will be used to balance covariates. Because covariate balance does not refer to who is paired with whom, it is helpful to separate these two activities. It may be helpful to notice that the left and right sides of the network are simply two sets of requirements imposed upon the match: we choose to regard the criteria on the left as defining who is paired to whom, but this merely says that the criteria we placed on the left side are the ones we regard as important for pairing. We speak of (V, E) as "complete" if E has every possible (τ_t, γ_c) and every possible (γ̄_c, τ̄_t), with |E| = 2T + C + 2TC. In a complete network, the structure of the edges does not constrain who can be matched to whom: every 1-to-κ match is possible, though some may be preferable to others. In contrast, the second network in Figure 1 is sparser, meaning that some (τ_t, γ_c) and (γ̄_c, τ̄_t) have been omitted. The degree of sparsity of the network has important consequences for the speed of computation and the memory required to store a network on a computer. In a sparser network, it is often possible to match hundreds of thousands of people in a single optimization; see Yu et al. (2020). An increasingly large sequence of networks is "dense" if T ∝ C and |E| ∝ C^2 as C → ∞, and it is "light" if |E| = O(C log C).
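To make the counts in (1) and |E| = 2T + C + 2TC concrete, the complete network can be enumerated directly. The Python sketch below is our own illustration, with node labels of our choosing rather than anything from the match2C package; it builds the vertex and edge lists for small T and C and confirms the two counts.

```python
from itertools import product

def complete_tripartite(T, C):
    """Vertex and edge lists of the complete tripartite matching network.

    Treated t appears twice, as ('tau', t) and ('tau_bar', t); control c
    appears twice, as ('gam', c) and ('gam_bar', c); a source and a sink
    complete the vertex set V.
    """
    V = (['source', 'sink']
         + [('tau', t) for t in range(T)] + [('tau_bar', t) for t in range(T)]
         + [('gam', c) for c in range(C)] + [('gam_bar', c) for c in range(C)])
    E = ([('source', ('tau', t)) for t in range(T)]                              # T edges
         + [(('tau', t), ('gam', c)) for t, c in product(range(T), range(C))]    # left side: TC
         + [(('gam', c), ('gam_bar', c)) for c in range(C)]                      # alter-ego edges: C
         + [(('gam_bar', c), ('tau_bar', t)) for c, t in product(range(C), range(T))]  # right side: TC
         + [(('tau_bar', t), 'sink') for t in range(T)])                         # T edges
    return V, E

V, E = complete_tripartite(T=3, C=7)
assert len(V) == 2 + 2*3 + 2*7      # |V| = 2 + 2T + 2C
assert len(E) == 2*3 + 7 + 2*3*7    # |E| = 2T + C + 2TC
```

A sparser network would simply omit some left-side and right-side edges from E, shrinking |E| toward O(C log C).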

Capacities and Feasible Flows
Each edge e ∈ E has a nonnegative integer capacity, cap(e) ≥ 0. In our network structure, cap{(ς, τ_t)} = cap{(τ̄_t, ς̄)} = κ for t = 1, ..., T, and all other edges have capacity 1. A feasible flow f(·) assigns a nonnegative integer to each edge e ∈ E subject to certain requirements. In words, the source ς supplies Tκ units of flow, the sink ς̄ absorbs Tκ units of flow, and other vertices pass along all of the flow they receive. Also, a feasible flow respects the capacity constraints, 0 ≤ f(e) ≤ cap(e) for each edge e ∈ E. Write E' ⊂ E for the set of edges that include neither the source ς nor the sink ς̄, so E' contains |E| − 2T edges e of the forms (τ_t, γ_c), (γ_c, γ̄_c), (γ̄_c, τ̄_t). Formally, a feasible flow is a function f(·) such that

f(e) ∈ {0, 1, 2, ...} with 0 ≤ f(e) ≤ cap(e) for each e ∈ E, (3)

Σ_{t=1}^{T} f(ς, τ_t) = Tκ = Σ_{t=1}^{T} f(τ̄_t, ς̄), (4)

Σ_{(a,b)∈E} f(a, b) = Σ_{(b,d)∈E} f(b, d) for each vertex b with b ≠ ς, b ≠ ς̄. (5)

Condition (5) says that the total flow into vertex b equals the total flow out from b if b is neither the source ς nor the sink ς̄. In a network like the second network in Figure 1, with some omitted edges, a feasible flow may or may not exist.
Recall that an edge (τ_t, γ_c) ∈ E has capacity 1, so an integer flow has either f(τ_t, γ_c) = 0 or f(τ_t, γ_c) = 1. Proposition 1 describes the matched sample determined by a flow: according to Proposition 1, each feasible flow f(·) is a 1-to-κ matched sample, where τ_t is matched to γ_c if and only if f(τ_t, γ_c) = 1.

Proposition 1. A function f(·) satisfying (3) is a feasible flow in the network defined by (1)-(5) if and only if conditions (i)-(iii) hold: (i) the left-side edges with f(τ_t, γ_c) = 1 define a pairing P_1 in which each τ_t has κ controls γ_c, no control being used twice; write C_1 for the set of controls so used; (ii) the right-side edges with f(γ̄_c, τ̄_t) = 1 define a pairing P_2 in which each τ̄_t has κ controls γ̄_c, no control being used twice; write C_2 for the set of controls so used; (iii) the two pairings pick the same sets of controls, C_1 = C_2.
Proof. Consider claim (i). Because ς emits Tκ units of flow, and cap{(ς, τ_t)} = κ for each t, the first equality in (4) forces f(ς, τ_t) = κ for every t; conservation of flow (5) at τ_t then forces exactly κ of the capacity-1 edges (τ_t, γ_c) to carry one unit of flow, and conservation at γ_c, whose only outgoing edge (γ_c, γ̄_c) has capacity 1, ensures that no control is used twice. This proves (i), and the proof of (ii) is analogous. To prove (iii), note simply that γ̄_c is in C_2 if and only if f(γ_c, γ̄_c) = 1, if and only if γ_c is in C_1, because (γ_c, γ̄_c) is the only edge into γ̄_c and the only edge out of γ_c.
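The feasibility conditions and Proposition 1 can be checked by hand on a toy network. The Python sketch below is a hand-constructed flow on a hypothetical instance with T = 2, C = 3, κ = 1 (not the match2C implementation); it verifies the supply condition (4), the conservation condition (5), and the agreement of control sets, C_1 = C_2, even though the two pairings cross.

```python
from itertools import product

# Hypothetical tiny instance: T = 2 treated, C = 3 controls, kappa = 1.
T, C, kappa = 2, 3, 1
# Left pairing P1: tau_0-gam_1, tau_1-gam_2.  Right pairing P2 uses the
# same controls but crosses them: gam_bar_1-tau_bar_1, gam_bar_2-tau_bar_0.
P1 = {0: 1, 1: 2}          # treated t -> control c on the left
P2 = {1: 1, 2: 0}          # control c -> treated t on the right

flow = {}
for t in range(T):
    flow[('source', ('tau', t))] = kappa
    flow[(('tau_bar', t), 'sink')] = kappa
for t, c in product(range(T), range(C)):
    flow[(('tau', t), ('gam', c))] = 1 if P1.get(t) == c else 0
    flow[(('gam_bar', c), ('tau_bar', t))] = 1 if P2.get(c) == t else 0
for c in range(C):
    flow[(('gam', c), ('gam_bar', c))] = 1 if c in P1.values() else 0

# (4): the source emits T*kappa units and the sink absorbs them.
assert sum(flow[('source', ('tau', t))] for t in range(T)) == T * kappa
assert sum(flow[(('tau_bar', t), 'sink')] for t in range(T)) == T * kappa
# (5): flow is conserved at every vertex other than the source and sink.
inner = ([('tau', t) for t in range(T)] + [('tau_bar', t) for t in range(T)]
         + [('gam', c) for c in range(C)] + [('gam_bar', c) for c in range(C)])
for v in inner:
    inflow = sum(f for (a, b), f in flow.items() if b == v)
    outflow = sum(f for (a, b), f in flow.items() if a == v)
    assert inflow == outflow
# Proposition 1 (iii): both pairings pick the same set of controls.
assert set(P1.values()) == set(P2.keys())
```

Changing P2 to use a control outside {1, 2} would break conservation at some γ_c or γ̄_c, illustrating why the alter-ego edges force C_1 = C_2.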

Some Traditional Distances
There are two distances, δ_{τ_t,γ_c} ≥ 0 and Δ_{τ_t,γ_c} ≥ 0, measuring the proximity of treated individual τ_t and potential control γ_c. Here, δ_{τ_t,γ_c} will appear on the left side of the network, and Δ_{τ_t,γ_c} will appear on the right. A parameter λ ≥ 0 will control the relative importance of the two distances.
One traditional distance δ_{τ_t,γ_c} in conventional bipartite matching is the Mahalanobis distance between the observed covariates describing τ_t and γ_c; see Cochran and Rubin (1973) and Rubin (1979). Another traditional distance sets δ_{τ_t,γ_c} = ∞ if τ_t and γ_c have estimated logits of propensity scores that differ in absolute value by more than a positive caliper, and otherwise takes δ_{τ_t,γ_c} to be a Mahalanobis distance focused on a few key covariates; see Rosenbaum and Rubin (1985). A traditional caliper is 0.2 times the standard deviation of the logit of the propensity score. This latter form of minimum distance matching within propensity score calipers is used as a traditional benchmark in the simulation for comparison with the new methods. Following a long but arbitrary tradition, we refer to the quadratic form as the Mahalanobis "distance," and to its square root as the Mahalanobis "norm." A caliper may be implemented as either a hard constraint, by removing edges with δ_{τ_t,γ_c} = ∞, or as a soft constraint with a penalty, by replacing ∞ by a large but finite number. A hard constraint accelerates computation by removing edges, but it may mean there is no feasible match. Yu et al. (2020) rapidly determine the smallest feasible caliper. A soft constraint ensures feasibility by tolerating the minimal feasible number of violations of the caliper. In parallel, when exact matching for a nominal covariate is implemented by a penalty, it ensures feasibility by minimizing the number of pairs not matched exactly, the result being called near-exact matching. Other distances Δ_{τ_t,γ_c} relevant to covariate balance are discussed later. Matching networks with a third layer of covariate categories, τ_t → γ_c → category, were used in Rosenbaum (1989, sec. 3.2) to balance a nominal covariate without pairing for it. This method has been extended by Pimentel et al. and Yu et al. (2020), among others.
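The soft-constraint version of the caliper can be sketched in a few lines. The Python function below is our own illustration (the function names, the identity inverse covariance, and the penalty of 1000 are illustrative choices, not part of match2C): it adds a large but finite penalty to a Mahalanobis quadratic form whenever the propensity-score logits differ by more than the caliper width.

```python
def mahalanobis2(u, v, VI):
    """Mahalanobis quadratic form for 2-d vectors, given the inverse
    covariance matrix VI (2x2, as nested lists)."""
    d = [u[0] - v[0], u[1] - v[1]]
    return (d[0] * (VI[0][0] * d[0] + VI[0][1] * d[1])
            + d[1] * (VI[1][0] * d[0] + VI[1][1] * d[1]))

def soft_caliper_distance(x_t, x_c, logit_t, logit_c, VI, width, penalty=1000.0):
    """delta for one treated-control pair: the Mahalanobis quadratic form
    on the key covariates, plus a large but finite penalty (a soft
    constraint) when the propensity-score logits differ by more than the
    caliper width, so a match always remains feasible."""
    delta = mahalanobis2(x_t, x_c, VI)
    if abs(logit_t - logit_c) > width:
        delta += penalty
    return delta

# An identity inverse covariance keeps the toy example transparent.
VI = [[1.0, 0.0], [0.0, 1.0]]
close = soft_caliper_distance([60, 1], [61, 1], 0.10, 0.15, VI, width=0.2)
far = soft_caliper_distance([60, 1], [61, 1], 0.10, 0.60, VI, width=0.2)
assert close == 1.0           # inside the caliper: plain quadratic form
assert far == 1.0 + 1000.0    # caliper violated: penalized, but finite
```

A hard constraint would instead delete the penalized edges entirely, which is faster but may leave no feasible match.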
In Pimentel et al.'s refined balance method, the third layer had a tree structure that refined coarse categories, but it could have vastly more refined categories than there were controls, so the number of edges |E| could be vastly larger than C^2. In their example, there were 176 × 2^14 = 2.88 million refined categories. In effect, the method said: if balance is not exactly feasible, make up for this using near-by categories, that is, categories that represent the same coarsened category. The structure in Figure 1 replaces categories by duplicated treated individuals, τ̄_t, always with O(C^2) edges in E, with three important consequences. If (γ̄_c, τ̄_t) is present only if τ̄_t and γ̄_c are in the same category, then the tripartite network implements fine balance, as in Rosenbaum (1989, sec. 3.2). If Δ_{τ_t,γ_c} represents a distance in Pimentel et al.'s tree, then it produces refined balance with |E| = O(C^2). Most importantly, our preference for certain controls no longer needs to be expressed by a category to which those controls belong, but can instead combine several quantitative considerations, including calipers, an arbitrary earthmover distance among categories, and directional penalties for covariate imbalances.

Minimum Cost Flows
Each edge e ∈ E has a nonnegative cost, cost(e) ≥ 0. The cost of a feasible flow f(·) is cost(f) = Σ_{e∈E} f(e) cost(e), and g(·) is a minimum cost flow if cost(g) ≤ cost(f) for every feasible flow f(·). In our network structure,

cost(τ_t, γ_c) = δ_{τ_t,γ_c}, cost(γ̄_c, τ̄_t) = λ Δ_{τ_t,γ_c}, and all other edges have cost zero. (6)

Proposition 2. The cost of a feasible flow f in the network defined by Equations (1)-(6) is

cost(f) = Σ_{(τ_t,γ_c)∈P_1} δ_{τ_t,γ_c} + λ Σ_{(γ̄_c,τ̄_t)∈P_2} Δ_{τ_t,γ_c}, (7)

so a minimum cost flow picks out Tκ distinct controls to minimize a weighted sum of the two distances. For sufficiently large λ, a minimum cost flow minimizes Σ_{(γ̄_c,τ̄_t)∈P_2} Δ_{τ_t,γ_c} and, subject to that, minimizes Σ_{(τ_t,γ_c)∈P_1} δ_{τ_t,γ_c}. In most of the examples in this article, Δ_{τ_t,γ_c} is coarse, perhaps binary indicating violation or not of a caliper on the propensity score, or perhaps a small integer that counts the number of violations of several desired conditions. If the Δ_{τ_t,γ_c} are coarse, then there will often be many flows f(·), or matched samples, that are tied in producing the same minimum value of Σ_{(γ̄_c,τ̄_t)∈P_2} Δ_{τ_t,γ_c} in (7). Perhaps all of these tied flows minimize the number of violations of a propensity score caliper. If the Δ_{τ_t,γ_c} are coarse, then a sufficiently large λ will pick the flow that also minimizes Σ_{(τ_t,γ_c)∈P_1} δ_{τ_t,γ_c} among tied flows. Coarse Δ_{τ_t,γ_c}'s are emphasized in the current article. In contrast, varying a mid-sized λ allows exploration of the tradeoff between the two objectives, in parallel with Pimentel and Kelz (2020).
How does one find a minimum cost flow? How quickly can it be found? In a network with the structure defined by (1)-(5), an algorithm exists that finds a minimum cost flow in O(|V| · |E| + |V|^2 log |V|) arithmetic operations; see Korte and Vygen (2012, Theorem 9.13). Viewed in this way, all of the new and existing matching algorithms we describe have worst-case time bounds of O(C^3) in a dense network, and omission of sufficiently many edges can reduce that to O(C^2 log C). In R, the callrelax function in Pimentel's (2016) package rcbalance uses the auction algorithm of Bertsekas (1991) and the Fortran code of Bertsekas and Tseng (1988) in Hansen's (2007) optmatch package to find a minimum cost flow. This algorithm has a different time bound, but its performance in simulations is competitive.

A Minimum Distance Match
The structure in Proposition 2 can produce the most familiar form of optimal match, namely a minimum distance bipartite match. Consider a complete network (V, E) with all of the edges in Equation (2); then the structure involving (γ̄_c, τ̄_t) does not constrain which controls are used or who is matched to whom. Additionally, taking λ = 0 in Proposition 2 means that the (γ̄_c, τ̄_t) do not affect the cost. So, a complete network with λ = 0 yields simply the κ-to-1 match that minimizes the total of the first distance, Σ_{(τ_t,γ_c)∈P_1} δ_{τ_t,γ_c}. An example is minimum distance matching within propensity score calipers, as described in Section 2.3.

An Example of the Distinction Between Tripartite and Bipartite Matching
The simplest example of tripartite matching separates the hard caliper on the propensity score from the Mahalanobis distance, placing the Mahalanobis distance in δ_{τ_t,γ_c} on the left, and the caliper on the propensity score in Δ_{τ_t,γ_c} on the right. Here, δ_{τ_t,γ_c} is the Mahalanobis distance, while Δ_{τ_t,γ_c} = ∞ if the caliper on the propensity score is violated and Δ_{τ_t,γ_c} = 0 otherwise. Often, a Mahalanobis distance is used to ensure close matches on a few covariates highly predictive of outcomes, while the propensity score balances many other covariates. The balancing properties of propensity scores require a similar distribution of propensity scores in treated and matched control groups, but it is not essential that pairs be individually close on propensity scores. Bipartite minimum Mahalanobis distance matching within propensity score calipers forbids Δ_{τ_t,γ_c} = ∞, but tripartite matching does not. In tripartite matching with a large λ, the Δ_{τ_t,γ_c} on the right ensure that the match as a whole is compatible with the caliper, that is, the pairs could be broken and re-paired to satisfy the caliper with Δ_{τ_t,γ_c} = 0 for every revised pair; however, the pairing on the left has ignored the propensity score and focused on the priority covariates in δ_{τ_t,γ_c}. Many pairs (τ_t, γ_c) ineligible for the bipartite match because Δ_{τ_t,γ_c} = ∞ are eligible for the tripartite match. So, δ_{τ_t,γ_c} will tend to be smaller in the tripartite match. Indeed, every bipartite matched control group that satisfies the caliper is in M = {M_f : f is a feasible flow} as the outcome of a feasible tripartite match, but M contains other control groups as well, so the minimum of cost(f) over M is no larger, and often smaller, than cost(f) restricted to bipartite matches that satisfy the caliper. Expressed differently, the bipartite match found one match P to minimize cost(f) = Σ_{(τ_t,γ_c)∈P} (δ_{τ_t,γ_c} + λ Δ_{τ_t,γ_c}), but the tripartite match found two matches, P_1 and P_2, containing the same controls, to minimize Σ_{(τ_t,γ_c)∈P_1} δ_{τ_t,γ_c} + λ Σ_{(γ̄_c,τ̄_t)∈P_2} Δ_{τ_t,γ_c}, so the latter is never larger than the former.
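This inequality can be checked by brute force on a tiny hypothetical instance. In the Python sketch below (illustrative matrices and function names, not match2C; soft penalties stand in for ∞), the minimum-δ pairing violates the caliper for both pairs, yet re-pairing the same two controls satisfies it, so the tripartite cost is far smaller than the bipartite cost.

```python
from itertools import permutations

# Hypothetical instance: 2 treated, 2 controls, 1-to-1 matching.
# delta[t][c]: Mahalanobis-style pairing distance (left side).
# Delta[t][c]: 1 if the pair (t, c) violates the propensity caliper, else 0.
delta = [[1, 9],
         [9, 1]]
Delta = [[1, 0],
         [0, 1]]

def bipartite(delta, Delta, lam=1000):
    """Best single pairing, charging delta + lam*Delta to each pair."""
    T, C = len(delta), len(delta[0])
    return min(sum(delta[t][p[t]] + lam * Delta[t][p[t]] for t in range(T))
               for p in permutations(range(C), T))

def tripartite(delta, Delta, lam=1000):
    """Two pairings over the same controls, as in the cost (7): delta is
    charged to the reported pairing (left side), while Delta is charged to
    the best re-pairing of the chosen controls (right side)."""
    T, C = len(delta), len(delta[0])
    best = None
    for p1 in permutations(range(C), T):               # left pairing
        right = min(sum(lam * Delta[t][p2[t]] for t in range(T))
                    for p2 in permutations(sorted(p1), T))  # same controls
        cost = sum(delta[t][p1[t]] for t in range(T)) + right
        best = cost if best is None or cost < best else best
    return best

assert bipartite(delta, Delta) == 18   # forced onto the distant pairing
assert tripartite(delta, Delta) == 2   # close pairs, caliper satisfied by re-pairing
```

The bipartite cost can never be smaller, because every bipartite solution corresponds to a tripartite solution with P_1 = P_2.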

Double Calipers
Instead of eliminating calipers on the left, as in Section 3.2, the calipers on the left might be widened, say to 0.5 standard deviations of the propensity score, while the calipers on the right are tightened, say to 0.05 standard deviations of the propensity score. This would force the marginal distribution of the propensity score to be more similar than in a bipartite match, while also increasing the emphasis on pairing for the key covariates in the Mahalanobis distance.
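A minimal sketch of the double-caliper idea, with hypothetical function and parameter names of our own choosing: the wide caliper enters the left (pairing) penalty, while the tight caliper enters the right (balancing) count.

```python
def double_caliper_penalties(logit_t, logit_c, s, wide=0.5, tight=0.05,
                             penalty=1000.0):
    """Return (left_penalty, right_penalty) for one treated-control pair.

    The left (pairing) penalty uses a wide caliper of wide*s on the
    propensity-score logit; the right (balancing) penalty uses a tight
    caliper of tight*s.  s is the standard deviation of the logit."""
    gap = abs(logit_t - logit_c)
    left = penalty if gap > wide * s else 0.0
    right = 1.0 if gap > tight * s else 0.0   # Delta counts caliper violations
    return left, right

# A pair 0.2 SDs apart: acceptable for pairing, but counted for balancing.
assert double_caliper_penalties(0.30, 0.50, s=1.0) == (0.0, 1.0)
# A pair 0.02 SDs apart satisfies both calipers.
assert double_caliper_penalties(0.30, 0.32, s=1.0) == (0.0, 0.0)
```

With a large λ, the right-side counts force the marginal distributions of the propensity score close together, while the left side remains free to prioritize the key covariates.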
If Hansen's (2008) prognostic score has been estimated from another dataset, then the caliper for pairing on the left might use the prognostic score, while the caliper for balancing on the right might use the propensity score.

Fine Balance and Related Methods
Suppose that the individuals τ_t and γ_c in T ∪ C divide into N mutually exclusive and exhaustive nominal categories defined by covariates, ν : T ∪ C → {1, ..., N}, so that τ_t falls in category ν(τ_t) and γ_c falls in category ν(γ_c). Fine balance refers to having the same marginal distributions in treated and control groups for these N categories, without requiring that pairs be matched for the categories. For i = 1, ..., N, define p_i to be κ times the number of τ_t in category i, so p_i = κ × |{τ_t ∈ T : ν(τ_t) = i}|, and define q_i to be the number of γ_c in category i, so q_i = |{γ_c ∈ C : ν(γ_c) = i}|. Let M ∈ M be a feasible 1-to-κ matched control group, and let q_Mi = |{γ_c ∈ M : ν(γ_c) = i}| be the number of controls in category i in the control group M. As discussed in Section 2.3, fine balance means that the treated group and the matched control group M have the same marginal distribution of the nominal covariate, that is, p_i = q_Mi for i = 1, ..., N; see Rosenbaum (1989, sec. 3.2), where Σ_{(τ_t,γ_c)∈P_1} δ_{τ_t,γ_c} is minimized over all finely balanced matches. A match M is said to be "near-fine" if the total deviation from fine balance, Σ_{i=1}^{N} |p_i − q_Mi|, is minimized. Pimentel et al. imposed the structure of a rooted tree on a nominal covariate with many levels, forcing more balance on priority distinctions near the root of the tree, tolerating larger imbalances, if unavoidable, on the many distinctions far from the root. These methods are special cases of Figure 1 with suitable distances Δ_{τ_t,γ_c}. See Zubizarreta (2012), among others, for more about fine balance.
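The deviation from fine balance is a direct computation on category counts. The short Python helper below is our own illustration (not part of match2C); it computes Σ_i |p_i − q_Mi| for a toy match, which is zero exactly when fine balance holds.

```python
from collections import Counter

def fine_balance_deviation(treated_cats, matched_control_cats, kappa=1):
    """Total deviation from fine balance: the sum over categories i of
    |p_i - q_Mi|, where p_i = kappa * (# treated in category i) and q_Mi
    is the number of matched controls in category i.  Zero if and only if
    the match M is finely balanced."""
    p = Counter(treated_cats)
    q = Counter(matched_control_cats)
    cats = set(p) | set(q)
    return sum(abs(kappa * p[i] - q[i]) for i in cats)

treated = ['a', 'a', 'b', 'c']
assert fine_balance_deviation(treated, ['a', 'a', 'b', 'c']) == 0   # finely balanced
assert fine_balance_deviation(treated, ['a', 'b', 'b', 'c']) == 2   # one 'a' replaced by a 'b'
```

A near-fine match minimizes this deviation when zero is infeasible; the earthmover distance of the next section additionally weights which substitutions occur.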

Earthmover Distance
Near-fine balance merely counts imbalances, and minimizes the count. Often, we regard some nominal categories as close, others as far apart, and we prefer to make up a deficit in one category using a nearby category. The tree structure in Pimentel et al.'s refined balance is one way to do this. In contrast, the earthmover distance accepts any specified distance between categories. The earthmover distance compares two marginal distributions, and it refers to both the amount of probability mass ("earth") that has to be moved and the minimum distance that it must be moved to transform one marginal distribution into the other; see Levina and Bickel (2001).
For instance, suppose the categories were formed by cutting a continuous covariate at various quantiles, and we are forced to tolerate some imbalance in the interquantile counts; then, we might prefer to make up a deficit from an adjacent quantile category. Alternatively, if many categories are formed as the interaction or direct product of many nominal covariates, and if fine balance for the interaction is infeasible, then perhaps we prefer to make up the deficit using a category that is correct for most of the nominal covariates, leading to a Hamming distance on the nominal categories.
With the 31 comorbidities in Section 1.1, the Hamming distance is not ideal. Although comorbidities are not rare among surgical patients in Medicare, most patients do not have most of the 31 comorbidities. A person with all 31 comorbidities would struggle to remain alive. If a person has 7 specific comorbidities, then the nearest person in terms of the Hamming distance is much more likely to have 6 of these 7, rather than these 7 plus one more, even though the Hamming distance is 1 in both cases. So the Hamming distance tends to favor healthier controls. A better distance for comorbidities says: if we cannot have exactly the same 7 comorbidities, then 7 comorbidities are still required, but with the minimum number of substitutions. Under this substitution distance, the corresponding Hamming distance is always even, because one comorbidity is added when another is removed.
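The contrast between the two distances can be made concrete. In the Python sketch below (illustrative names; a large penalty stands in for the requirement of an equal number of comorbidities), the healthier 6-of-7 control is nearest by the Hamming distance, but the substitution distance prefers the control with 7 comorbidities and one substitution.

```python
def hamming(u, v):
    """Number of comorbidity indicators on which u and v disagree."""
    return sum(a != b for a, b in zip(u, v))

def substitution_distance(u, v, penalty=1000):
    """Half the Hamming distance when u and v have the same number of
    comorbidities (each substitution removes one comorbidity and adds
    another, so the Hamming distance is even); otherwise a large penalty,
    so equal comorbidity counts are effectively required."""
    if sum(u) != sum(v):
        return penalty
    return hamming(u, v) // 2

seven = [1]*7 + [0]*3            # 7 specific comorbidities out of 10
six_of_seven = [1]*6 + [0]*4     # healthier: only 6 of the 7
seven_plus_one = [1]*8 + [0]*2   # the same 7 plus one more
swap_one = [1]*6 + [0, 1, 0, 0]  # 7 comorbidities, one substituted

assert hamming(seven, six_of_seven) == 1     # Hamming: the healthier control is nearest
assert hamming(seven, seven_plus_one) == 1   # ... tied with the sicker one
assert substitution_distance(seven, six_of_seven) == 1000  # unequal counts: penalized
assert substitution_distance(seven, swap_one) == 1         # one substitution
```

Used as the between-category distance η_ij, the substitution distance steers the earthmover calculation away from systematically healthier controls.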
In brief, nominal categories often have structure that may be encoded in a distance between categories. Considerations of this kind lead to the earthmover distance; see Levina and Bickel (2001). The earthmover distance is a distance between marginal distributions over categories, not a distance between paired individuals. Let η_ij ≥ 0 be a distance between category i and category j, with η_ii = 0 and η_ij > 0 if i ≠ j. If we cannot achieve fine balance with p_i controls in category i, then we prefer to make up the deficit with controls from categories j that are close by, with small η_ij. Because C ≥ Tκ, we have Tκ = Σ_i p_i ≤ Σ_i q_i, so that Tκ = Σ_i p_i = min(Σ_i p_i, Σ_i q_i). Given this, the so-called earthmover distance between (p_1, ..., p_N) and (q_1, ..., q_N) is defined (Levina and Bickel 2001) as the minimum of (Tκ)^{-1} Σ_{i=1}^{N} Σ_{j=1}^{N} h_ij η_ij over h_ij ≥ 0 satisfying Σ_{j=1}^{N} h_ij = p_i for each i and Σ_{i=1}^{N} h_ij ≤ q_j for each j. This definition does not require h_ij to be an integer; however, because the p_i and q_i are integers, there do exist h_ij that are integers that yield the minimum; see, for instance, Korte and Vygen (2012, Proposition 9.24). So, we may assume that the h_ij that achieve the minimum are integers.
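For small instances, the defining minimization can be carried out by brute force over integer transport plans. The Python sketch below is an illustrative implementation of the definition (not the network formulation used for matching): each of the Σ_i p_i demanded units in category i is served by some category j at cost η_ij, respecting the supplies q_j.

```python
from itertools import product

def earthmover(p, q, eta):
    """Earthmover distance between integer marginals p and q with
    sum(p) <= sum(q), normalized by sum(p), by brute force: enumerate all
    integer plans assigning each demanded unit in category i to a serving
    category j at cost eta[i][j], respecting the supplies q[j]."""
    units = [i for i, pi in enumerate(p) for _ in range(pi)]  # one entry per demanded unit
    best = None
    for plan in product(range(len(q)), repeat=len(units)):    # serving category per unit
        used = [0] * len(q)
        for j in plan:
            used[j] += 1
        if any(used[j] > q[j] for j in range(len(q))):        # respect supply q_j
            continue
        cost = sum(eta[i][j] for i, j in zip(units, plan))
        best = cost if best is None or cost < best else best
    return best / sum(p)

# Three ordered categories with |i - j| as the between-category distance.
eta = [[abs(i - j) for j in range(3)] for i in range(3)]
assert earthmover([2, 1, 1], [2, 1, 1], eta) == 0.0   # fine balance possible: distance 0
assert earthmover([2, 1, 1], [1, 2, 1], eta) == 0.25  # one of four units moves one category over
```

The brute force is exponential in Σ_i p_i and serves only to illustrate the definition; the point of Section 4 is that the same minimization is solved in polynomial time inside the tripartite network.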
What is the relationship between the earthmover distance and the various forms of fine balance? If fine balance is possible, because p_i ≤ q_i for each i, then the earthmover distance is zero. To see this, simply take h_ii = p_i for all i, and h_ij = 0 for j ≠ i. Near-fine balance corresponds with the earthmover distance with η_ij = 1 for i ≠ j; then, the earthmover distance is (Tκ)^{-1} times the minimum total imbalance. If the minimum of Σ_{i=1}^{N} Σ_{j=1}^{N} h_ij η_ij is positive because fine balance is not possible, then the integer h_ij is the number of controls from category j who were counted as if they met the demand for p_i controls in category i. By adjusting the η_ij, the investigator can exert control over which substitutions are preferred when fine balance is infeasible.
Suppose that the network (V, E) is complete, with all |E| = 2T + C + 2TC edges in (2). Set Δ_{τ_t,γ_c} = η_ij if τ_t is in category i and γ_c is in category j, that is, if ν(τ_t) = i and ν(γ_c) = j. Then, for a feasible flow f, the distance Σ_{(γ̄_c,τ̄_t)∈P_2} Δ_{τ_t,γ_c} equals Σ_i Σ_j h_ij η_ij for the transport plan h_ij that counts the pairs in P_2 joining a treated individual in category i to a control in category j. Therefore, minimizing Σ_{(γ̄_c,τ̄_t)∈P_2} Δ_{τ_t,γ_c} over feasible flows f minimizes the earthmover distance. Using Proposition 2, we have Corollary 3, which says the minimum cost flow algorithm returns a match that minimizes the covariate distance Σ_{(τ_t,γ_c)∈P_1} δ_{τ_t,γ_c} among all matches that minimize the earthmover distance.
Corollary 3. In a complete network (V, E), for large enough λ, a minimum cost feasible flow f minimizes the total within-pair distance, Σ_{(τ_t,γ_c)∈P1} δ_{τ_t,γ_c}, among all matches that minimize the earthmover distance between the marginal distributions on the N nominal categories.

Directional Penalties
Directional penalties are asymmetric distances that work against the existing direction of the bias; see Yu and Rosenbaum (2019). For instance, instead of a symmetric caliper of ±0.2 standard deviations of the propensity score, an asymmetric caliper of the same length might extend from 0.05 to −0.35 standard deviations; it thereby tolerates a larger difference opposed to the direction of the bias. Yu and Rosenbaum (2019, Table 3) show that an asymmetric caliper can remove substantially more bias, but the degree of asymmetry must be adjusted for the data at hand. If the treated group is older than the potential controls, then a continuous directional penalty may be Δ_{τ_t,γ_c} = αk if τ_t is k years older than γ_c; here, k can be negative. Alternatively, a directional penalty may be Δ_{τ_t,γ_c} = αk if τ_t is k ≥ 0 years older than γ_c, or 0 if τ_t is younger than γ_c. In these cases, α must be adjusted to be compatible with λ in Proposition 2, so several matches with different α's would be compared; see Section 5.
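As a concrete sketch, the three penalties just described might be coded as follows. The function names, the penalty magnitude, and the caliper endpoints are ours for illustration; in practice α and the asymmetry would be titrated to the data at hand.

```python
def asymmetric_caliper(ps_t, ps_c, low=-0.35, high=0.05, penalty=1000.0):
    """Soft asymmetric caliper on the propensity score: the window tolerates
    a larger treated-minus-control difference opposed to the bias."""
    diff = ps_t - ps_c
    return 0.0 if low <= diff <= high else penalty

def signed_penalty(age_t, age_c, alpha):
    """Continuous directional penalty alpha * k when the treated subject is
    k years older; k, and hence the penalty, may be negative."""
    return alpha * (age_t - age_c)

def one_sided_penalty(age_t, age_c, alpha):
    """Penalize only differences in the direction of the bias."""
    return alpha * max(age_t - age_c, 0)
```

A pair whose treated-minus-control difference in propensity scores is −0.1 incurs no penalty under the asymmetric caliper, while a difference of +0.1, in the direction of the bias, does.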
Directional penalties have intuitive appeal, but they are also often instances of Lagrangians used to add a linear constraint to an integer program (Yu and Rosenbaum 2019, sec. 2.5). Conceived in this way, directional penalties attempt to force a balance constraint to be satisfied by a minimal adjustment to the objective function. The magnitude of a directional penalty must be titrated in an attempt to minimally satisfy the constraint without further altering the objective function.
A directional penalty may prefer a distant control who counterbalances a bias over a nearer control who reinforces the bias. For this reason, it is attractive to place a directional penalty on the right side of Figure 1, so the directional penalty affects the choice of controls, but not the pairing of treated and control individuals. There is no reason to pair a treated individual to a distant control, no matter how helpful that control may be in balancing the distribution of covariates.

Structure of the Simulation: Bipartite Versus Tripartite
The simulation considers multivariate Normal covariates with different mean vectors in treated and control groups, but with common covariance matrix. In this situation, the logit of the propensity score is essentially the linear discriminant. We may always linearly transform this situation so that the covariates are independent Normal variables with variance 1, the first coordinate being proportional to the linear discriminant, with expectation μ in the treated group and expectation 0 in the control group, and with the remaining coordinates having expectation 0 in both groups. In this process of simplification, we must remember that the investigator would not know that all the bias lies on the first coordinate, and would have to discover this approximately using some method of estimation. In the simulation, the investigator fits a linear logit model for the propensity score, and works with its fitted values, which involve all of the coordinates to some degree. We consider the case of μ = 1, which is commonly viewed as a large bias (Rubin 1979, expression (3.3)): the median in the treated group is at the 84% point of the control group, far above its upper quartile, so boxplots of the first coordinate would exhibit substantial separation. Relatively common treated individuals who are 1.5 standard deviations above the expectation in the treated group are 2.5 standard deviations above the expectation in the control group; so, these common treated individuals have available very few close controls. One thousand pairs are formed from 1000 treated individuals and 3000 potential controls. Each situation is replicated 1000 times.
There are 2³ = 8 situations with a factorial structure. There are either 30 or 50 covariates; that is factor 1. There is a Mahalanobis distance based on 5 priority covariates. In factor 2, these five priority covariates are either (i) the estimated propensity score and covariates 2 through 5, or (ii) covariates 2 through 6. The distinction here is that covariates 2 through 6 exhibit no bias, whereas the estimated propensity score is much higher in the treated group. Finally, factor 3 distinguishes a situation with or without near-exact pairing for a nominal covariate with 16 levels formed by the quartiles of covariates 2 and 3. Near-exact matching uses a soft constraint, a penalized distance, in its effort to match exactly, thereby matching exactly as often as possible. As noted in Section 1.1, near-exact matching is important when studying effect modification, for instance, the importance of superior nursing for the quintile of patients at highest risk of death.
The situation before matching is compared to one standard bipartite matching method and four versions of tripartite matching. The standard method minimizes the total of the within-pair Mahalanobis distances for individuals who differ by at most 0.2 standard deviations on the propensity score, where this caliper on the propensity score is implemented as a soft constraint to ensure feasibility. This is similar to the method suggested by Rosenbaum and Rubin (1985), except that here the total within-pair distance is minimized.
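The soft caliper just described amounts to adding a penalty to the within-pair distance in proportion to the amount by which a pair violates the caliper; a minimal sketch, with our function name and an illustrative penalty scale:

```python
def penalized_distance(mahal, ps_t, ps_c, caliper=0.2, penalty=1000.0):
    """Mahalanobis distance plus a penalty proportional to the caliper
    violation on the propensity score; because the constraint is soft,
    a feasible match always exists."""
    violation = max(0.0, abs(ps_t - ps_c) - caliper)
    return mahal + penalty * violation
```

Pairs inside the caliper keep their Mahalanobis distance unchanged; pairs outside it are discouraged but never forbidden, which is what guarantees feasibility.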
In tripartite matching, pairing occurs on the left side of the network and balancing on the right side. The four tripartite methods remove the caliper on the propensity score, but on the right side of the network these matches add five near-fine balance categories defined by the quintiles of the propensity score in the treated group. Unlike the usual implementation of near-fine balance, the earthmover distance is used, with a preference for adjacent or near-by categories when fine balance is infeasible in a category. The distance is 0 for the same quintile, 1 for an adjacent quintile, 2 for skipping over one quintile, and so on. The four tripartite matchings differ in the weight they give to an additional linear directional penalty on the propensity score, again imposed on the right. The quantity in Equation (7) had weight 1 on the Mahalanobis distance on the left, and on the right weight 1000 on the earthmover distance plus α times the directional difference, treated-minus-control, in the propensity score, as in Section 4.3, for α = 0, 10, 100, 200. So, this weighting gives overriding priority to the earthmover distance and titrates the directional penalty. An investigator may achieve more than the simulation by titrating α, because α may be adjusted sample by sample, whereas the simulation uses the same α in 1000 samples.

Measures of Success: Balance or Proximity
There are five measures of success. Tables 1 and 2 report the average values of these measures over 1000 simulated samples; however, the measures are described here for one sample. The first measure is the absolute value of the mean difference, treated-minus-control, in the first coordinate, so it estimates μ = 1 before matching and, by its definition, is in standard deviation units; call this DIF. Because the investigator does not know, as we do, how to rotate the population distribution to achieve orthogonality, the investigator cannot calculate this measure from a matched sample; however, we can. The second measure, SS, is the sum, over the 30 or 50 coordinates, of the squared difference, treated-minus-control, in the 30 or 50 coordinate means. Because the difference in means is squared, each of the 30 or 50 coordinates makes a positive contribution, but the difference in means has expectation zero for 29 or 49 coordinates, with only coordinate 1 exhibiting bias. Because DIF is one of the 30 or 50 differences in means, we always have DIF² ≤ SS, but often they are close because the remaining 29 or 49 differences in means have zero expectation. The quantity SS, like DIF, depends on the true rotation to orthogonality, and so is not accessible to an investigator. The quantities EP and TP are the absolute values of the treated-minus-control difference in the means of the estimated (EP) and true (TP) propensity scores, so an investigator could compute EP from a matched sample. Finally, MAHAL is the mean of the within-pair Mahalanobis distances (the quadratic form, not the norm) computed from the five priority covariates. When the five priority covariates are covariates 2-6, the expectation of MAHAL before matching is 10, but when the priority covariates are 2-5 plus the estimated propensity score, the expectation of MAHAL is larger, as it involves bias, estimation error, and covariances between the estimated score and covariates 2-5, which are part of the score.
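For one simulated sample, the first two measures reduce to simple arithmetic on the group means. A sketch in plain Python (our function; it assumes each row is a covariate vector already rotated so that coordinate 1 carries all the bias, as in the simulation's construction):

```python
def dif_and_ss(treated, control):
    """DIF: |mean difference, treated-minus-control| on coordinate 1.
    SS: sum over all coordinates of the squared difference in means."""
    k, nt, nc = len(treated[0]), len(treated), len(control)
    diffs = [sum(row[j] for row in treated) / nt -
             sum(row[j] for row in control) / nc for j in range(k)]
    return abs(diffs[0]), sum(d * d for d in diffs)
```

As noted above, DIF² ≤ SS always holds, because DIF is one of the summed differences.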

Simulation Results
Tables 1 and 2 report the simulation results. In Table 1, the Mahalanobis distance δ_{τ_t,γ_c} includes the estimated propensity score, but in Table 2 it does not. Although the true discriminant (DIF) and true propensity score (TP) are unknown to the investigator and are not used in the match, they are both better balanced by tripartite matching with α > 10 than by the standard bipartite method. The total balance on all 30 or 50 covariates (SS) is also better for tripartite matching with α > 10 than for the standard method. Even with 50 covariates, the total sum of squares of 50 differences in covariate means (SS) is quite small for α ≥ 100, about 2.5% of the value before matching, and less than half the value produced by the standard method. At the same time, the mean within-pair Mahalanobis distance (MAHAL) is smaller for tripartite matching than for the standard bipartite method. This pattern is consistent with the different but related theoretical discussion in Section 3.2, where moving the propensity score calipers to the right side reduced the Mahalanobis distance with the same caliper on the marginal distribution of the propensity score.
The estimated propensity score that the investigator can examine (EP) looks similar to the unavailable true propensity score (TP) in Tables 1 and 2. Despite matching for the estimated score, not the true score, the remaining bias after matching is ever so slightly larger for the estimated score, perhaps because of overfitting of the estimated score. In general, however, in Tables 1 and 2, the observable imbalance in the estimated score (EP) is a good guide to the unavailable imbalance in the true propensity score (TP).
As α moves from 0 to 200 in Tables 1 and 2, there is a slight trade-off between MAHAL and the other measures, but for α ≥ 10, tripartite matching is uniformly better than the standard method, by all 5 measures in all 8 situations. In Table 2, even without directional penalties, i.e., α = 0, tripartite matching is better than bipartite matching by every criterion. Tables 1 and 2 suggest that tripartite matching can remove a large initial bias hidden among 50 covariates, balance all 50 covariates, and pair closely for five priority covariates.

Expanded Simulation
Tripartite matching is a general technique that splits the task of matching into an aspect focused on close pairs and an aspect focused on balancing covariates. This simulation shows that a tripartite match can sometimes provide substantial gains compared to standard methods. In the simulation, the right side of Figure 1 uses near-fine balance for propensity score strata plus a directional penalty for the propensity score. This is one reasonable formulation for a simulation, but in any scientific application the meaning of the covariates would shape the match.
An online appendix contains an expanded simulation. Twelve nonstandard matching methods are compared, three methods crossed with four values of α: a bipartite method with a directional penalty, and two tripartite methods with the directional penalty on either the right or the left side of Figure 1. See the appendix for specifics.
Generally, the directional penalty is substantially helpful, whether placed on the right or the left. For each α, the directional penalty performs uniformly better if placed on the right, but the gains vary in size with the sampling situation. Consider, for instance, α = 100 with 30 covariates in appendix Table A.2.1 (supplementary material), a near-exact match for the quartiles of two covariates, and a Mahalanobis distance based on covariates 2 through 6. Tripartite matching with the directional penalty on the right (tripartite II) has: (i) better balance measures than tripartite matching with the directional penalty on the left (tripartite I), and (ii) a smaller mean of the Mahalanobis distances than bipartite matching, 0.863 rather than 1.345, so it is 0.64 = 0.863/1.345 of the value obtained by bipartite matching. The complete simulation built 40,000 matched samples on a machine with 64 cores in five hours. On average, a single core (analogous to an inexpensive laptop) constructed a matched sample in 30 seconds.
An additional online appendix simulates outcomes and compares the mean squared errors of estimates of the average treatment effect for bipartite and tripartite matching. In this simulation, the mean squared error is dominated by the squared bias, and tripartite matching is typically more effective at reducing the bias, thereby reducing the mean squared error. This simulation also compares the standard error based on pair differences and the standard error that ignores the pairing, concluding that the pair differences should be used.

Comparing Matched Samples Produced by Different Methods
As we were writing this methodological paper, we were also extending the surgical study of Lasater et al. (2021a) to consider various nonsurgical conditions, including sepsis; see Lasater et al. (2021b). The study compares hospitals with superior nursing to more typical hospitals, controlling for other attributes of the hospital, including its size, its teaching status, and the presence or absence of advanced technology; see Lasater et al. (2020a) for precise definitions of these terms. Here, we compare three matched samples, M1, M2, and M3, where M1 is a conventional match, while M2 and M3 are two implementations of the new method. Table 3 lists 21 of the 86 covariates used in all three matches, and graphs will refer to all 86 covariates. As is traditional (Cochran and Rubin 1973), the standardized difference is the difference in covariate means divided by the square root of the equally weighted average of the treated and control variances of the covariate before matching, so matching changes the numerator but not the denominator. In Table 3, before matching, the treated and control groups were far apart in terms of the frequency of Hispanics and emergency admissions, and on hospital size, teaching status and technology. When a treatment effect may vary with an observed covariate, it aids analysis to match exactly for that covariate; see Hsu et al. (2013). Our past work with surgical patients suggested that superior nursing was most important for high risk patients, so to aid such comparisons we required an exact match for the quintile of risk of death within 30 days of admission, as estimated using a logit model applied to a different dataset. We also required an exact match for categories of teaching status, size, and technology status, as listed in Table 3. All three matches used these exact match categories and a Mahalanobis distance involving 40 of the 86 covariates judged to be clinically most important.
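The standardized difference described above can be sketched as follows (our function name; note that the denominator is computed once, from the groups before matching, so that matching changes only the numerator):

```python
from statistics import mean, variance

def standardized_difference(t_before, c_before, t_after, c_after):
    """Difference in means after matching, scaled by the square root of the
    equally weighted average of treated and control variances before
    matching (Cochran and Rubin 1973)."""
    denom = ((variance(t_before) + variance(c_before)) / 2.0) ** 0.5
    return (mean(t_after) - mean(c_after)) / denom
```

Because the denominator is fixed before matching, standardized differences for the same covariate are directly comparable across matches M1, M2, and M3.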
Besides the exact-match categories, match M1 is a conventional minimum distance match within propensity score calipers. Matches M2 and M3 removed the caliper on the propensity score on the left side of Figure 1. Instead, on the right side of Figure 1, the asymmetric caliper on the propensity score in Section 4.3 was used. In addition, there were ten fine-balance categories defined by quintiles of the propensity score and the binary indicator of Hispanic ethnicity, with an earthmover distance for the quintile categories; see Section 4.2. Finally, match M2 was adjusted to produce match M3 by adding linear directional penalties for two recalcitrant covariates, namely emergency admission and the continuous covariate recording the number of beds in the hospital. The upper panel of Figure 2 shows the absolute standardized differences in covariate means for all 86 covariates. All three matches are acceptable, but M3 is uniformly better than M1 and M2. The directional penalties that produced M3 from M2 brought down the two extreme points in the boxplot for M2. So, match M3 is best in terms of covariate balance, that is, the marginal distributions of covariates in the treated and matched control groups, ignoring who is matched to whom.
The lower panel of Figure 2 considers who is matched to whom. It shows the Mahalanobis norm, the square root of the quadratic form, comparing the covariates within each of the 25,444 treatment-control pairs. The 25,444 Mahalanobis norms tend to be smaller in M2 and M3 than in M1. By using a caliper to pair individuals, match M1 was forced to tolerate some very large Mahalanobis norms. By shifting the covariate balancing to the right side of Figure 1, matches M2 and M3 were never forced to accept a very large Mahalanobis norm. The 95th percentile of the Mahalanobis norms was reduced from 32.04 in M1 to 7.97 in M3. So, match M3 is better in two ways: better in terms of balancing covariates, and better in terms of forming close individual pairs.

Survival After Sepsis
The outcome is 30-day survival after admission for sepsis. Table 4 shows the outcomes, recorded in the manner of McNemar's test. Each count in Table 4 is a pair, not a patient, so the total count is 25,444 pairs, not 2 × 25,444 patients. Had Table 4 been the result of a paired randomized experiment, randomly assigning one patient in each pair to the superior nursing hospital in that pair, then, using the method in Rosenbaum (2002), we would be 95% confident that at least 348 patients survived because they were treated at a hospital with better nursing resources. Alas, this is not a randomized experiment, and modest selection biases could explain this difference: the P-value testing no effect could be ≥ 0.053 if there were an unobserved covariate that increased the odds of death by 50% and increased the odds of treatment at a hospital with superior nursing by 50%.
(This calculation uses the method in Rosenbaum (2002) with a device called amplification; see Rosenbaum (2017, Table 9.1). The upper bound on the P-value is 0.053 at Γ = 1.083 or, equivalently, at Λ = 1.5 and Δ = 1.5, as Γ = (ΛΔ + 1)/(Λ + Δ) = (1.5 × 1.5 + 1)/(1.5 + 1.5) = 1.083.) Patients were closely paired for risk of death, estimated from another dataset. Among the 46% of pairs with the highest risk, the point estimate is 2.04% lower mortality with superior nursing, and we are 95% confident of at least 208 lives saved; whereas among the 54% of pairs with the lowest risk of death, the point estimate is 1.3% lower mortality, and we are 95% confident of at least 78 lives saved. Both of these comparisons are sensitive to modest selection biases.
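The amplification used in this calculation maps a pair (Λ, Δ) to a single sensitivity parameter Γ; a one-line sketch with our function name:

```python
def amplify(lam, delta):
    """Amplification of a sensitivity analysis (Rosenbaum 2017, Table 9.1):
    Gamma = (Lambda * Delta + 1) / (Lambda + Delta)."""
    return (lam * delta + 1.0) / (lam + delta)
```

With Λ = Δ = 1.5, amplify(1.5, 1.5) returns (2.25 + 1)/3 ≈ 1.083, the Γ reported above.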

Further Aspects of the Matched Design
The tripartite matching structure in Figure 1 separates tactics that ensure close pairing from tactics that ensure balanced marginal distributions. As explained in Section 3.2, and as demonstrated in the simulation and the application, separation of pairing and balancing can simultaneously reduce the total within-pair distance and improve covariate balance, when compared to conventional bipartite matching that tries to balance by pairing alone.
The structure in Figure 1 unifies and simplifies various existing forms of matching with fine balance, including near-fine and refined balance. It also substantially extends those methods to include the general earthmover distance among categories and the superimposition of several covariate balancing strategies on the right side of the network, as illustrated with two strategies in the simulation. The structure in Figure 1 has computational advantages. A nominal covariate built as the interaction of many nominal covariates can have far more than C categories, yet the network in Figure 1 has O (C) nodes no matter how many categories there are. This affects both computer speed and memory.
It is sometimes helpful to force certain controls into a matched sample, without fixing a priori their roles in the close pairing of individuals on the left or the balancing categories on the right. For example, more or less by definition, there is typically a deficit of controls among individuals with the highest estimated propensity scores, so one may wish to force all controls with the highest estimated propensity scores into the match. In Figure 1, the central edges (γ_c, γ_c′) had cost zero. To ensure that particular controls are included in the match, let the central edges for these controls continue to have cost zero, but increase the costs of the remaining central edges by a large constant. Yu et al. (2020) match in large databases, creating a sparser but feasible network by eliminating edges that are obviously terrible matches, for instance, edges that connect individuals with very different propensity scores. Their method constructs tens of thousands of matched pairs in a single optimization, with extensive use of fine balance. Parallel techniques can be applied to Figure 1, with a caveat: both the left and right sides of the network must be sparse and jointly feasible.
We have focused on matching in a fixed ratio, say 1-to-1 or 1-to-2 matching. Pimentel, Yoon and Keele (2015) propose covariate balancing in a variable ratio. Essentially, they decide on the number of controls for each treated individual using the estimated propensity score; then, they separately match the 1-1 pairs, the 1-to-2 sets, and so on. Their strategy balances the covariates no matter what weights are used to combine matched sets of different sizes. It is straightforward to combine their strategy with the techniques in the current paper.

Analysis of the Matched Sample
Matching is part of the design of an observational study. How does design relate to analysis? The division of labor between pairing and balancing in Figure 1 is analogous to the division of labor between blocking and randomizing in a randomized block experiment, remembering of course that randomization balances unobserved covariates and matching does not. The pairing on the left in Figure 1, (τ_t, γ_c), defines the matched pairs or blocks, and these blocks are used in the analysis. The balancing on the right in Figure 1, (γ_c, τ_t), creates a pairing or blocking that is not used explicitly in the analysis. Pairing on the left, (τ_t, γ_c), should emphasize key covariates expected to strongly predict outcomes.
If a misjudgment is made about which covariates strongly predict outcomes, so that the pairing on the right, (γ_c, τ_t), strongly predicts outcomes but is ignored in analysis, various results suggest that sampling variability will seem larger than it is, so tests for no treatment effect will be somewhat conservative (e.g., Hollander et al. 1974, Theorem 1). If such a misjudgment occurs, then it can be corrected using covariance adjustment of a matched sample (Rosenbaum 2002). Rubin (1979) closely studied matching and covariance adjustment, in competition and when used together. He found that matching was more robust than regression, but regression was more efficient than matching when its Gaussian linear model was correctly specified. Also, when used together, Rubin (1979) found that regression adjustment of matched pair differences outperforms regression adjustment of matched samples ignoring the pairing. This suggests that close pairing for highly predictive covariates has value even when matching is combined with regression adjustment. Additionally, close pairing for highly predictive covariates reduces sensitivity to unmeasured covariates (Rosenbaum 2005; Zubizarreta et al. 2014).

Supplementary Materials
The supplementary material is the expanded simulation described in Section 5.4.