The efficient design of nested group testing algorithms for disease identification in clustered data

Group testing study designs have been used since the 1940s to reduce screening costs for uncommon diseases; for rare diseases, all cases are identifiable with substantially fewer tests than the population size. Substantial research has identified efficient designs under this paradigm. However, little work has focused on the important problem of disease screening among clustered data, such as geographic heterogeneity in HIV prevalence. We evaluate designs where we first estimate disease prevalence and then apply efficient group testing algorithms using these estimates. Specifically, we estimate prevalence using individual testing on a fixed-size subset of each cluster and use these prevalence estimates to choose group sizes that minimize the corresponding estimated average number of tests per subject. We compare designs where we estimate cluster-specific prevalences as well as a common prevalence across clusters, use different group testing algorithms, construct groups from individuals within and across clusters, and consider misclassification. For diseases with low prevalence, our results suggest that accounting for clustering is unnecessary. However, for diseases with higher prevalence and sizeable between-cluster heterogeneity, accounting for clustering in study design and implementation improves efficiency. We consider the practical aspects of our design recommendations with two examples with strong clustering effects: (1) identification of HIV carriers in the US population and (2) laboratory screening of anti-cancer compounds using cell lines.


Introduction
Group testing has been used since the 1940s to reduce the cost of screening a population for disease, among other medical, industrial, and agricultural applications. One simple design [1] allows all cases in the population to be identified using far fewer tests than the population size. Substantial research has introduced more sophisticated (and efficient) designs and improved the efficiency of established designs, e.g. through optimization of group sizes, under both frequentist and Bayesian paradigms [2][3][4][5][6]. A large body of research has addressed the impact of misclassification on group testing [7][8][9][10]. Although a number of researchers have examined the impact of data clustering when using group testing to estimate prevalence [11][12][13][14], little research has focused on the important question of group testing design for screening individuals who are observed in clusters. The motivating examples for our design methodology come from two diverse areas of biological science. The first is the identification of HIV carriers in the United States, where there is substantial geographic heterogeneity in prevalence across states, across counties, or within cities. The second is testing of novel anti-cancer compounds on the NCI-60 cell lines, where there is heterogeneity in the effectiveness of the compound between tumor types. In this setting, the compounds may be difficult to produce or otherwise expensive, and pooling cells from lines of the same tumor type may conserve resources.
Screening for SARS-CoV-2 infection is another area where clusters may plausibly arise, and where group testing may be used to increase screening efficiency. Reverse-transcription polymerase chain reaction (RT-PCR) tests are sensitive and specific for detection of SARS-CoV-2 infection, and it is established that RT-PCR testing can be used on pooled samples, potentially with group sizes as large as 100 individuals [15]. Pooled RT-PCR testing with the Dorfman algorithm has been implemented successfully [16], and more complex designs have been proposed for low-prevalence settings [15]. COVID-19 infections have a potential for clustering on several levels. In addition to broad geographic heterogeneity in community infection rates, clustering may occur on a much smaller scale. For example, if all students at an elementary school are tested weekly, the classes and grade levels form potential clusters due to the potential for transmission: a large proportion of one class may be infected due to close contact within that group, while another class may be entirely free from infection.
Lendle, Hudgens, and Qaqish [17] showed that for hierarchical and matrix group testing procedures, arranging positively correlated data in the same pool results in increased efficiency. Their work assumed that the mean disease prevalence and the between-cluster heterogeneity are known, and did not factor their estimation into design considerations. Our focus is on the practical problem of designing a screening procedure when little is known about the distribution of cluster prevalences. While we estimate cluster prevalences, these estimates are used only to choose the group sizes for subsequent group testing; our focus is on identification of cases, not prevalence estimation overall.
In this paper, we focus on efficient designs, where efficiency is measured by the expected number of tests per subject, using group testing procedures for screening when participants are clustered. We consider a set of efficient practical designs where the group size is determined either overall or for each cluster, requiring either overall or cluster-specific prevalence estimation. In either case, we assume that a small number of tests in each cluster are used to estimate prevalence, and the corresponding group size is chosen based on that estimate. In Section 2, we formally introduce the data structure and briefly review the group testing algorithms under consideration. Section 3 addresses optimization of the number of cluster members individually tested to estimate the cluster-specific prevalences, by cluster size and distribution of the true cluster-specific prevalences. In Section 4, we compare group testing algorithms for clustered data using simulation studies, as a function of overall disease prevalence, variability of prevalence across clusters, cluster size, and number of clusters. Section 5 additionally incorporates misclassification affecting test sensitivity (dilution). Section 6 considers the practical aspects of our design recommendations with two examples with strong clustering effects: (1) identification of HIV carriers in the US population and (2) laboratory screening of anti-cancer compounds using cell lines. In Section 7, we discuss the implications of our results for practical applications.

Notation and assumptions
Suppose that our data consist of n clusters, with disease prevalences P_i, i = 1, . . . , n, where the P_i are i.i.d. Beta random variables with parameters α and β and mean p = α/(α + β). Each cluster contains m_i individuals with a single binary trait X_ij, i = 1, . . . , n, j = 1, . . . , m_i, where the X_ij are Bernoulli random variables, conditionally independent given P_i. This is the standard Beta-Binomial model used by Lendle et al. [17].
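As an illustration, the Beta-Binomial data structure above can be simulated directly. The sketch below (Python, standard library only; parameter values are illustrative) draws each cluster prevalence P_i from a Beta(α, β) distribution and then Bernoulli outcomes X_ij within each cluster.

```python
import random

def simulate_clusters(n, m, alpha, beta, seed=0):
    """Draw cluster prevalences P_i ~ Beta(alpha, beta), then
    X_ij ~ Bernoulli(P_i) for each of the m members of cluster i."""
    rng = random.Random(seed)
    clusters = []
    for _ in range(n):
        p_i = rng.betavariate(alpha, beta)
        members = [1 if rng.random() < p_i else 0 for _ in range(m)]
        clusters.append((p_i, members))
    return clusters

# e.g. 31 clusters of 32 individuals with mean prevalence alpha/(alpha+beta) = 0.1
clusters = simulate_clusters(n=31, m=32, alpha=1.0, beta=9.0)
```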
We consider a series of different two-step procedure designs where we first estimate P_i, i = 1, . . . , n, by individually testing a subset of each cluster, and then choose an efficient group testing design given these estimates. For the estimation step of this procedure, we assume that l_i (l_i < m_i for all i = 1, . . . , n) individuals are drawn from each cluster and individual testing is done in order to estimate P_i. The random variable X_i, denoting the number of cases in this subsample, is assumed to follow a Binomial(l_i, P_i) distribution given P_i. We use four estimators of P_i:

p̂_i = (X_i + 1)/(l_i + 2),   (1)
p̃_i = (X_i + α)/(l_i + α + β),   (2)
p̄_i = (X_i + α̂)/(l_i + α̂ + β̂),   (3)
p̂ = Σ_i X_i / Σ_i l_i.   (4)

Under the Beta-Binomial distribution, p̂_i is the posterior mean under a Uniform(0, 1) (Beta(1, 1)) prior for P_i, p̃_i is the posterior mean under a Beta prior with known parameters α and β, and p̄_i is the posterior mean under a Beta prior where α and β are replaced by their maximum-likelihood estimates α̂ and β̂, respectively. The estimator p̂ is the maximum-likelihood estimator of P_i under the assumption of a constant prevalence across participants (P_i = P). A cluster-specific group size is then chosen based on the P_i estimated in Step 1, using the optimality results presented in Malinovsky and Albert [3] and references therein.
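The four estimators are simple closed forms (Beta-Binomial posterior means and the pooled MLE). The sketch below restates them as functions; the posterior-mean forms follow from standard Beta-Binomial conjugacy and are consistent with the estimator bounds stated in Section 3.

```python
def p_hat_i(x, l):
    """Posterior mean of P_i under a Uniform(0,1) = Beta(1,1) prior."""
    return (x + 1) / (l + 2)

def p_tilde_i(x, l, a, b):
    """Posterior mean of P_i under a Beta(a, b) prior with known a, b."""
    return (x + a) / (l + a + b)

def p_bar_i(x, l, a_mle, b_mle):
    """Empirical-Bayes version: a and b replaced by their MLEs."""
    return (x + a_mle) / (l + a_mle + b_mle)

def p_hat_common(xs, ls):
    """MLE under a constant prevalence P_i = P: pooled cases over pooled tests."""
    return sum(xs) / sum(ls)
```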
In the second step of the two-step procedure, we apply the estimated group sizes from Step 1 to identify cases in the remaining m i − l i individuals from each of the n clusters. We examine the expected number of tests under several group testing algorithms, presented in Section 2.2. Specifically, we focus on three members of the class of nested group testing algorithms [5], where once a group tests positive, the next subgroup tested is a proper subset of that group.

Group testing algorithms
The Dorfman procedure (Procedure D) was introduced by Robert Dorfman [1] for the administration of syphilis blood testing of Army draftees during World War II. It is the simplest and most easily implementable group testing algorithm. A single test is first applied to pooled samples from a group of size k. If this test is negative, all k group members are classified negative; otherwise, each group member is tested individually to identify the cases among these k. Given disease prevalence p (q = 1 − p) and a perfect test (sensitivity and specificity of 100%), the expected number of tests per individual is E_D(T) = 1/k + 1 − q^k, which is minimized at the optimal group size k*_D ≡ k*_D(p) [18]. A derived closed-form expression for k*_D is unavailable, but k*_D ≥ 2 (that is, Procedure D improves upon individual testing) for p < 1 − (1/3)^(1/3), and a conjectured closed-form expression has been verified numerically [3].
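Since no closed form for k*_D is available, the optimal group size can be found numerically; the sketch below evaluates E_D(T) = 1/k + 1 − q^k over candidate group sizes and takes the minimizer.

```python
def dorfman_tests_per_subject(p, k):
    """E_D(T): one pooled test per k subjects, plus k individual
    tests whenever the pool is positive (probability 1 - q^k)."""
    q = 1.0 - p
    return 1.0 / k + 1.0 - q ** k

def optimal_dorfman_group_size(p, k_max=200):
    """Brute-force search for k*_D over 2..k_max."""
    return min(range(2, k_max + 1),
               key=lambda k: dorfman_tests_per_subject(p, k))

k_star = optimal_dorfman_group_size(0.01)   # yields the classic k* = 11 at p = 0.01
```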
Procedure D´ is a modification of D in which, when a positive group's first k − 1 members all test negative individually, the final member is classified as positive without being tested. An algorithm introduced by Sterrett [19] (Procedure S) improves further upon D´ [3]: if a group tests positive overall, its members are tested sequentially one by one until the first case is found. The remaining group members are then pooled and re-tested; if this subgroup tests negative, all of its members are classified as negative, and if it tests positive, the procedure is repeated until all group members are classified. Although more logistically challenging in practice, Procedure S provides an improvement in efficiency over D and D´; an expression for the optimal group size under this design, k*_S, is presented by Malinovsky and Albert [3]. Procedure S has been empirically shown to dominate D´, although this has not been proven in general [20].
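The Sterrett procedure can be written as a short recursion. The sketch below counts the tests needed to classify one group under a perfect assay; as a simplification, it tests every member sequentially and does not apply the end-of-group shortcut used by D´.

```python
def sterrett_tests(group):
    """Number of tests Procedure S uses to classify a group (perfect assay).
    Pool the whole group; if negative, everyone is cleared. If positive,
    test members one at a time until the first case is found, then re-pool
    the remainder and recurse."""
    if not group:
        return 0
    tests = 1                          # pooled test on the whole group
    if not any(group):
        return tests                   # negative pool clears the group
    for j, member in enumerate(group):
        tests += 1                     # individual test
        if member:
            return tests + sterrett_tests(group[j + 1:])
    return tests                       # unreachable with a perfect test
```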
The generalized group testing problem (GGTP) arises when designing a group testing procedure for w individuals, with corresponding probabilities p = p_1, . . . , p_w of being positive (q_i = 1 − p_i). Clustered data are a special case of this general scenario, in which the set of w individuals is partitioned into subsets defined by the clusters and all individuals within a given subset have a common probability of being positive. We may use GGTP results to calculate the expected number of tests in our setting; Equations S1-S3 in the online supplement give the expected number of tests per individual under D, D´, and S, respectively, for any subset of size k ≥ 1.

Group testing with dilution
Test misclassification is an important practical concern for group testing, particularly dilution effects, in which increasing the group size reduces the assay sensitivity. Hwang [8] introduced a function to model this dilution effect (Eq. S4). Although the literature contains research pertaining to other forms of misclassification in a GGTP setting [21,22], Hwang's dilution function does not have a straightforward adaptation to the GGTP setting; we therefore assume that groups are composed of individuals with a common p_i. Hwang also introduced an expected-cost function for the Dorfman procedure in a setting in which the total number of individuals is divisible by the group size (no-residual setting). This cost function may be optimized over k in place of the expected number of tests, based on the unit test cost c (Equations S5, S6), serving as an alternative objective function when selecting the group size. Since, in the presence of dilution, group sizes obtained by optimizing E(T) without consideration of test accuracy are anti-conservative, we introduce two additional quantities: the ratio of the expected number of correct classifications to the total expected number of tests (E(T_C)/E(T_T)) and the ratio of the expected number of missed cases to the total expected number of cases (E(M)/E(D)) (Equations (S7)-(S10)).
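To illustrate the missed-case ratio E(M)/E(D), the Monte Carlo sketch below uses a hypothetical pooled-sensitivity function sens(i, k) as a stand-in for the dilution model of Eq. S4 (which is not reproduced here); individual tests are taken as perfect, so cases are missed only through false-negative pools.

```python
import random

def missed_case_ratio(p, k, sens, n_groups=20000, seed=1):
    """Monte Carlo estimate of E(M)/E(D) for the Dorfman procedure when a
    pool of size k containing i cases tests positive with probability
    sens(i, k). Individual follow-up tests are assumed perfect."""
    rng = random.Random(seed)
    missed = cases = 0
    for _ in range(n_groups):
        group = [rng.random() < p for _ in range(k)]
        i = sum(group)
        cases += i
        if i and rng.random() >= sens(i, k):
            missed += i        # a false-negative pool misses every case in it
    return missed / cases if cases else 0.0

# hypothetical dilution curve: pooled sensitivity decays as cases are diluted
ratio = missed_case_ratio(0.05, 10, lambda i, k: (i / k) ** 0.1)
```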

Choice of l
In the absence of test misclassification and for a given number of resolved individuals used to estimate cluster prevalence l_i, cluster size m_i, and Beta distribution parameters α and β, we can obtain the expected number of tests per individual, E(T), through successive use of the Law of Total Expectation. In general, selecting such an l_i is a balancing act; a larger l_i allows for more precise estimation of P_i within a cluster, at the cost of increasing the number of individuals resolved prior to group testing. Additionally, our estimators p̂_i, p̃_i, and p̄_i are bounded away from zero; their minimum values are 1/(l_i + 2), α/(l_i + α + β), and α̂/(l_i + α̂ + β̂), respectively, all of which may be much larger than p_i.
We denote the observed number of cases among the l_i resolved individuals by x_i and the true case prevalence in the cluster by p_i, realizations of the random variables X_i and P_i, respectively. For simplicity, in this section we assume that m_i − l_i is divisible by the estimated optimal group size k̂*_i ≡ k*_i(x_i). Without loss of generality, we may calculate the expected number of tests per individual for a single cluster of size m_i. The expressions below are written in terms of p̂_i, although p̃_i or p̄_i may be substituted. Using the properties of conditional expectation, we calculate the expected number of tests under Procedure D, D´, or S (Equation (8)). In evaluating expression B from Equation (8), we must take care: this term is the expectation under the true P_i = p_i, under Procedure D, D´, or S, using groups of size k̂*_i (Equation (9)). Notably, the design elements of Equation (9) are defined by p̂_i as estimated from x_i, while the expectation is taken relative to the true prevalence p_i. Given P_i = p_i, X_i is a Binomial(l_i, p_i) random variable, and so we can evaluate expression A from Equation (8), the expected number of tests under Procedure D given P_i = p_i.
Finally, P_i is a Beta(α, β) random variable, and we may find the overall expected number of tests given α, β, the cluster size m_i, and the number of resolved individuals l_i. For our results, we used numerical techniques to obtain values of E(T); namely, we took 0 = p_0 < p_1 < . . . < p_200 = 1 to be a sequence of evenly spaced points and approximated the expectation over the Beta distribution by a weighted sum over this grid. For a given α, β, and m_i, a local optimum of l_i can be obtained by calculating E(T) for a series of candidate values of l_i, say {2, 3, . . . , 30}, and taking l*_i to be the value of l_i that minimizes E(T). Empirically, l*_i was a unique minimum of E(T) in all simulations. These calculations may also be performed under Procedures D´ and S by substituting the corresponding expected number of tests into Equation (8). Across all values of α, for all three procedures, and for each of p̂_i, p̃_i, and p̄_i, we found that for small-to-moderate cluster sizes (approximately m_i ≤ 200), the optimum l*_i in our setting was nearly invariant to the true mean prevalence. This is reassuring from a design standpoint: in practice, it is unlikely that α and β will be known precisely by researchers, while cluster sizes are likely to be known, indicating that an acceptable number of individuals to resolve may be chosen based on the cluster size and group testing procedure alone. Results were comparable for p̃_i and p̄_i. Averaging across a range of Beta distributions, we found that for different estimators of p_i and group testing procedures, l_i = 8 performs well (Figure S1 in the online supplement).
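The grid computation described above can be sketched as follows. As a simplified illustration, this version uses Procedure D with the estimator p̂_i only, ignores cluster residuals, and evaluates E(T) for one cluster by summing over the Binomial distribution of X_i and a grid approximation to the Beta distribution of P_i; it is a sketch of the approach, not the paper's exact expressions.

```python
import math
from functools import lru_cache

def beta_pdf(p, a, b):
    if p <= 0.0 or p >= 1.0:
        return 0.0
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * p ** (a - 1) * (1 - p) ** (b - 1)

def binom_pmf(x, n, p):
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def dorfman_rate(p, k):
    # expected tests per subject under Procedure D with group size k
    return 1.0 if k == 1 else 1.0 / k + 1.0 - (1.0 - p) ** k

@lru_cache(maxsize=None)
def k_star(p):
    # optimal Dorfman group size (k = 1 means individual testing)
    return min(range(1, 101), key=lambda k: dorfman_rate(p, k))

def expected_tests(l, m, a, b, grid=200):
    """E(T) for one cluster: l individual tests, then Dorfman testing of the
    remaining m - l members with group size k*(p_hat_i), averaging over
    X_i | P_i = p (Binomial) and P_i (Beta, via a grid approximation)."""
    total = weight = 0.0
    for t in range(1, grid):
        p = t / grid
        w = beta_pdf(p, a, b)
        inner = sum(binom_pmf(x, l, p) * dorfman_rate(p, k_star((x + 1) / (l + 2)))
                    for x in range(l + 1))
        total += w * (l + (m - l) * inner)
        weight += w
    return total / weight

# l* minimizes E(T) over candidate values of l
best_l = min(range(2, 21), key=lambda l: expected_tests(l, 100, 1.0, 9.0))
```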
While the above calculations are performed for a single cluster, the results hold for multiple clusters of the same size or differing sizes; under the former, each cluster has the same value of l * , while under the latter, each cluster has an optimum l * i computed based on its size m i .
The calculations above assume that the size of the remaining cluster is divisible by the estimated group size k̂*_i. In practice, however, this assumption is unlikely to hold, and the overall partition of groups within each cluster must be adjusted to distribute residual cluster members among groups (finite-sample algorithm adjustment). Using the algorithm adjustment established for the standard setting, the groups in the adjusted partition differ in size from k̂*_i by at most one individual. We assessed the impact of this assumption through simulations and found that the optimality results for l*_i are only minimally affected by cluster residuals (Figure S2). Section A3 of the supplementary material and Figure S3 examine the choice of l_i when using p̂.
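One simple version of the finite-sample adjustment can be sketched as below: choose the number of groups nearest m/k and spread the residual members so that the resulting group sizes differ from one another by at most one. The exact rule used in the paper may differ in detail.

```python
def partition_cluster(m, k):
    """Partition m remaining cluster members into groups as close as
    possible to the target size k, spreading any residual so that the
    group sizes differ by at most one (a sketch of the finite-sample
    adjustment; the paper's exact rule may differ)."""
    if m <= 0:
        return []
    g = max(1, round(m / k))          # number of groups nearest the target
    base, extra = divmod(m, g)
    return [base + 1] * extra + [base] * (g - extra)
```

For example, a cluster remainder of 32 with estimated group size 5 yields six groups, two of size 6 and four of size 5.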

Group construction and cluster integrity
In addition to comparing group testing algorithms and prevalence estimators, we also investigate the handling of individuals during construction of the group testing partition; Lendle, Hudgens, and Qaqish [17] showed that arranging positively correlated data in the same pool results in increased efficiency, but in applications this may add logistical challenges, as described below. To investigate the amount of efficiency gain, we use simulations to consider four possibilities, illustrated by the following example (as well as Table S1). Suppose that biological samples are being assessed to identify cases of HIV infection, with clustering formed by geographic region (e.g. state or county). The lab analyzing these samples could then: (1) Estimate HIV prevalence within each region (p̂_i, p̃_i, or p̄_i) and test groups each of which consists exclusively of samples from a single region. (2) Estimate HIV prevalence within each region (p̂_i, p̃_i, or p̄_i) and test groups that consist of samples from regions with the same prevalence estimate. (3) Estimate a common HIV prevalence across all regions (p̂) and test groups each of which consists exclusively of samples from a single region. (4) Estimate a common HIV prevalence across all regions and test groups without regard to region.
We may consider these scenarios as using the cluster structure in both design and implementation, in design only, in implementation only, and in neither. Scenarios 1 and 3 pose more logistical challenges than 2 and 4: As samples arrive at the lab for testing, the lab must wait for sufficient samples from each region to accrue before conducting a test for that region, while in Scenario 2 samples may accrue into a group from multiple regions simultaneously and in Scenario 4 they may accrue from all regions simultaneously.
For our simulations, we refer to this implementation-level sample handling as 'individual handling' and refer to scenarios 1 and 3 as maintaining clustering and 2 and 4 as ignoring clustering. In practice, Scenarios 1 and 3 may combine some samples across clusters, within a supercluster (1) or overall (3), to reduce the number of residuals within each cluster.

Simulation design
Using simulations, we assessed the differences between a number of group testing algorithms for clustered data. Broadly, we may categorize the algorithms used based on (1) individual handling (cluster structure 'maintained' vs 'ignored'), (2) group testing procedure, and (3) estimation of P_i, as summarized in Table S1. We estimate P_i as p̂_i (Equation (1)), p̃_i (Equation (2)), p̄_i (Equation (3)), or p̂ (Equation (4)).
Using cluster-specific prevalence estimates p̂_i (or p̃_i or p̄_i), we may apply a group testing algorithm (Procedure D, D´, or S), with finite-sample adjustment to group sizes when the cluster size is not evenly divisible by the group size, to superclusters comprised of all clusters with the same p̂_i (individuals X_wj from clusters w such that p̂_w = p̂_i), for each unique p̂_i, using p̂_i to estimate the group sizes k*_i. Under p̂, we apply Procedure D, D´, or S to all individuals, with finite-sample adjustment, with an estimated group size k* calculated using p̂.
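At the implementation level, forming superclusters amounts to grouping clusters by their shared prevalence estimate; each resulting supercluster is then pooled and tested with its own estimated optimal group size. A minimal sketch:

```python
from collections import defaultdict

def form_superclusters(estimates):
    """Group cluster indices by a shared prevalence estimate; each resulting
    supercluster is tested together with its own optimal group size."""
    groups = defaultdict(list)
    for i, e in enumerate(estimates):
        groups[round(e, 10)].append(i)   # round to guard against float noise
    return dict(groups)

# clusters 0 and 2 share an estimate, so they form one supercluster
sup = form_superclusters([0.1, 0.3, 0.1, 0.2])
```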
For the 24 group testing designs summarized above and in Table S1, we calculated E(T) and Var(T) for n clusters of size m, with prevalences following a Beta(α, β) distribution, through simulations. Using Procedures D, D´, and S on superclusters (18 designs), we followed this procedure: (1) Calculate l* for m, α, β, the procedure (D, D´, or S), and the prevalence estimator (p̂_i, p̃_i, or p̄_i); as all clusters were the same size, we calculated a common l*_i = l*. (2) For each simulated cohort: (a) individually test l* members of each cluster and compute the cluster prevalence estimates; (b) combine clusters sharing a common estimated prevalence into superclusters and randomize the order of the remaining individuals within each supercluster. In other words, for a commonly estimated prevalence, we randomized across all clusters, and for cluster-specific prevalence estimates, we randomized across all clusters that share a common estimated prevalence. (c) Construct a partition of groups for each supercluster using its prevalence estimate and corresponding optimal group size k*(x_j). For Procedures D, D´, and S using p̂ (6 designs), we follow the steps above, using l* calculated for each procedure under p̂; Step 2 proceeds with all n clusters combined into a single supercluster with estimated prevalence p̂.
In addition to the 24 group testing designs summarized above and in Table S1, we varied the overall size of the data set (n × m ≈ {500, 1000, 2000}), the number and size of the clusters (10 clusters; clusters of size 10; n ≈ m), the mean prevalence p = α/(α + β) ∈ {0.01, 0.05, 0.1, 0.2, 0.3}, and α ∈ {0.5, 0.75, 1, 1.5, 2}. In all cases, we simulated 100,000 cohorts. Table 1 provides an overview of the simulation results most relevant to practical study design: use of cluster structure in the design and in implementation of the group testing procedure (Section 4.1.1).

Table 1. Expectation E(T) of the number of tests per subject, by use of cluster structure in the design (use of p̂_i vs. p̂) and use of cluster structure in implementation ('Maintained' vs. 'Ignored'), mean prevalence p, and Beta shape parameter α, for Procedures D, D´, and S; n = 31, m = 32.

This table presents the expected number of tests per subject for small (p = 0.01), moderate (p = 0.1), large (p = 0.3), and very large (p = 0.5; bimodal Beta distributions) mean prevalences, three Beta shape parameters (α = 0.5, 1, 2), and a variety of study designs (Procedures D, D´, and S; p̂ and p̂_i; cluster structure maintained or ignored during group construction). Simulated prevalence distributions are illustrated in Figure 2. Across all three algorithms, estimating cluster-specific prevalences improved efficiency (E[T]) for high overall prevalence or a bimodal prevalence distribution; otherwise, the boundedness and imprecision of p̂_i outweigh the ability to calculate group sizes based on clusters' individually estimated prevalences. Maintaining cluster structure when constructing groups increased efficiency in every scenario for Procedures D and D´, while Procedure S was occasionally more efficient when cluster structure was ignored, due to the asymmetry in its expected number of tests. Overall, however, efficiency was nearly equivalent between the two implementation paradigms.

Results
Tables S2-S9 expand the results of Table 1 to a wider range of values and provide the results of additional simulations, varying the overall size or composition of the simulated data, varying the cluster prevalence estimator, and examining specific extreme distributions. Applying the group testing algorithms using the true p or p_i (Table S2), we observe the established efficiency rankings between procedures. Substituting p̂ for p reduces efficiency overall (Table S3), and estimating cluster-specific prevalences with p̂_i further reduces efficiency due to the imprecision and boundedness of this estimator (Table S4). For equal values of p, differences in E(T) across varying Beta distribution parameters were small, indicating that the choice of study design can be made using knowledge of p but without precise values of α and β (Table S5). Performance is roughly equivalent regardless of individual handling during group construction, across a wide range of parameter values (Table S6). Increasing n for fixed m does not measurably improve E(T) but does reduce σ(T), while increasing m for fixed n improves E(T) but not σ(T) (Table S7). Comparing cluster prevalence estimators, p̂_i is more efficient than p̃_i and p̄_i, which perform equivalently (Table S8). For α = β ≤ 1, group testing using p̂_i is more efficient than single-unit testing, as we can categorize such clusters as either very high or very low prevalence (Table S9a). These results hold for mixture distributions with higher heterogeneity than standard Beta distributions and for bimodal mixture distributions with modes at values other than 0 and 1 (Table S9c), and when varying β rather than α (data not shown).

Dilution simulations
We used simulations similar to those described in Section 4.1 to evaluate the impact of dilution on Dorfman group testing designs with clustering; full details on these simulations are given in Section A4 of the supplementary material. Table 2 and Tables S10-S14 provide the results of these simulations.
Briefly, introducing dilution without accounting for it in the choice of group size nominally increases efficiency by reducing the total number of expected tests. However, this reduction comes from missing cases during screening. Using p̂_i rather than p̂ reduces efficiency for small p, but may provide a more tolerable rate of missed cases (Table S10). Accounting for dilution reduces group size, with smaller groups for a higher unit test cost parameter c (Equations S5, S6) (Table S11), correspondingly increasing the number of tests but decreasing the proportion of missed cases. When using p̂_i, the relative proportion of missed cases decreases with cluster size, but no clear relationship holds for p̂ (Table S12). The proportion of missed cases was consistent across different choices of α (Table S13). For α = β ≤ 1, not only is group testing using p̂_i more efficient than single-unit testing, but few cases are missed, as they are concentrated within high-prevalence clusters that receive individual testing (Table S14).

HIV prevalence
We used data tabulating HIV prevalence rates by state (including Washington DC), county, and ZIP code in a simulation study using observed clusters to examine the performance of our algorithms in a large-scale public health setting. Identification of HIV-positive people, particularly while they are asymptomatic, has immense public health value, as HIV-positive individuals can receive anti-retroviral therapy to manage the infection, and individuals' knowledge of their HIV status helps to reduce the spread of the disease. Currently, the US Preventive Services Task Force recommends that clinicians screen for HIV infection in adolescents and adults aged 15-65 years and in all pregnant women. Pooled blood samples have previously been used to reduce the cost of screening for acute HIV infection in low-prevalence populations and of screening for failure of anti-retroviral therapy [23,24]. Additionally, standard rapid serum antibody assays have been shown to retain their high sensitivity when diluted to 1:20; false negatives from pooled samples would have been false negatives in individual testing by the same assay [25]. HIV prevalence rates from 2016 were obtained from the AIDSVu interactive online mapping tool, which compiles state and county HIV prevalence rates from the CDC Division of HIV/AIDS Prevention and ZIP code prevalence rates directly from state and local health departments [26]. The CDC-estimated rates include people with unknown HIV infections, while the local rates include known diagnoses only. Statewide HIV prevalence rates are highly variable, ranging from 74.4 per 100,000 individuals in Wyoming to 2831.6 per 100,000 in Washington DC. Similarly, available data from counties range from 14 to 2306 per 100,000 (in Butler County, PA, and Union County, FL, respectively). For our analysis by ZIP code, we used data from Atlanta, GA, which range from 105 to 7464 per 100,000 (ZIP codes 30041 and 30303, respectively) (Figure 3).
Since the AIDSVu database contains HIV prevalences by region, but not individual-level data, we used the 2016 prevalences to simulate cohorts and then applied the simulation methods described in Section 4.1.2 to obtain E(T). We obtained l* by estimating α and β from the respective data sets; in a real-world HIV screening situation, researchers would likely be able to estimate the overall distribution of HIV prevalence rates from recent geographic prevalence data despite not knowing the exact present location-specific prevalences, as drastic and sudden shifts in the overall prevalence distribution are unlikely. For each geographic grouping, we simulated 50,000 cohorts, with clusters of size m = 5000 for the states, m = 500 for the counties, and m = 50 for the ZIP code regions. For Procedures D, D´, and S, using p̂ was more efficient than using p̂_i; despite the heterogeneity in HIV prevalences by region, prevalence is still low enough overall to favor group testing algorithms that do not account for clustering (Table 3).

Cell lines
The NCI-60 cell lines encompass 60 different human tumor cell lines that are used to identify and characterize novel compounds for anti-cancer activity, as measured by growth inhibition or killing of tumor cells [27]. These data are publicly available using the online COMPARE database provided by the National Cancer Institute (NCI) Division of Cancer Treatment and Diagnosis (DCTD) Developmental Therapeutics Program (DTP) [28]. Group testing could plausibly be used in this setting as the compounds tested can be rare and difficult to harvest and/or manufacture; it may be feasible to test compounds on groups comprised of pooled cells from lines of the same tumor tissue type (e.g. breast), measure the anti-cancer activity within that pool, and retest individual lines if any activity is seen overall. As heterogeneity is expected across tumor types, we can consider this to be a clustered data setting, where the clusters are defined by the nine types of tumor present in the NCI-60 lines (Breast, Central Nervous System (CNS), Colon, Leukemia, Melanoma, Non-Small-Cell Lung Cancer (NSCLC), Ovarian, Prostate, and Renal). As a binary measure of anti-cancer activity, we used GI50 data recorded on the NCI-60 lines, which measure the concentration causing 50% growth inhibition [29], dichotomized such that log10 concentration values ≤ −6 indicated the presence of activity, while values > −6 indicated its absence. We investigated three compounds: bulbophyllanthrone (NSC-708791), ethoxycurcumin trithiadiazolaminomethylcarbonate (NSC-742020), and carboxyphthalato platinum (NSC-271674). In each case, due to the small size of the clusters (range 2-9, median 7 cell lines), we used l = 2 and estimated p̂_i for each cluster and p̂ across all clusters. We then applied Procedures D, D´, and S to the remaining data, using estimated optimal group sizes based on p̂_i and p̂, and recorded the observed number of tests (and corresponding observed average number of tests per subject).
Due to the potential for interaction between cell lines, we did not group cell lines from different tumor types. In this example, exact cluster sizes and individual testing results were available for each cluster, compound, and cell line, and there is no inherent ordering to the individuals within the clusters. In order to assess overall algorithm performance for these data, we repeated our calculations across 5000 randomized orderings of the individuals within the clusters and present the average across these permutations of the data. Table 4 shows the observed rates of anti-cancer activity for these compounds across the nine tumor types and the average number of tests per subject across these 5000 permutations of the data. For NSC-708791, p̂ was more efficient than p̂_i for all three algorithms; despite the heterogeneity between cell lines, the overall proportion of cell lines for which the compound is active is low. p̂_i provided higher efficiency for Procedure D and both

Discussion
In this paper, we have proposed different design strategies for disease screening using group testing when the population of interest exhibits clustering. Using two motivating examples that cover a broad range of applications in the biosciences, we show the importance of our design results in practical situations. We considered different group testing algorithms, different approaches to estimating cluster prevalences, different ways to combine individual samples into groups, and dilution effects. Counter to our intuition, we found that in most situations, under our framework, estimating cluster-specific prevalences for determining group sizes did not increase efficiency relative to obtaining and using an overall estimate of prevalence. This is particularly true when the disease prevalence is small (< 0.20) and the cluster variation is not extreme. However, when the prevalence and the variability between clusters are high, there can be sizeable efficiency gains, relative to single-unit testing, from accounting for cluster structure in both design and implementation. Group testing is currently rarely used in practice when disease prevalence is high, but an approach incorporating cluster structure may still improve efficiency when both the overall disease prevalence and the inter-cluster heterogeneity are high, as it allows single-unit testing on clusters with high prevalence and group testing on clusters with low prevalence.
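For intuition, recall the classical Dorfman (Procedure D) objective that the group-size choice minimizes: with prevalence p and group size k, the expected number of tests per subject is 1/k + 1 − (1 − p)^k. A minimal sketch of this optimization (illustrative only, not the simulation code used in the paper):

```python
def dorfman_tests_per_subject(p: float, k: int) -> float:
    """Expected tests per subject under two-stage Dorfman testing with
    group size k and prevalence p, assuming a perfect test."""
    if k == 1:
        return 1.0  # single-unit testing: one test per subject
    # one pooled test per k subjects, plus k retests if the pool is positive
    return 1.0 / k + 1.0 - (1.0 - p) ** k

def optimal_group_size(p: float, k_max: int = 100) -> int:
    """Group size minimizing the expected tests per subject; a result of
    k = 1 means group testing is not worthwhile at this prevalence."""
    return min(range(1, k_max + 1), key=lambda k: dorfman_tests_per_subject(p, k))

for p in (0.01, 0.05, 0.20, 0.35):
    k = optimal_group_size(p)
    print(p, k, round(dorfman_tests_per_subject(p, k), 3))
```

At p = 0.35 the minimizer is already k = 1, consistent with the observation that single-unit testing is preferred for high-prevalence clusters.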
In order to address the scenario where test sensitivity varies with group size, we incorporated dilution in simulations for the Dorfman design; we considered only this design because it is simple and practical to implement in real-world settings. Dilution introduces an additional consideration in comparing designs: in addition to minimizing the number of tests, one can aim to reduce the number of missed cases. This may be done by changing the objective function for the choice of group size. However, if one does not want to specify additional parameters, using cluster-specific prevalence estimates for the choice of group sizes can provide this reduction because, on average, the boundedness of the prevalence estimates reduces the group sizes. This creates a 'balancing act' in design selection: by the results discussed above, estimating cluster-specific prevalences often reduces efficiency in terms of the number of tests, but it may also reduce the number of missed cases in a setting with dilution. For example, if a small amount of dilution (d = 0.05) is present but unknown or unaccounted for, using cluster-specific prevalence estimates can nearly halve the number of missed cases (Table S12b).
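To make the trade-off concrete, one can compute both quantities under a simple dilution model in which pool sensitivity decays geometrically with pool size; the model form and the parameter values below are our illustrative assumptions, not the specification used in the simulations:

```python
def dorfman_with_dilution(p: float, k: int, se: float = 0.99, d: float = 0.05):
    """Expected (tests, missed cases) per subject under Dorfman testing,
    with an assumed dilution model se_pool = se * (1 - d)**(k - 1) and
    perfect specificity; individual retests use sensitivity se."""
    if k == 1:
        return 1.0, p * (1.0 - se)
    se_pool = se * (1.0 - d) ** (k - 1)
    # pools are retested (k individual tests) only when flagged positive
    p_flagged = se_pool * (1.0 - (1.0 - p) ** k)
    tests = 1.0 / k + p_flagged
    # a case is missed if its pool tests negative, or if the pool tests
    # positive but the individual retest is negative
    missed = p * (1.0 - se_pool) + p * se_pool * (1.0 - se)
    return tests, missed

for k in (3, 8):
    tests, missed = dorfman_with_dilution(p=0.05, k=k)
    print(k, round(tests, 3), round(missed, 4))
```

Shrinking the group size raises the expected number of tests but lowers the per-subject miss probability; cluster-specific estimates, by yielding smaller groups on average, buy fewer missed cases at the cost of more tests.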
There exist complex dynamic programming algorithms to find optimal and efficient groupings in the GGTP setting if the individual prevalences are known (Procedure D: Hwang [30]; Procedure S: Malinovsky [31]). These methods perform equivalently to simply applying Procedures D and S on superclusters, due to the discreteness of the distribution of p_i. There also exists an optimal nested group testing procedure for unclustered data (Sobel and Groll [5]), but it is not optimal for clustered data. Our work assumes that nothing is known a priori regarding the exact cluster prevalences, although the distribution of cluster prevalences may be known. This is not necessarily the case in all settings; for example, historical data may provide limited information about prevalence for some or all clusters.
In such a situation, it may be possible to use this information to estimate l * i as well as the current cluster prevalence, or to use group testing when estimating cluster prevalence.
We considered different ways to formulate groups, including composing groups from members of the same cluster, composing groups from individuals with the same estimated cluster-specific prevalence (i.e. within a supercluster), and disregarding clustering entirely. As with the choice between cluster-specific and population-wide prevalence estimation, disregarding clustering when formulating groups does not result in substantial efficiency loss when the disease prevalence is small and the heterogeneity is not extreme.
We considered a design where we estimate the prevalence (either cluster-specific or common) from a small number of individual tests performed on each cluster and then apply group testing procedures to the remaining individuals in each cluster. This design was chosen for its practical simplicity. However, Bayesian adaptive group testing designs could be developed in which design choices (e.g. group size) are updated as more testing is done in each cluster [6]. To calculate the optimal number of individual tests to perform on each cluster, we assume that the cluster prevalences follow a Beta distribution; however, the group testing may be conducted using nonparametric prevalence estimates. For a non-Beta prevalence distribution, the number of individual tests l used for estimation may no longer be optimal, but Figure S1 indicates that this value depends primarily on cluster size and group testing procedure, so even a non-optimal l for the appropriate procedure and cluster size is likely reasonable despite distribution misspecification.
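As a schematic of this two-stage design (the positive counts, the Beta(1, 1) prior, and the use of the Dorfman objective below are all hypothetical choices for illustration):

```python
def estimate_prevalence(x: int, n: int, a: float = 1.0, b: float = 1.0) -> float:
    """Posterior-mean prevalence estimate from x positives in n individual
    tests, under a Beta(a, b) prior (a = b = 1 is a hypothetical default)."""
    return (x + a) / (n + a + b)

def dorfman_group_size(p_hat: float, k_max: int = 100) -> int:
    """Group size minimizing the estimated Dorfman tests per subject."""
    def cost(k: int) -> float:
        return 1.0 if k == 1 else 1.0 / k + 1.0 - (1.0 - p_hat) ** k
    return min(range(1, k_max + 1), key=cost)

# Stage 1: l individual tests per cluster (hypothetical positive counts).
l = 10
positives = [0, 0, 1, 0, 2, 0, 1, 0]

# Cluster-specific estimates take few distinct values (discreteness),
# so they map to a small set of group sizes.
k_cluster = [dorfman_group_size(estimate_prevalence(x, l)) for x in positives]

# Pooling all l * n_clusters tests gives one common group size.
p_pool = estimate_prevalence(sum(positives), l * len(positives))
k_pool = dorfman_group_size(p_pool)

# Stage 2 would apply Procedure D with these sizes to the remaining subjects.
print(sorted(set(k_cluster)), k_pool)
```

The small set of distinct cluster-specific group sizes illustrates why, under modest heterogeneity, the pooled estimate loses little relative to per-cluster estimation.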
Overall, our recommendation is that in most scenarios it is not necessary to account for clustering when performing group testing on clustered data; it is sufficient to estimate a single overall prevalence across all clusters and use this estimate for the choice of group size. The boundedness of the prevalence estimator's distribution results in inaccurate estimation of low cluster prevalences (and of the corresponding optimal group sizes) when a reasonable number of observations is used for estimation. Further, its discreteness means that even non-negligible amounts of between-cluster heterogeneity are represented by a limited number of possible (and an even more limited number of probable) group sizes. However, there may be efficiency gains from incorporating cluster-specific prevalence estimation into the design under extreme between-cluster heterogeneity when the overall prevalence is not too small (i.e. when a non-negligible fraction of clusters has prevalences on either side of the threshold for favoring single-unit versus group testing), and if one suspects the presence of dilution, estimating prevalence cluster by cluster may reduce the number of missed cases.

Funding
This work was supported by the Intramural Research Program of the National Cancer Institute. The views presented in this article are those of the authors and should not be viewed as official opinions or positions of the National Cancer Institute, National Institutes of Health, or US Department of Health and Human Services. The research of YM was supported by grant number 2020063 from the United States-Israel Binational Science Foundation (BSF), Jerusalem, Israel.

Data availability statement
The data that support the findings of this study are openly available from the AIDSVu web portal at https://aidsvu.org and the National Cancer Institute Division of Cancer Treatment and Diagnosis Developmental Therapeutics Program Public COMPARE portal at https://dtp.cancer.gov/databases_tools/compare.htm.