Towards U-statistics clustering inference for multiple groups

Inference in clustering is paramount to uncovering inherent group structure in data. Clustering methods that assess statistical significance have recently drawn attention due to their role in identifying patterns in high-dimensional data, with applications in many scientific fields. Towards developing a general framework for clustering into multiple groups, we present here a U-statistics-based approach, specially tailored for high-dimensional datasets, that clusters the data into three groups while assessing the significance of such partitions. We also consider theoretical aspects of allowing for an outlier group. Our approach builds on the U-statistics-based clustering framework of the methods in the R package uclust and inherits its properties: it is a non-parametric method relying on very few assumptions about the data, and can therefore be applied to a wide range of datasets. Furthermore, our method aims to be a statistically powerful tool for finding the best partition of the data into three groups when that particular structure is present. To do so, we first propose an extension of the test U-statistic and develop its asymptotic theory. Additionally, we propose a ternary non-nested significance clustering method. Our approach is tested through multiple simulations and is shown to be comparable to, or more statistically powerful than, competing alternatives in all scenarios considered. An application to image recognition data showcases our method.


Introduction
In clustering analysis, the aim is to divide data into groups of similar items. A large number of algorithms based on different approaches have been proposed to accomplish this task, potentially leading to different results [1]. Clusters can be inherently present in the data, as in phylogenetic analysis [2,3], or they can be built because of context-specific demands regardless of whether some underlying cluster structure is present, as in customer segmentation [4,5]. A critical issue is how to discover inherent cluster structure in data; in other words, whether the clusters represent in fact an important feature or are simply the result of sample variation. This becomes even more challenging in the context of high-dimensional data. We present here a U-statistics-based approach that clusters the data into three groups while assessing the significance of such partitions.
Our method is specially tailored for high-dimensional data and adaptable to different dissimilarity measures.
In a typical application of inference in clustering, when the groups are predefined and there is no need for an algorithm to find them, the null hypothesis is typically that all groups are random samples from the same population (overall sample homogeneity). In the multivariate analysis of variance (MANOVA) procedure, when presented in terms of a linear model, homogeneity of groups stands for equality of means between all groups. Assumptions of independence and normality of the data, homoscedasticity of variance and within-group homogeneity are required for exact finite sample inference. In addition, a large sample size, depending on the dimension of the data, is generally necessary.
Several other approaches have been proposed to assess statistical significance in clustering. Examples include the procedure presented in [6], which considers mixture models of distributions such as the Gaussian, and the maximum likelihood approach used in [7] to test a no-clusters hypothesis. However, when the data are high dimensional and sample sizes are small, complete parametric estimation becomes increasingly challenging, usually requiring costly matrix inversions. [8,9] address this issue through dimensionality reduction and sparse covariance matrix estimation. [10] propose an approach inspired by the bootstrap which is implemented in the R package pvclust [11] and frequently used in phylogenetics to assess confidence in hierarchical clustering. [12] propose a statistical test to assess the significance of clustering the data into K groups, specifically tailored to the high dimension low sample size (HDLSS) scenario, which has been implemented in the R package sigclust. [13] further develop the methodology through a soft thresholding approach employed to address inflated type I errors, leading to significant sigclust improvements. However, the implementation considers only two groups. Furthermore, [14] extend their method to assess significance in hierarchical clustering. This approach assumes that the data come from a single multivariate normal distribution, which can be an issue since rejection of the no-clusters hypothesis may be a simple consequence of non-normal data.
While most of the above-mentioned methods assume predetermined cluster assignments for testing, [15] propose a test of overall sample homogeneity for HDLSS data in which no group affiliations are previously informed. The method builds on the U-statistic theory developed in [16] and searches among all data partitions into K = 2 groups for any significant partitions indicating sample heterogeneity. [17] then propose a significance clustering method in this context, named uclust, which returns the significant partition that best separates the data, if such a partition exists. Additionally, they extend the underlying U-statistic $B_n$, which originally requires that each group have at least two elements, to accept groups of size one, allowing for more flexibility in the data structure. Finally, combining these developments, they propose a hierarchical clustering method in the same framework. These methods enjoy some common properties, such as very few assumptions on the data structure, a wide choice of dissimilarity kernels and more statistical power than competing alternatives in the HDLSS setting.
In this context, a natural extension of this methodology is to consider significance clustering in the more general K ≥ 3 group scenario. Such a development would enable us to verify whether a dataset should be naturally divided into K clusters, and to return such a partition accompanied by a p-value. [16] provide the theoretical basis for such an extension; however, further developments are still required, such as extending the multigroup $B_n$ statistic to allow for groups of size one, determining the distribution of the test statistic, and adapting the optimization algorithms to the multi-group scenario. Because these are not all immediately generalizable to the arbitrary K case, in this paper we present an extension of the uclust method to the case of clustering the data into three groups, one of which may have size 1. This represents an important stepping stone for the development of the general method.
The steps to developing our three-group clustering method are outlined as follows. First, in Section 2.1 we review the U-statistics-based theory of the homogeneity test of [15] and present the U-statistics theory for three groups. In Section 2.2, we present the extension of the $B_n$ statistic to contemplate three groups in which one may have size 1, in order to devise a clustering algorithm that can properly identify outlier elements. Additionally, an investigation of theoretical properties showing its compatibility with the previous framework and asymptotic theory is presented. In Section 2.3, we explore the variance of the extended $B_n$ and propose an approach for its estimation. In Section 3, we propose the uclust3 method, which finds the statistically significant data partition that best separates the sample into three groups. In Section 4, simulation studies show that our method presents adequate control of type I error and increasing power to identify clusters as they become more separated. Furthermore, our comparative simulation study shows that, when there are exactly three groups, our approach has comparable or greater power, that is, ability to correctly identify three clusters, than competing methods. An application to real data is presented in Section 5. Finally, in Section 6 we discuss the overall results.

U-statistics-based test for three-group separation
Let $X = (X_1, \ldots, X_n)$ be a random sample of $n$ $L$-dimensional vectors divided into three groups $G_1$, $G_2$ and $G_3$ of sample sizes $n_1$, $n_2$ and $n_3$, respectively, where $n = n_1 + n_2 + n_3$. In the $g$th group, for $g \in \{1, 2, 3\}$, observations $X^{(g)}_1, \ldots, X^{(g)}_{n_g}$ are assumed to be independent and identically distributed with an $L$-variate distribution $F_g$. Here, the distribution $F_g$ admits finite mean vector $\mu_g$ and positive definite dispersion matrix $\Sigma_g$ (not necessarily multivariate normal). Following the approach of [16,18], we define the functional parameter
$$\theta(F_g, F_{g'}) = E\big[\varphi(X^{(g)}, X^{(g')})\big], \quad X^{(g)} \sim F_g,\ X^{(g')} \sim F_{g'}, \qquad (1)$$
where $g, g' \in \{1, 2, 3\}$ and $\varphi(\cdot, \cdot)$ is a symmetric kernel of order 2. In the present context, $\varphi(\cdot, \cdot)$ is usually taken to be a distance or dissimilarity measure. If we assume that $\theta(\cdot, \cdot)$ is a convex linear function of its marginal components, then we have
$$\theta(F_g, F_{g'}) \ge \tfrac{1}{2}\big[\theta(F_g, F_g) + \theta(F_{g'}, F_{g'})\big] \qquad (2)$$
for all distributions $F_g$ and $F_{g'}$, with equality holding whenever $\mu_g = \mu_{g'}$. Note that the functional $\theta(\cdot, \cdot)$ can be used to define the average dissimilarity both within and between groups. It follows from U-statistics theory that an unbiased estimator of the within-group functional $\theta(F_g, F_g)$ is a generalized U-statistic [19], with kernel $\varphi(\cdot, \cdot)$, defined as
$$U^{(g)}_{n_g} = \binom{n_g}{2}^{-1} \sum_{1 \le i < j \le n_g} \varphi\big(X^{(g)}_i, X^{(g)}_j\big), \qquad (3)$$
where $g \in \{1, 2, 3\}$. Analogously, the unbiased estimator of the between-group functional $\theta(F_g, F_{g'})$ is defined by
$$U^{(g,g')}_{n_g n_{g'}} = \frac{1}{n_g n_{g'}} \sum_{i=1}^{n_g} \sum_{j=1}^{n_{g'}} \varphi\big(X^{(g)}_i, X^{(g')}_j\big), \qquad (4)$$
where $g, g' \in \{1, 2, 3\}$ and $g \ne g'$.
The combined sample U-statistic $U_n$ is usually decomposed into a between-group term and a within-group term,
$$U_n = B_n + W_n. \qquad (5)$$
Decomposition (5) leads to the statistic $B_n$, which provides the focal point of our methodology and is defined in (6) as a weighted contrast between the between-group and within-group U-statistics. Here $U^{(g)}_{n_g}$, for $g \in \{1, 2, 3\}$, are the U-statistics associated with within-group dissimilarities, as defined in (3), and $U^{(g,g')}_{n_g n_{g'}}$, $g \ne g' \in \{1, 2, 3\}$, are the U-statistics associated with between-group dissimilarities, as defined in (4). Note that the definition of $U^{(g)}_{n_g}$ requires a minimum of 2 elements in the group. This imposes minimum group sizes $n_g \ge 2$, for $g \in \{1, 2, 3\}$, for the proper definition of $B_n$.
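To make the estimators concrete, the within-group statistic (3) and between-group statistic (4) can be computed directly from their definitions. The sketch below is illustrative: the squared Euclidean kernel is one common choice of $\varphi(\cdot,\cdot)$, not the only one the framework admits.

```python
import numpy as np
from itertools import combinations

def phi(x, y):
    # symmetric kernel of order 2; squared Euclidean distance is one common choice
    return float(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def u_within(X):
    # U-statistic (3): average kernel value over all unordered pairs in one group;
    # requires at least two elements in the group
    n = len(X)
    if n < 2:
        raise ValueError("within-group U-statistic needs n_g >= 2")
    return sum(phi(X[i], X[j]) for i, j in combinations(range(n), 2)) / (n * (n - 1) / 2)

def u_between(X, Y):
    # U-statistic (4): average kernel value over all cross-group pairs
    return sum(phi(x, y) for x in X for y in Y) / (len(X) * len(Y))
```

For two tiny one-dimensional groups, `u_within([[0.], [2.]])` averages a single pair, while `u_between([[0.], [2.]], [[1.], [3.]])` averages the four cross pairs.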
For inference purposes, the $B_n$ statistic will be employed here to assess whether the data are homogeneous or whether at least one group is statistically separate. Thus the null hypothesis $H_0$ states that $F_1 = F_2 = F_3$, while the alternative $H_1$ states that there are $i \ne j \in \{1, 2, 3\}$ such that $F_i \ne F_j$. In cases where groups $G_1$, $G_2$ and $G_3$ have at least two elements, the asymptotic properties of $B_n$ are addressed in [16]. The statistic $B_n$ is in a class of degenerate U-statistics for which asymptotic normality prevails, with convergence rates in $L$ and/or $\sqrt{n}$. Additionally, under the null we have $E(B_n) = 0$, and under the alternative $E(B_n) > 0$. The null hypothesis is rejected for large values of the standardized $B_n$, where the variance of $B_n$ under $H_0$ is obtained by a resampling procedure [18].
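The resampling step can be sketched as follows. Note this is a simplified illustration: in place of the weighted definition (6), whose exact coefficients are not reproduced in this excerpt, we use the unweighted contrast between average between-group and within-group dissimilarities, and estimate its null standard deviation by randomly reassigning the pooled observations to groups.

```python
import numpy as np
from itertools import combinations

def phi(x, y):
    # symmetric kernel: squared Euclidean distance (one common choice)
    return float(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def u_within(X):
    # within-group U-statistic, as in (3)
    n = len(X)
    return sum(phi(X[i], X[j]) for i, j in combinations(range(n), 2)) / (n * (n - 1) / 2)

def u_between(X, Y):
    # between-group U-statistic, as in (4)
    return sum(phi(x, y) for x in X for y in Y) / (len(X) * len(Y))

def bn_contrast(groups):
    # simplified B_n-style statistic: average between-group dissimilarity minus
    # average within-group dissimilarity (the paper's B_n uses size-dependent weights)
    pairs = list(combinations(range(len(groups)), 2))
    between = [u_between(groups[i], groups[j]) for i, j in pairs]
    within = [u_within(g) for g in groups if len(g) >= 2]
    return float(np.mean(between) - np.mean(within))

def resample_sd(pooled, sizes, n_rep=200, seed=0):
    # resampling estimate of the null standard deviation of the contrast:
    # randomly reassign the pooled observations to groups of the given sizes
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_rep):
        idx = rng.permutation(len(pooled))
        groups, start = [], 0
        for s in sizes:
            groups.append([pooled[k] for k in idx[start:start + s]])
            start += s
        vals.append(bn_contrast(groups))
    return float(np.std(vals))
```

The standardized statistic is then `bn_contrast(groups) / resample_sd(pooled, sizes)`, with large positive values suggesting heterogeneity.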

The extension of the test U-statistic for three groups
The group size restriction in the definition of the U-statistic $B_n$ in (6) constrains this method to cases where all subgroups have sizes $n_i \ge 2$, $i = 1, 2, 3$; consequently, clustering methods will fail when the data have an outlier. To build a clustering algorithm that admits groups of size 1, extensions of the $B_n$ statistic are required.
As a first step towards the more general case, we consider here the scenario in which only one group is allowed to have size 1. We can assume, without loss of generality, that only group $G_1$ may have one element, and define the extended statistic $B_n$ in (7), where $U^{(g,g')}_{n_g n_{g'}}$ and $U^{(g)}_{n_g}$ are defined, respectively, in (4) and (3). This is a natural extension of $B_n$ which coincides with that of expression (6) for group sizes $n_1, n_2, n_3 > 1$, and thus all properties mentioned above remain valid for the new definition in that case. We ascertain the validity of these asymptotic properties, or analogous alternatives, in the case $n_1 = 1$.
Note that, when $G_1$ has size one, we can rewrite $B_n$ in terms of the between-group statistics $U^{(1,g)}_{1 n_g}$ and the within-group statistics $U^{(g)}_{n_g}$, $g = 2, 3$, as defined in (4) and (3). If we consider the extension of $B_n$ in (7), then we can write the combined sample U-statistic as $U_n = B_n + W^*_n$, where $W^*_n$ is an appropriate modification of the term $W_n$. Thus $B_n$ still arises from the decomposition of the combined sample U-statistic into $B_n$ and a modified term $W_n$. This extended definition allows us to build a U-test when a group has size 1. We still have $E[B_n] = 0$ under the null hypothesis of overall group homogeneity. Additionally, if we assume that the inequality in (2) holds strictly for $g \ne g' \in \{1, 2, 3\}$ whenever $F_g \ne F_{g'}$ (assumption (8)), then under the alternative we have $E[B_n] > 0$. Note that this assumption is usual, and when (8) is valid, Equation (2) is always satisfied. Asymptotic theory for the $B_n$ statistic with group sizes of at least 2 is developed in [16], where it is established that $B_n$ is a degenerate U-statistic and asymptotic normality is provided. The following theorems show that the extended $B_n$ is a non-degenerate U-statistic and establish its asymptotic distribution under $H_0$ for increasing dimension $L$ and sample size $n$, requiring regularity conditions akin to those of the $n_1, n_2, n_3 > 1$ case.
Theorem 2.1: Consider definition (7) for $B_n$ when $n_1 = 1$, and let $J_2$, $J_3$ and $J_4$ be random variables with normal distributions. Then, as $n \to \infty$, $B_n$ converges to a non-degenerate random variable expressed in terms of $J_2$, $J_3$ and $J_4$. Proof: See supplementary material.
This first result shows that $B_n$ converges asymptotically in $n$ to a non-degenerate random variable whose limit distribution depends on the choice of kernel $\varphi(\cdot, \cdot)$. It is stated here mainly for the sake of a complete presentation.
Theorem 2.2: Let $B_n$ be defined by (7) for the case where $n_1 = 1$, and assume that all conditions in Theorem 2.1 hold. Suppose also that suitable regularity conditions on the kernel hold. Then, for fixed $n$, the normalized $B_n$ converges in distribution to a standard normal as the dimension $L \to \infty$. Proof: See supplementary material.
This second result shows that, for fixed sample size $n$, the normalized $B_n$ converges to a standard normal distribution as the dimension $L$ of the data increases. It provides the basis of our inference procedure for clustering in the HDLSS context.

Variance of $B_n$
To build a hypothesis test based on $B_n$, we must first estimate its variance under $H_0$. However, as shown below, even under $H_0$, the variance of $B_n$ depends on the particular group configuration under consideration. For the homogeneity test of Section 3, we must evaluate this variance for the many group configurations visited by an optimization algorithm. This variance estimation is performed through a resampling procedure; however, it becomes computationally expensive to perform one resampling procedure for each individual group size configuration. To circumvent this issue, [15] propose a reweighting scheme that takes advantage of analytic calculations of the variance for the case of K = 2 groups, thereby computing all variances from a single resampling procedure. In this section, we extend their argument to the case of K = 3 groups.
In this section, we provide an estimator for the variance of $B_n$ under $H_0$ based on U-statistics properties. When all groups have at least two elements, the Hoeffding decomposition of $B_n$ can be found in [16]; it is expressed in terms of $\psi_2(\cdot, \cdot)$, the second-order term of the decomposition. Thereby, $\mathrm{Var}(B_n) = C_n(n_1, n_2, n_3)\,\tau_2^2$, where $\tau_2^2 = \mathrm{Var}(\psi_2(X_1, X_2))$ and $C_n(\cdot, \cdot)$ is a combinatorial coefficient depending only on the group sizes, whose explicit form for three groups $G_1$, $G_2$ and $G_3$ of sizes $n_1$, $n_2$ and $n_3$, with $n_1 + n_2 + n_3 = n$, is given in (18) and follows from [16] by summing the second-order projections over all pairs $1 \le i < j \le n$. Note that only $\tau_2^2$ depends on the probability distribution of the data. Given a reference configuration of sizes $n_1$ and $n_2$, the variance of $B_n$ for that configuration, denoted $V_{n_1, n_2}$, is estimated through a resampling procedure. Then we use the relation in expression (19) to obtain the variance for any other configuration for the same data set. From (19), it follows that estimating $V_{n_1, n_2}$ through a single resampling procedure is sufficient to estimate the variance of $B_n$ for any other group configuration. Although the variance of $B_n$ is estimated under $H_0$, we note that the choice of $n_1$ and $n_2$ may be important to reduce the bias of the variance estimator. To understand the behaviour of the function $C_n(\cdot, \cdot)$, we plot (18) assuming that $n_1, n_2, n_3 \ge 2$. As $\tau_2^2$ does not depend on group sizes, the behaviour of $C_n(\cdot, \cdot)$ governs the behaviour of the variance of $B_n$, and Figure 1 shows that smaller values are obtained when groups have balanced sizes, while larger values of $C_n(\cdot, \cdot)$ are obtained when group sizes are unbalanced.

Variance of the extended $B_n$
We now consider the issue of estimating the variance of the extended $B_n$ when $G_1$ has only one element. Through the Hoeffding decomposition of (7), the variance of the extended $B_n$ involves both $\tau_1^2 = \mathrm{Var}(\psi_1(X_1))$ and $\tau_2^2 = \mathrm{Var}(\psi_2(X_1, X_2))$, which are, respectively, the variances of the first- and second-order terms of the Hoeffding decomposition. Again, the choice of $n_2$ may affect the variance of the estimator. Denoting (21) by $V_{n_2}$ and (23) by $V_{n_2^*}$, simple algebra relates the two quantities. For a given $n_2$, we can estimate $V_{n_2}$ from a resampling procedure. Additionally, an estimate for $\tau_2^2$ can be obtained from the strategy employed to estimate the variance of $B_n$ without an outlier through expression (19). Thus we have a procedure to estimate the variance of the extended $B_n$ for any group configuration from only two independent resampling procedures, through expression (26), where $\tau_2^2$ is obtained from the resampling employed to estimate the variance of $B_n$ without an outlier and $V_{n_2}$ is obtained from an additional resampling specific to the $n_1 = 1$ case.
Figure 2 shows the behaviour of $\zeta^2(n, n_2)$ as a function of $n_2$. These results are fundamental for the computational feasibility of the methods defined in the following sections.

Homogeneity test for three groups
The assessment of group homogeneity is a great challenge for standard statistics, especially in the HDLSS context. The uclust algorithm effectively assesses overall group homogeneity by verifying whether there exists some significant partition of the data into two groups. Here we propose an extension of the uclust algorithm to data partitions into three groups $G_1$, $G_2$ and $G_3$. Assessing homogeneity through a combinatorial procedure, like the one proposed by [20], in which a utest is applied to each possible partition of the elements into three subgroups, has serious computational restrictions due to the exponential increase in the number of tests that need to be performed. We show (see Section S2 in the Supplementary Material) that the number of possible assignments of all $n$ elements into three subgroups is $\gamma_3(n)$, given in (27), which becomes computationally onerous, especially for large sample size $n$. To address this issue, we proceed similarly to [15], proposing an optimization procedure that assesses group homogeneity by finding the group configuration $G_1$, $G_2$, $G_3$ that maximizes the standardized $B_n$ as the objective function (28). By maximizing the standardized $B_n$, we need apply only one test. If this three-group partition is found significant, then there is at least one subgroup that is significantly different from the others. However, if $H_0$ is not rejected for this partition, then all other three-group partitions will also be non-significant, and the whole data will be considered homogeneous. While only the group configuration with maximum standardized $B_n$ is tested, we have to consider the distribution of the maximum of $B_n$ under $H_0$. Both [15,20] have shown through extensive simulations that, in this context, for HDLSS data it is reasonable and useful to make the simplifying assumption that the $B_n$'s are independent for different group configurations. Then the asymptotic cumulative distribution function of the maximum standardized $B_n$ is given by
$$F_{\max}(x) = \Phi(x)^{n^*}, \qquad (29)$$
where $n^* = \gamma_3(n)$, for $\gamma_3(n)$ defined in (27), and $\Phi(x)^{n^*}$ is the standard normal cumulative distribution function raised to the power $n^*$. When $F_{\max}(x) > 1 - \alpha$, we reject the null hypothesis of overall group homogeneity at significance level $\alpha$.
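To appreciate the size of this search space, one can enumerate the partitions by brute force. Assuming $\gamma_3(n)$ counts the ways to split $n$ labelled items into three non-empty unlabelled groups, it equals the Stirling number of the second kind $S(n, 3) = (3^n - 3 \cdot 2^n + 3)/6$; the exact expression (27) is not reproduced in this excerpt, so this identification is an assumption.

```python
from itertools import product

def stirling_s2_k3(n):
    # S(n, 3): partitions of n labelled items into 3 non-empty unlabelled blocks
    return (3 ** n - 3 * 2 ** n + 3) // 6

def all_three_partitions(n):
    # brute-force enumeration; groups are unlabelled, so label vectors are
    # canonicalised by order of first appearance to avoid counting relabelings
    seen = set()
    for labels in product(range(3), repeat=n):
        if len(set(labels)) != 3:
            continue  # all three groups must be non-empty
        relabel, canon = {}, []
        for l in labels:
            relabel.setdefault(l, len(relabel))
            canon.append(relabel[l])
        seen.add(tuple(canon))
    return sorted(seen)
```

Already for n = 20 this count is S(20, 3) = 580,606,446, which illustrates why testing every partition is infeasible.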
The number of tests increases rapidly, even for moderate sample sizes, due to the combinatorial nature of our approach. The maximum distribution in (29) adequately accounts for multiple testing for reasonably small values of $n^*$. However, since $n^*$ increases rapidly, this approach has some shortcomings. Proceeding similarly to [17], we use extreme value theory and model the maximum distribution as a Gumbel distribution for larger $n^*$. As a result, for small $n$ we employ the standard maximum distribution of (29), and when $n^* \ge 2^{28}$ we employ the Gumbel distribution.
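Given the maximum standardized $B_n$ and the number of configurations $n^*$, the p-value implied by (29) can be sketched as below; the log-scale form avoids underflow of $\Phi(x)^{n^*}$ for large $n^*$. The Gumbel branch used for $n^* \ge 2^{28}$ is omitted from this sketch.

```python
import math

def max_pvalue(max_std_bn, n_star):
    # F_max(x) = Phi(x)**n_star under the independence approximation;
    # p-value = 1 - F_max(max_std_bn), computed as -expm1(n_star * log Phi(x))
    log_phi = math.log(0.5 * (1.0 + math.erf(max_std_bn / math.sqrt(2.0))))
    return -math.expm1(n_star * log_phi)
```

For $n^* = 1$ this reduces to the usual one-sided normal p-value, and for fixed $x$ the p-value grows with $n^*$, reflecting the multiple-testing correction.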

The clustering method uclust3
Our homogeneity test from Section 3 is a method that finds the configuration of three subgroups maximizing the standardized $B_n$. This is appropriate in that context, since if the homogeneity test accepts the null for this partition, it would also accept it for all other partitions. However, the standardized $B_n$ is not the best criterion for choosing between competing partitions when more than one significant group separation exists. This issue is addressed in [17] and arises from the fact that the variance of $B_n$ has different magnitudes depending on the subgroup sizes $n_1$ and $n_2$ (expression (18) dictates the relationship between variances, which is shown in Figure 1). Furthermore, the magnitude of the variance is markedly smaller when we have a group of size 1. Consequently, maximizing the standardized $B_n$ favours partitions whose group-size configurations have smaller variance, namely balanced configurations and those containing a size-one group. To overcome this issue, we proceed by first testing overall group homogeneity, which is based on maximizing the standardized $B_n$. If the dataset is not homogeneous, we adopt instead the maximum $B_n$ as the criterion for finding the configuration that best divides the sample into three groups. Thus our significance clustering algorithm uclust3 will find the partition with maximum $B_n$ among the universe of all significant partitions into three groups. This is sufficient to ensure that the chosen configuration is statistically significant.
While the uclust3 criterion is conceptually simple, we must take note of a few issues to devise an efficient algorithm. First, it is not efficient to find all arrangements of the data into three groups that are statistically significant. Furthermore, we cannot simply test the clusters that maximize $B_n$, since there are non-homogeneous samples for which this maximal partition is not significant.
Based on these features, we propose a restricted search algorithm based on the form of the variance of $B_n$. For a non-homogeneous sample (as assessed by the homogeneity test), the algorithm starts from the group configuration that maximizes $B_n$. If that partition is not significant, it then searches for the maximal $B_n$ among only those partitions whose variances are smaller than the previous one. This is suitable since only for smaller variances might the new standardized $B_n$ be significant. As there is a difference in variance magnitudes, the algorithm treats separately cases in which there is a group of size one and cases with no outlier. The full algorithm can be found in Section S3 of the supplementary material.

Simulation studies
In this section, we present simulations that evaluate our proposed methodology. Note that the variance of $B_n$ depends on the group sizes and is particularly different when one of the groups has size 1. For this reason, our simulations typically include a configuration in which one group has size 1 and a configuration in which all groups have more than one element.
Furthermore, we have seen that the variance of $B_n$ is smaller at central group configurations, where the three groups have approximately the same number of elements. Conversely, the variance is greater for extreme configurations in which the groups are unbalanced. These scenarios are explored in our simulation studies.

Simulations for the utest
We present here a simulation study to evaluate the empirical performance of the utest for three groups. We simulate data from independent normal distributions, with the separation between group means controlled by a parameter $m^2$. The empirical results corroborate the theoretical properties: as $L$ increases, the rejection ratio also increases, and as the groups become more separated, the power increases. When there is no separation, $m^2 = 0$, the rejection ratio is close to the significance level $\alpha$, suggesting control of the type I error. Similar results are found for cases where all groups have more than one element (see Figure S1 in the Supplementary Material).
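A minimal version of this experiment can be sketched as follows, with the caveat that it replaces the full standardized $B_n$ test by the simplified unweighted contrast, and that the particular mean vectors (constant per-coordinate shifts of size m) and the squared Euclidean kernel are illustrative assumptions, not the paper's exact design.

```python
import numpy as np
from itertools import combinations

def sq_dist(x, y):
    return float(np.sum((x - y) ** 2))

def contrast(groups):
    # simplified B_n-style statistic: mean over all between-group pairs minus
    # mean over all within-group pairs of the squared Euclidean dissimilarity
    between = [sq_dist(a, b)
               for gi, gj in combinations(range(len(groups)), 2)
               for a in groups[gi] for b in groups[gj]]
    within = [sq_dist(a, b)
              for g in groups for a, b in combinations(list(g), 2)]
    return float(np.mean(between) - np.mean(within))

def simulate_contrast(m, L=100, n_per_group=5, seed=0):
    # three groups of i.i.d. N(mu_g, I_L) vectors; group means are the constant
    # vectors 0, m and 2m, so m controls the separation between the groups
    rng = np.random.default_rng(seed)
    mus = [np.zeros(L), np.full(L, m), np.full(L, 2.0 * m)]
    groups = [[mu + rng.standard_normal(L) for _ in range(n_per_group)]
              for mu in mus]
    return contrast(groups)
```

With no separation the contrast fluctuates around zero, while a shift of m = 3 in every coordinate drives it far above zero, mirroring the increase in power with separation.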

Simulations for the homogeneity test in uclust3
To evaluate the statistical properties of the homogeneity test in uclust3 considering the maximum distribution (29), with the Gumbel correction when appropriate, we simulate data according to the same distribution as in Section 4.1. For each sample size $n$ in {10, 20, 50}, group $G_1$ has size 1 and group $G_2$ was set to have size $n_2 = 2$ or $n_2 = n/2$; consequently, the third group's size was defined as $n_3 = n - 1 - n_2$. Table 1 shows the proportion of rejections of the null hypothesis at significance level $\alpha = 0.05$, considering two scenarios of group separation $m^2$ and dimension $L$.
We note that even in an extreme group configuration, where group $G_1$ has size 1 and group $G_2$ has size 2, the method presents consistent empirical power to reject the null hypothesis. The empirical power increases as $L$ and/or $n$ and/or the group separation $m^2$ increases, emphasizing the inherent properties of the method. Supplementary Table S1 presents estimates of type I error rates for uclust3. We observe that the method presents adequate control of the type I error for cases where $L \gg n$ (the typical HDLSS scenario). This replicates the pattern found in previous papers and is consistent with asymptotic normality for large $L$. Supplementary Table S2 presents the power of uclust3 for group configurations with all sizes greater than 1. For small sample size $n$, the test had more difficulty in finding the correct clusters. However, for larger $n$ the method showed excellent performance.

Simulations for finding clusters
To evaluate the accuracy of our clustering method, we present simulation studies comparing uclust3 with kmeans clustering, one of the most popular clustering algorithms. The data were simulated under the same distribution scheme as in Section 4.2. The methods were compared in terms of the mean Adjusted Rand Index (ARI), which measures the agreement of clustering results with the simulation scenarios, adjusting for randomness [21]. An ARI of one indicates perfect matching. No inference is used in this analysis. This is an appropriate comparison, as both methods are set to find exactly three groups.
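The agreement measure itself is simple to compute from the contingency table of two labelings; the sketch below implements the standard Hubert-Arabie adjusted Rand index of [21].

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_a, labels_b):
    # ARI = (Index - Expected) / (MaxIndex - Expected), computed from the
    # contingency table of the two labelings; 1 means perfect agreement
    n = len(labels_a)
    contingency = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_cells - expected) / (max_index - expected)
```

Note that the ARI is invariant to relabeling the clusters, so `[0, 0, 1, 1]` and `[1, 1, 0, 0]` score 1, which is the relevant behaviour when comparing a clustering output against the true simulated groups.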
Table 2 reports the results. Note that uclust3 and kmeans have comparable performances. However, for larger sample sizes, as the clusters become better defined, with greater separation between the means, uclust3 outperforms kmeans. Table S3 shows that, for the case where $G_1$ has size 1, kmeans tends to perform slightly better for smaller sample sizes.

Adjusted Rand index (ARI)
We conducted a comparison of our clustering method, uclust3, with two hierarchical methods: uhclust from [17] and shc from the sigclust package available in [22], using the adjusted Rand index (ARI). The data used for the comparison were simulated under the same scheme as above. The uclust3 method (dark grey) slightly outperforms uhclust (light grey) in scenarios with small sample sizes. In all scenarios, the shc method has a harder time finding the appropriate clusters.

True cluster proportion (TCP)
We complement the simulation study by considering the proportion of times that the algorithms found a significant separation and the correct groups, which we call the true cluster proportion (TCP). Figure 5 presents adjusted TCP curves for different levels of group separation ($m^2$) on the x axis.
Again, the uclust3 method (dark grey) outperforms uhclust (light grey), particularly in scenarios with n = 10 and 20, being better at finding the correct groups under less separation. In all scenarios, shc requires larger group separation to find the true clusters.

True cluster proportion (TCP) in the presence of an outlier
We compare here our uclust3 with uhclust and shc in terms of their ability to correctly find statistically significant groups under a different data configuration. The data were simulated under the same scheme as above, with group $G_1$ of size 1, group $G_2$ of size $n_2 = n/3$ and group $G_3$ of size $n_3 = n - n_2 - 1$. For all three methods, the same significance level $\alpha = 0.05$ was considered.
Figure 6 reports curves of the proportion of times that the algorithms found a significant separation and the correct groups (TCP). In all scenarios, shc was not able to find the correct groups, with a proportion of correct answers equal to 0, and for this reason it was excluded from the analysis.
The uclust3 method (dark grey) outperforms uhclust (light grey) in all scenarios, being better at finding the correct groups under less separation. Note that the difference in performance between the methods is larger in this setting with outliers than in the previous simulation, particularly as n increases. In Section S5, we explore clustering methods for correlated data. In that setting, the uclust3 and uhclust methods perform similarly, while shc again fails to find the correct configuration.

Application
We consider a simple example of image recognition to illustrate the applicability of our methodology. The data consist of images of three public figures (Tony Blair, Colin Powell and George W. Bush) selected from the Labeled Faces in the Wild (LFW) dataset [23]. The data were run through OpenFace's convolutional neural network [24], a procedure that outputs a 128-dimensional representation of the faces which preserves Euclidean distances. In this illustrative application, we randomly selected 10 images of each public figure from the above-cited dataset and ran uhclust, shc and uclust3 with significance level α = 0.05. Figure 7 presents the hierarchical clustering dendrogram annotated with p-values for all tests performed by the uhclust method. We found four homogeneous groups, with a significant division within the Bush image group and an ARI of 0.8585. Figure 8 presents the dendrogram with the corresponding shc analysis of the same data, which produces six significant clusters, segregating Bush's and Powell's images from the remainder and finding one outlier in Blair's group. The ARI for this case was 0.7788. Applying the uclust3 method, we found exactly the three groups, each corresponding to one of the public figures, with an associated p-value of 2 × 10^{-6} and an ARI of 1.
In Section S6 of the supplementary material, we consider the same dataset and public figures to carry out an analysis with three groups, one of which has size 1. Figures S4 and S5 in the supplementary material present the clustering dendrograms annotated with the results of all tests performed for uhclust and shc. Neither of these methods was able to identify the outlier, and both achieved an ARI of 0.8135593. However, when we applied the uclust3 method, we found the correct groups with a p-value ≤ 10^{-6} and an ARI of 1.
As in most simulated scenarios, our method was superior in the application to the alternative inference clustering methods considered. In particular, compared with the hierarchical alternatives, we have shown that it is better to use our three-way clustering method when external information indicates that three is in fact the correct number of groups.

Discussion
We have developed a U-statistics-based clustering method that separates a dataset specifically into three groups while assessing the significance of this partition. To do so, we first extend the statistic $B_n$ to account for three groups in which one may have only one element (an outlier). We then make the necessary theoretical and methodological adjustments to extend the clustering method uclust to the case of K = 3 groups, defining the uclust3 method. These adjustments include asymptotic theory, a resampling algorithm for computing the variance of $B_n$, considerations on the number of tests and an optimization algorithm to find the significant clusters. These developments can be seen as an important step in the extension of the significance clustering framework to an arbitrary number K of groups, with possibly multiple outlier groups. While the general method will build upon the current findings, we note that the generalization is not necessarily straightforward and will be the focus of further work.
Nevertheless, both the simulations and the application have shown that uclust3 is a competitive standalone method in contexts where external information indicates that the data should be divided into three groups. Direct comparison with other clustering methods, such as k-means, shows that it is competitive when considering the clustering task alone. However, few other methods are available to comparatively assess the inference task for HDLSS data. We carried out a simulation study to compare our method with the hierarchical version of uclust (uhclust) and with another hierarchical approach (shc), both of which should be able to find a significant partition into three groups when such a partition exists. We found that in all scenarios tested shc had serious difficulties in finding the proper configuration, while uclust3 performed comparably to or better than uhclust. Furthermore, in the application our method was the only one able to correctly find the three groups of figures, reinforcing the value of developing a method tailored specifically to the correct number of groups K.
This U-statistics-based methodology can be applied to a wide range of problems, since it makes very few assumptions about the dissimilarity measure employed and the distribution of the data. The required asymptotic normality is guaranteed as long as the dissimilarities have finite variance and the sum of all distance covariances does not grow too fast (O(L); see Theorem 2). Note, however, that uclust3 requires large L, since B_n for n_1 = 1 is only asymptotically normal in the dimension L. As verified in previous work [17], for the settings in the simulation studies our tests achieve good Type I error control in practice, having difficulties only when L is smaller than 10n; that is, they perform well precisely in the HDLSS setting. Furthermore, since our methodology is a natural extension of uclust, it inherits many useful properties, such as avoiding the hazards of directly estimating the high-dimensional covariance matrix by obtaining Var(B_n) through resampling.
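To convey the resampling idea in schematic form, the sketch below estimates the variance of a simple between-minus-within mean-distance contrast by permuting group labels while keeping the group sizes fixed. This is a simplified stand-in of our own devising, not the actual B_n statistic or the resampling scheme implemented in uclust:

```python
import random
from math import dist  # Euclidean distance (Python 3.8+)

def contrast_statistic(data, labels):
    """Mean between-group distance minus mean within-group distance."""
    between, within = [], []
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            d = dist(data[i], data[j])
            (within if labels[i] == labels[j] else between).append(d)
    return sum(between) / len(between) - sum(within) / len(within)

def resampling_variance(data, labels, n_resamples=200, seed=1):
    """Estimate the statistic's sampling variance by permuting the group
    labels (group sizes fixed) and taking the empirical variance of the
    statistic over the resamples."""
    rng = random.Random(seed)
    perm = list(labels)
    stats = []
    for _ in range(n_resamples):
        rng.shuffle(perm)
        stats.append(contrast_statistic(data, perm))
    m = sum(stats) / len(stats)
    return sum((s - m) ** 2 for s in stats) / (len(stats) - 1)
```

The key design point, shared with uclust, is that no high-dimensional covariance matrix is ever estimated: only the scalar statistic is recomputed over resampled configurations.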

Figure 3. Power curves for the utest with a group of size 1. Data consist of i.i.d. samples divided into three groups G_1, G_2 and G_3. The elements of the L-dimensional vectors in G_1 are generated from i.i.d. normal distributions with mean m_1 = 0 and standard deviation equal to 1. The vectors in G_2 and G_3 have the same properties, with means m_2 and m_3, respectively. For simplicity we take m_3 = 2m_2. The sample size n takes values in {10, 20, 50}.

Figure 3 presents power curves of the utest for three groups with separation degree m_2, where the vectors have dimension L = 1000 (grey) and L = 2000 (black), with 100 replications of each scenario. Group G_1 has size 1 and group G_2 has size n_2 = ⌊n/3⌋, where ⌊x⌋ denotes the integer part of x. Naturally, the third group's size is n_3 = n − 1 − n_2. The significance level used to determine whether the test rejects the null hypothesis that the elements in G_1, G_2 and G_3 have the same distribution was α = 0.05.
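For concreteness, the simulation design just described can be sketched as follows. The function name and the use of Python are our own illustration (the actual experiments rely on the uclust framework in R); the group sizes and means follow the text above:

```python
import random

def simulate_three_groups(n, L, m2, seed=None):
    """Generate n L-dimensional vectors in three groups with i.i.d.
    N(mean, 1) entries: G1 (size 1) has mean 0, G2 (size floor(n/3))
    has mean m2, and G3 (the remaining n - 1 - floor(n/3) vectors)
    has mean m3 = 2 * m2."""
    rng = random.Random(seed)
    n2 = n // 3
    n3 = n - 1 - n2
    means = [0.0] + [m2] * n2 + [2 * m2] * n3
    data = [[rng.gauss(m, 1.0) for _ in range(L)] for m in means]
    labels = [0] + [1] * n2 + [2] * n3
    return data, labels
```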

Figure 5. Curves of the proportion of times that the methods found significant correct groups (TCP) for uclust3 (dark grey), uhclust (light grey) and shc (black), for dimensions L = 1000 and 2000.

Figure 6. Curves of the proportion of times that the methods found significant correct groups (TCP) for uclust3 (dark grey) and uhclust (light grey), for dimensions L = 1000 and 2000.

Figure 7. Annotated dendrogram of the significance analysis for hierarchical clustering uhclust for 30 pictures of 3 public figures. P-values and corrected significance levels α* are shown for each test performed at the corresponding node.

Figure 8. Annotated dendrogram of the significance analysis for hierarchical clustering shc for 30 pictures of 3 public figures. P-values and corrected significance levels α* are shown for each test performed at the corresponding node.
to estimate B_n's variance for all other group configurations from the original variance estimate. Let G_1^*, G_2^* and G_3^*, with sizes n_1^*, n_2^* and n_3^*, respectively, where n_1^* + n_2^* + n_3^* = n, be another group configuration, with n_3 = n − n_2 − 1 as in expression (21) (see Supplementary Material). Note that in expression (21) the terms τ_1^2 and τ_2^2 depend on the probability distribution of the data, ζ_1(·) depends only on n, and ζ_2(·, ·) depends on n and n_2, since n_3 = n − n_2 − 1. Thus, for another group configuration keeping G_1 with size one, the only change occurs at n_2, say n_2^*. For this new group configuration, the variance of the extended B_n is given by Var

Table 1. Empirical power of the homogeneity test uclust3 with a group of size 1.

Table 2. Comparison of mean ARI and standard deviation (Sd) of clustering accuracy for the k-means and uclust3 methods.