Prior-Free Rare Category Detection

Rare category detection is an open challenge in machine learning. It plays a central role in applications such as detecting new financial fraud patterns, detecting new network malware, and scientific discovery. In such cases, rare categories are hidden among huge volumes of normal data and observations. In this paper, we propose a new method for rare category detection named SEDER, which requires no prior information about the data set. It implicitly performs semiparametric density estimation using specially designed exponential families, and then picks the examples for labeling where the neighborhood density changes the most. SEDER can work in cases where the data is not separable. Its unique feature compared with existing methods lies in its prior-free nature, i.e. it does not require any prior information about the data set (e.g. the number of classes, the proportion of the different classes, etc.). Therefore, it is more suitable for real applications. Experimental results on both synthetic and real data sets demonstrate the superiority of SEDER.


Introduction.
Classical supervised learning methods require labeled examples representing each class, from which classifiers may be induced to predict class membership for unlabeled data. Whereas classifier induction has been well studied over the years, both for the balanced case [13] and the unbalanced case [18] [16] [12], very few methods have been proposed to discover classes in an unlabeled data set by proposing initial candidate examples of each class to a labeling oracle [14] [8] [10] [11]. Active learning [9] [6] focuses on the related problem of finding maximally discriminative examples to label, once each class has been discovered. If the data set is well balanced, we may use random sampling to find all the classes. On the other hand, if the data set is skewed, i.e. some classes dominate the data set (the majority classes) and the other classes rarely occur (the minority/rare classes), random sampling can prove extremely inefficient at discovering all the classes, especially the rare ones. It is often the case that these rare classes are of key importance; therefore, we need more sophisticated methods for rare category detection.
* Carnegie Mellon University.
Rare category detection has a wealth of applications. For example, in financial fraud detection, the vast majority of financial transactions are legitimate, but a small number may be fraudulent; detecting early instances of the fraud patterns is a major first step towards systematically finding and stopping such illicit activity [3]. Another example is network intrusion detection. Systematically finding the early onset of new malicious network activities among huge volumes of routine network traffic is a critical unmet challenge [19]. Similarly, in astronomy, most of the objects in sky survey images are explainable by current theories and models, and only a tiny fraction of the objects may lead to new discoveries [14]. Rare category detection is also a bottleneck in reducing the sample complexity of active learning [2] [5].
Despite its importance, only a few methods have been proposed so far to address the rare category detection challenge in a general setting. For example, the method based on mixture models proposed in [14] is among the first attempts in this direction; in [8], the authors proposed a generic consistency algorithm, and proved upper bounds and lower bounds for this algorithm in some specific situations. Both methods require that the support regions of the different classes be separable or near-separable to work well. The former also needs to be given the number of classes in the data set in order to train a reasonable mixture model [14]. More recently, in [10], the authors proposed the NNDM algorithm for rare category detection, which is essentially a local-density-differential-sampling strategy. Unlike the above two methods, NNDM does not depend on the separability assumption. In [11], the authors generalize the theoretical results for the binary case in [10] to the case of multiple rare classes. However, NNDM needs to be given the number of classes as well as the proportion of the different classes in the data set, which is unrealistic in many real applications.
In this paper, we focus on the more challenging case where we do not have any prior information about the data set. The proposed method, SEmiparametric Density Estimation based Rare category detection (SEDER), implicitly performs semiparametric density estimation using specially designed exponential families, and selects the examples with the largest norm of the gradients for labeling by the oracle. In this way, it focuses on the areas with the maximum change in the local density. Unlike existing methods, SEDER does not require any prior information about the data set. Therefore, it is more suitable for real applications.
The rest of the paper is organized as follows. In Section 2, we introduce the specially designed exponential families used in SEDER, and derive the scoring function. The complete algorithm of SEDER is presented in Section 3. In Section 4, we compare SEDER with state-of-the-art techniques on both synthetic and real data sets. Finally, we conclude the paper in Section 5.

Semiparametric Density Estimation for Rare Category Detection.
In rare category detection, we make the following assumptions: 1) the distribution of the majority classes is sufficiently smooth; and 2) the minority classes form compact clusters in the feature space. An example of the underlying distribution where these assumptions are satisfied is shown in Figure 1. Note that these assumptions are much more realistic than the separable/near-separable assumption made in [8] [14]. Based on our assumptions, abrupt changes in local density indicate the presence of rare classes. By sampling in these areas, we have a high probability of finding examples from the rare classes. Following this line of reasoning, our proposed method SEDER implicitly estimates the density using specially designed exponential families, which essentially define a semiparametric model. At each data point, we set the score to be the norm of the gradient of the estimated density, which measures the maximum change rate of the local density, and pick the examples with the largest scores to be labeled by the oracle. The intuition behind SEDER and NNDM [10] is similar: both pick the examples with the maximum change in the local density. The two methods nevertheless differ in important ways. NNDM is a nearest-neighbor-based method: it depends on the proportion of the different classes to set the size of the neighborhood, and its scores only roughly indicate the change in the local density. SEDER, in contrast, is based on semiparametric density estimation: it is prior-free, i.e. it does not require any prior information about the data set, and its scores measure exactly the maximum change rate in the local density.
In this section, we first define some notations in subsection 2.1, and then introduce the specially designed exponential families in subsection 2.2. Finally we present the scoring function in subsection 2.3.

Notation.
In rare category detection, we are given a set of unlabeled examples $S = \{x_1, \ldots, x_n\}$, $x_i \in \mathbb{R}^d$, which come from $m$ distinct classes, i.e. $y_i \in \{1, \ldots, m\}$, $\forall i \in \{1, \ldots, n\}$. Without loss of generality, assume that $\sum_{i=1}^{n} x_i = 0$ and $\frac{1}{n}\sum_{i=1}^{n} x_i^2 = 1$. The proportion of some classes is much smaller than that of the other classes; these are the so-called rare classes. Our goal is to request as few labels as possible in total in order to find at least one example from each class, especially the rare classes, which are of particular interest to us. Table 1 summarizes the notation used in this paper.

Table 1: Notation.
Symbol | Definition
$x_i$ | The $i$th unlabeled example
$d$ | The dimensionality of the feature space
$y_i$ | The class label of $x_i$
$g_\beta(x)$ | The density defined by specially designed exponential families
$g_0(x)$ | The carrier density
$\beta_0$ | The normalizing parameter in $g_\beta(x)$
$t(x)$ | The $p \times 1$ vector of sufficient statistics
$t^j(x)$ | The $j$th component of $t(x)$
$\beta_1$ | The $p \times 1$ parameter vector
$\beta_1^j$ | The $j$th component of $\beta_1$
$\sigma^j$ | The bandwidth for the $j$th feature
$\beta$ | $(\beta_1, \beta_0)$
$\hat{\beta}$ | The maximum likelihood estimate of $\beta$
$l(\beta)$ | The log-likelihood of the data
$g_\beta^j(x^j)$ | The marginal distribution of the $j$th feature based on $g_\beta(x)$
$g^j(x^j)$ | The true marginal distribution of the $j$th feature
$b^j$ | Positive parameter which is a function of $\beta_1^j$
$s_i$ | The score of $x_i$
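The normalization assumed in this subsection can be applied as a preprocessing step. The following is a minimal sketch, assuming the constraint is meant feature-wise (each feature has zero mean and unit mean square); the function name `standardize` and the toy data are illustrative and not part of the original method description.

```python
import numpy as np

def standardize(X):
    """Center each feature and scale it to unit mean square.

    X: (n, d) array of examples. Assumption: no feature is constant.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                     # sum_i x_i = 0 per feature
    scale = np.sqrt((Xc ** 2).mean(axis=0))     # root mean square per feature
    return Xc / scale

# Toy data with arbitrary location and scale (illustrative only)
rng = np.random.default_rng(0)
X = standardize(rng.normal(loc=3.0, scale=2.0, size=(500, 2)))
```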

Specially Designed Exponential Families.
Traditional density estimation methods fall into two categories [7]: fitting a parametric model via maximum likelihood, or nonparametric methods such as kernel density estimation. For the purpose of rare category detection, parametric models are not appropriate since we cannot assume a specific form of the underlying distribution for a given data set. On the other hand, the estimated density based on nonparametric methods tends to be under-smoothed, and the examples from rare classes will be buried among numerous spikes in the estimated density. As proposed in [7], these two kinds of methods can be combined by putting an exponential family through a kernel density estimator, which yields the so-called specially designed exponential families. This is a favorable compromise between parametric and nonparametric density estimation: the nonparametric smoother allows local adaptation to the data, while the exponential term matches some of the data's global properties and makes the density much smoother [7]. To be specific, the estimated density is
$$g_\beta(x) = g_0(x) \exp\left(\beta_0 + \beta_1^T t(x)\right),$$
where $x \in \mathbb{R}^d$, $g_0(x)$ is the carrier density, $t(x)$ is a $p \times 1$ vector of sufficient statistics, $\beta_1$ is a $p \times 1$ parameter vector, and $\beta_0$ is a normalizing parameter that makes $g_\beta(x)$ integrate to 1. In our application, we use the kernel density estimator with the Gaussian kernel as the carrier density, i.e.
$$g_0(x) = \frac{1}{n} \sum_{i=1}^{n} \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi}\,\sigma^j} \exp\left(-\frac{(x^j - x_i^j)^2}{2(\sigma^j)^2}\right),$$
where $x^j$ is the $j$th feature of $x$, $x_i^j$ is the $j$th feature of the $i$th data point, and $\sigma^j$ is the bandwidth for the $j$th feature. In SEDER, $\sigma^j$ is determined by cross validation [15] on the $j$th feature. The parameters $\beta = (\beta_1, \beta_0)$ can be estimated according to the following theorem.
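Theorem 2.1. The maximum likelihood estimate $\hat{\beta}$ of $\beta$ satisfies, $\forall j \in \{1, \ldots, p\}$,
$$E_{\hat{\beta}}\left[t^j(x)\right] = \frac{1}{n} \sum_{i=1}^{n} t^j(x_i),$$
where $E_\beta[\cdot]$ denotes the expectation with respect to $g_\beta(x)$.
Proof. Firstly, since $\beta_0$ is the normalizing parameter that makes $g_\beta(x)$ integrate to 1, we have $\beta_0 = -\log \int g_0(x) \exp\left(\beta_1^T t(x)\right) dx$, and therefore
$$\frac{\partial \beta_0}{\partial \beta_1^j} = -\frac{\int g_0(x) \exp\left(\beta_1^T t(x)\right) t^j(x)\, dx}{\int g_0(x) \exp\left(\beta_1^T t(x)\right) dx} = -E_\beta\left[t^j(x)\right],$$
where $\beta_1^j$ is the $j$th component of the vector $\beta_1$. Secondly, the log-likelihood of the data is
$$l(\beta) = \sum_{i=1}^{n} \log g_\beta(x_i) = \sum_{i=1}^{n} \log g_0(x_i) + n\beta_0 + \sum_{i=1}^{n} \beta_1^T t(x_i).$$
Taking the partial derivative of $l(\beta)$ with respect to $\beta_1^j$, we have
$$\frac{\partial l(\beta)}{\partial \beta_1^j} = n \frac{\partial \beta_0}{\partial \beta_1^j} + \sum_{i=1}^{n} t^j(x_i) = -n E_\beta\left[t^j(x)\right] + \sum_{i=1}^{n} t^j(x_i).$$
Setting the partial derivative to 0, we have that the maximum likelihood estimate $\hat{\beta}$ of $\beta$ satisfies $E_{\hat{\beta}}\left[t^j(x)\right] = \frac{1}{n} \sum_{i=1}^{n} t^j(x_i)$.
In SEDER, we set the vector of sufficient statistics to be $t(x) = [(x^1)^2, \ldots, (x^d)^2]^T$. If we estimate the parameters according to Theorem 2.1, different parameters will be coupled due to the normalizing parameter $\beta_0$. In order to decouple the estimation of the different $\beta_1^j$'s, we make the following changes. Firstly, we decompose $\beta_0$ into $\beta_0^j$'s such that $\sum_{j=1}^{d} \beta_0^j = \beta_0$; then $g_\beta(x)$ can be seen as a kernel density estimator with a 'special' kernel, i.e.
$$g_\beta(x) = \frac{1}{n} \sum_{i=1}^{n} \prod_{j=1}^{d} \left[\frac{1}{\sqrt{2\pi}\,\sigma^j} \exp\left(-\frac{(x^j - x_i^j)^2}{2(\sigma^j)^2}\right) \exp\left(\beta_0^j + \beta_1^j (x^j)^2\right)\right].$$
Next, we relax the constraint on the $\beta_0^j$'s, and let them depend on $x_i^j$ in such a way that
$$\int \frac{1}{\sqrt{2\pi}\,\sigma^j} \exp\left(-\frac{(x^j - x_i^j)^2}{2(\sigma^j)^2}\right) \exp\left(\beta_{0i}^j + \beta_1^j (x^j)^2\right) dx^j = 1, \quad (2.1)$$
where $\beta_{0i}^j$ implies the dependence of $\beta_0^j$ on $x_i^j$. In this way, the marginal distribution of the $j$th feature is
$$g_\beta^j(x^j) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma^j} \exp\left(-\frac{(x^j - x_i^j)^2}{2(\sigma^j)^2}\right) \exp\left(\beta_{0i}^j + \beta_1^j (x^j)^2\right).$$
To estimate the parameters in our current model, we have the following theorem.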
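Theorem 2.2. The maximum likelihood estimates $\hat{\beta}_1^j$ and $\hat{\beta}_{0i}^j$ of $\beta_1^j$ and $\beta_{0i}^j$ satisfy the following conditions: $\forall j \in \{1, \ldots, d\}$,
$$\sum_{k=1}^{n} \frac{\sum_{i=1}^{n} \exp\left(-\frac{(x_k^j - x_i^j)^2}{2(\sigma^j)^2}\right) \exp\left(\hat{\beta}_{0i}^j\right) E_i^j\left[(x^j)^2\right]}{\sum_{i=1}^{n} \exp\left(-\frac{(x_k^j - x_i^j)^2}{2(\sigma^j)^2}\right) \exp\left(\hat{\beta}_{0i}^j\right)} = \sum_{k=1}^{n} (x_k^j)^2, \quad (2.2)$$
where $E_i^j\left[(x^j)^2\right] = \int (x^j)^2 \frac{1}{\sqrt{2\pi}\,\sigma^j} \exp\left(-\frac{(x^j - x_i^j)^2}{2(\sigma^j)^2}\right) \exp\left(\hat{\beta}_{0i}^j + \hat{\beta}_1^j (x^j)^2\right) dx^j$, and $\hat{\beta}_{0i}^j$ is determined by $\hat{\beta}_1^j$ through Equation (2.1).
Proof. First of all, according to Equation (2.1), $\frac{\partial \beta_{0i}^j}{\partial \beta_1^j} = -E_i^j\left[(x^j)^2\right]$. Then the log-likelihood of the data on the $j$th component is
$$l(\beta_1^j) = \sum_{k=1}^{n} \log g_\beta^j(x_k^j) = \sum_{k=1}^{n} \log\left[\frac{1}{n} \sum_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma^j} \exp\left(-\frac{(x_k^j - x_i^j)^2}{2(\sigma^j)^2}\right) \exp\left(\beta_{0i}^j + \beta_1^j (x_k^j)^2\right)\right].$$
Taking the partial derivative of $l(\beta_1^j)$ with respect to $\beta_1^j$, we have
$$\frac{\partial l(\beta_1^j)}{\partial \beta_1^j} = \sum_{k=1}^{n} \frac{\sum_{i=1}^{n} \exp\left(-\frac{(x_k^j - x_i^j)^2}{2(\sigma^j)^2}\right) \exp\left(\beta_{0i}^j\right) \left[(x_k^j)^2 - E_i^j\left[(x^j)^2\right]\right]}{\sum_{i=1}^{n} \exp\left(-\frac{(x_k^j - x_i^j)^2}{2(\sigma^j)^2}\right) \exp\left(\beta_{0i}^j\right)}.$$
Setting the partial derivative to 0, we have that the maximum likelihood estimates $\hat{\beta}_1^j$ and $\hat{\beta}_{0i}^j$ satisfy Equation (2.2).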
Notice that according to Theorem 2.2, the $\beta_1^j$'s can be estimated separately, which greatly simplifies our problem. At first glance, Equation (2.2) is hard to solve. Next, we let $\beta_1^j = \left(1 - \frac{1}{b^j}\right) \frac{1}{2(\sigma^j)^2}$, where $b^j$ is a positive parameter, the introduction of which will simplify this equation. According to Equation (2.1), $\beta_{0i}^j$ can be expressed in terms of $b^j$, i.e.
$$\beta_{0i}^j = -\frac{1}{2} \log b^j + \frac{(1 - b^j)(x_i^j)^2}{2(\sigma^j)^2}.$$
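The dependence of $\beta_{0i}^j$ on $b^j$ can be checked by completing the square in the exponent. The derivation below is a sketch under the assumption that Equation (2.1) is the normalization condition $\int \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(x - x_i)^2}{2\sigma^2}\right) \exp\left(\beta_{0i} + \beta_1 x^2\right) dx = 1$ for a single feature; the feature index $j$ is dropped for brevity.

```latex
\begin{align*}
-\frac{(x - x_i)^2}{2\sigma^2} + \beta_1 x^2
  &= -\frac{1}{2\sigma^2}\left(\frac{x^2}{b} - 2 x x_i + x_i^2\right)
  && \text{using } \beta_1 = \left(1 - \frac{1}{b}\right)\frac{1}{2\sigma^2} \\
  &= -\frac{(x - b x_i)^2}{2 b \sigma^2} + \frac{(b - 1)\, x_i^2}{2\sigma^2}, \\
\int \frac{1}{\sqrt{2\pi}\,\sigma}
  \exp\left(-\frac{(x - b x_i)^2}{2 b \sigma^2}\right) dx
  &= \sqrt{b}, \\
\Rightarrow \quad \beta_{0i}
  &= -\frac{1}{2}\log b + \frac{(1 - b)\, x_i^2}{2\sigma^2}.
\end{align*}
```

In other words, under this assumption each kernel factor multiplied by the exponential term is itself a Gaussian density in $x$ with mean $b x_i$ and variance $b\sigma^2$.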
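Therefore, the estimated density becomes
$$g_{\hat{\beta}}(x) = \frac{1}{n} \sum_{i=1}^{n} \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi \hat{b}^j}\,\sigma^j} \exp\left(-\frac{(x^j - \hat{b}^j x_i^j)^2}{2 \hat{b}^j (\sigma^j)^2}\right). \quad (2.3)$$
In general, the value of $\hat{\beta}_1^j$ is very close to 0, and $g_{\hat{\beta}}(x)$ is a smoothed version of $g_0(x)$. Therefore, $\hat{b}^j$ should be close to 1, and we can re-write Equation (2.2) as follows:
$$A (\hat{b}^j)^2 + B \hat{b}^j - C = 0,$$
where $A = \frac{1}{n} \sum_{k=1}^{n} \frac{\sum_{i=1}^{n} \exp\left(-\frac{(x_k^j - x_i^j)^2}{2(\sigma^j)^2}\right) (x_i^j)^2}{\sum_{i=1}^{n} \exp\left(-\frac{(x_k^j - x_i^j)^2}{2(\sigma^j)^2}\right)}$, $B = (\sigma^j)^2$, and $C = \frac{1}{n} \sum_{k=1}^{n} (x_k^j)^2$.
This is a second-degree polynomial equation in $\hat{b}^j$, and its roots can be easily obtained by Vieta's theorem. Taking the positive root, since $b^j$ is a positive parameter, we have $\forall j \in \{1, \ldots, d\}$
$$\hat{b}^j = \frac{-B + \sqrt{B^2 + 4AC}}{2A}. \quad (2.4)$$
Theorem 2.3. Let $g^j(x^j)$ be the true density for the $j$th feature. […]
Proof. For the sake of simplicity, let $z = x^j$, $h = \sigma^j$, and $f(z) = g^j(x^j)$. Then $A = \frac{1}{n} \sum_{k=1}^{n} \frac{\sum_{i=1}^{n} \exp\left(-\frac{(z_k - z_i)^2}{2h^2}\right) (z_i)^2}{\sum_{i=1}^{n} \exp\left(-\frac{(z_k - z_i)^2}{2h^2}\right)}$, $B = h^2$, and $C = \frac{1}{n} \sum_{k=1}^{n} (z_k)^2$. Consider the following regression problem, where the true regression function is $r(z) = z^2$, the noise has mean 0, and we use kernel regression to estimate this function. Then $A - C$ is the bias of kernel regression on the training data, i.e. […]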
At the beginning of Section 2, we have made the following assumptions: 1) the distribution of the majority classes is sufficiently smooth; and 2) the minority classes form compact clusters in the feature space. In this case, the first order derivative of the density would be close to 0 for most examples, and have large absolute values for a few examples near the rare classes. Therefore, the condition in Theorem 2.3 is always satisfied, and the exponential term appended to the carrier density decreases away from the origin.
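To make the parameter estimation step concrete, the sketch below computes $\hat{b}^j$ for a single feature as the positive root of a quadratic $A b^2 + B b - C = 0$. It assumes that $A$ is the Gaussian kernel regression (Nadaraya-Watson) estimate of $z^2$ averaged over the training points, $B = (\sigma^j)^2$, and $C$ is the mean of $z^2$, which is our reading of the quantities named in the proof of Theorem 2.3. The function name `estimate_b` and the test data are illustrative.

```python
import numpy as np

def estimate_b(z, h):
    """Positive root of A*b**2 + B*b - C = 0 for one feature.

    z: (n,) values of the feature; h: Gaussian kernel bandwidth (sigma^j).
    """
    z = np.asarray(z, dtype=float)
    diff = z[:, None] - z[None, :]               # diff[k, i] = z_k - z_i
    w = np.exp(-diff ** 2 / (2.0 * h ** 2))      # Gaussian kernel weights
    # A: kernel regression estimate of r(z) = z^2, averaged over training points
    A = np.mean((w @ z ** 2) / w.sum(axis=1))
    B = h ** 2
    C = np.mean(z ** 2)
    return (-B + np.sqrt(B ** 2 + 4.0 * A * C)) / (2.0 * A)

# Illustrative data: one feature, centered and scaled to unit mean square
rng = np.random.default_rng(0)
z = rng.normal(size=400)
z = z - z.mean()
z = z / np.sqrt(np.mean(z ** 2))
b_hat = estimate_b(z, h=0.3)
```

As the bandwidth shrinks toward 0, the kernel weights concentrate on each point itself, so $A$ approaches $C$, $B$ approaches 0, and the root approaches 1. This matches the remark that $b^j = 1$ corresponds to the plain kernel density estimator.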

Scoring Function.
Once we have estimated all the parameters using Equation (2.4), we can measure the change in the local density at each data point based on the estimated density in Equation (2.3). Note that at each data point, if we pick a different direction, the change in local density would be different. In SEDER, we measure the change along the gradient, which gives the maximum change at each data point.
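$\forall x \in \mathbb{R}^d$, let the gradient vector of the estimated density at $x$ be $w \in \mathbb{R}^d$. We have $\forall l \in \{1, \ldots, d\}$
$$w^l = \frac{1}{n} \sum_{i=1}^{n} D_i(x)\, \frac{\hat{b}^l x_i^l - x^l}{\hat{b}^l (\sigma^l)^2}, \qquad D_i(x) = \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi \hat{b}^j}\,\sigma^j} \exp\left(-\frac{(x^j - \hat{b}^j x_i^j)^2}{2 \hat{b}^j (\sigma^j)^2}\right),$$
where $w^l$ is the $l$th component of $w$.
Therefore, the maximum change rate of the density at $x$ is $\|w\|_2 = \sqrt{\sum_{l=1}^{d} (w^l)^2}$. If the distribution of the majority classes is sufficiently smooth, and the minority classes form compact clusters in the feature space, the minority classes are always located in the regions where the density changes the most. Therefore, in SEDER, to discover the rare classes, we set the score of each example to be the maximum change rate of the density at this example, i.e. $\forall k \in \{1, \ldots, n\}$
$$s_k = \sqrt{\sum_{l=1}^{d} \left(\frac{1}{n} \sum_{i=1}^{n} D_i(x_k)\, \frac{\hat{b}^l x_i^l - x_k^l}{\hat{b}^l (\sigma^l)^2}\right)^2}, \quad (2.5)$$
where $s_k$ is the score of $x_k$. We pick the examples with the largest scores for labeling until we find at least one example from each class. However, examples with large scores may lie close to one another and belong to the same class, in which case querying all of them would waste label requests. To address this problem in SEDER, we make use of the following heuristic: if $x_i \in S$ has been labeled, then $\forall x_k \in S$ such that $|x_k^j - x_i^j| \le 3\sigma^j$ for all $j \in \{1, \ldots, d\}$, we preclude $x_k$ from being selected. In other words, if an unlabeled example is very close to a previously labeled one, it is quite likely that the labels of the two examples are the same, and labeling that example will not have a high probability of detecting a new rare class. The size of the neighborhood is set to $3\sigma^j$ so that the estimated density for the examples outside this neighborhood, using the Gaussian kernel, is hardly affected by the labeled example. It should be pointed out that the feedback strategy is orthogonal to the remaining parts of the proposed algorithm. In our experiments, we find that despite its simplicity, the current strategy leads to satisfactory performance.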
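The scoring step can be sketched as follows. The sketch assumes the estimated density has the product form $\frac{1}{n}\sum_i \prod_j \mathcal{N}(x^j;\, b^j x_i^j,\, b^j(\sigma^j)^2)$, i.e. a Gaussian kernel density estimator whose centers are shrunk by $b^j$ and whose variances are scaled by $b^j$. Under that assumption the gradient has a closed form, and the score of an example is its Euclidean norm. The names `seder_scores`, `sigma` and `b`, and the toy data, are illustrative.

```python
import numpy as np

def seder_scores(X, sigma, b):
    """Norm of the gradient of the estimated density at every example.

    X: (n, d) data; sigma: (d,) bandwidths; b: (d,) positive parameters.
    Uses O(n^2 d) memory, so it is meant for small illustrative data sets.
    """
    X = np.asarray(X, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    b = np.asarray(b, dtype=float)
    var = b * sigma ** 2                             # per-feature kernel variance
    diff = X[:, None, :] - b * X[None, :, :]         # diff[k, i, j] = x_k^j - b^j x_i^j
    # comp[k, i]: product over features of the Gaussian factors at x_k
    log_comp = (-diff ** 2 / (2.0 * var) - 0.5 * np.log(2.0 * np.pi * var)).sum(axis=2)
    comp = np.exp(log_comp)
    # grad[k, l] = (1/n) * sum_i comp[k, i] * (b^l x_i^l - x_k^l) / (b^l (sigma^l)^2)
    grad = np.einsum("ki,kil->kl", comp, -diff / var) / X.shape[0]
    return np.linalg.norm(grad, axis=1)

# Toy data: a smooth majority class plus a compact minority cluster
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(2.0, 0.05, size=(10, 2))])
scores = seder_scores(X, sigma=[0.2, 0.2], b=[1.0, 1.0])
top = np.argsort(-scores)[:5]    # indices of the five highest-scoring examples
```

Note that the density itself is never returned: only the per-component weights and the gradient are needed to rank the examples.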
The proposed method, SEDER, is summarized in Algorithm 1. It works as follows. Firstly, we initialize the set I of selected examples and the set L of their labels to empty sets. Then step 2 to step 5 calculate the parameters in our model. Step 6 to step 8 calculate the score for each example in S. Finally, step 9 to step 13 gradually include the example with the maximum score into I and its label into L until we run out of the labeling budget. In each round, the selected example should be far away from all the labeled examples.
Note that: 1) unlike the methods proposed in [10] [14], SEDER does not need to be given the number of classes in S or any other information, hence it is more suitable for real applications; 2) in SEDER, we do not need to explicitly calculate the density at each example; 3) SEDER does not depend on the assumption that different classes be separable or near-separable.

Algorithm 1 SEDER
1: Initialize $I = \emptyset$ and $L = \emptyset$.
2: for j = 1 : d do
3: Calculate the bandwidth $\sigma^j$ using cross validation [15].
4: Calculate the maximum likelihood estimate $\hat{b}^j$ of the parameter $b^j$ according to Equation (2.4).
5: end for
6: for i = 1 : n do
7: Calculate the score $s_i$ of the $i$th example according to Equation (2.5) using the estimated parameters.
8: end for
9: while the labeling budget is not exhausted do
10: Query $x = \arg\max_{x_i \in S} s_i$, taken over the examples that are far away from all the labeled examples, for its label $y_x$.
11: $I = I \cup \{x\}$
12: $L = L \cup \{y_x\}$
13: end while
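The selection loop in steps 9 to 13 can be sketched as follows, given precomputed scores. The sketch assumes that "far away from all the labeled examples" means lying outside a per-feature box of half-width $3\sigma^j$ around every labeled example, and that `oracle` is a caller-supplied function returning the label of an example index. Both are assumptions made for illustration, as are the function and variable names. The loop also stops early if no eligible example remains.

```python
import numpy as np

def select_examples(X, scores, sigma, oracle, budget):
    """Greedily query high-score examples, skipping neighbors of labeled ones."""
    X = np.asarray(X, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    available = np.ones(len(X), dtype=bool)
    selected, labels = [], []                  # play the roles of the sets I and L
    while len(selected) < budget and available.any():
        # Highest-scoring example that is still eligible
        k = int(np.argmax(np.where(available, scores, -np.inf)))
        selected.append(k)
        labels.append(oracle(k))
        # Preclude every example inside the 3-sigma box around x_k (x_k included)
        near = np.all(np.abs(X - X[k]) <= 3.0 * sigma, axis=1)
        available &= ~near
    return selected, labels

# Toy usage: five 1-dimensional points with hand-picked scores
X = np.array([[0.0], [0.1], [5.0], [5.2], [10.0]])
y = [0, 0, 1, 1, 0]
picked, labels = select_examples(X, scores=[5, 4, 3, 2, 1], sigma=[0.5],
                                 oracle=lambda k: y[k], budget=3)
# picked == [0, 2, 4]: indices 1 and 3 are skipped as neighbors of 0 and 2
```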

Experimental Results.
In this section, we compare SEDER with NNDM [10], Interleave (the best method proposed in [14]), random sampling (RS) and SEDER with $b^j = 1$ for $j = 1, \ldots, d$ (abbreviated as Kernel, which is equivalent to using the kernel density estimator to estimate the density and to get the scores) on both synthetic and real data sets. For this purpose, we run these methods until all the classes have been discovered, and compare the number of label requests each method needs in order to find a certain number of classes. Note that SEDER, NNDM and Kernel are deterministic, whereas the results for Interleave and random sampling are averaged over 100 runs.
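The evaluation protocol counts how many label requests a method needs before every class has been seen. The following sketch shows that count for an arbitrary query sequence, together with the random sampling baseline averaged over repeated runs. The function names and the toy labels are illustrative and are not the experimental code behind the reported results.

```python
import numpy as np

def requests_to_discover(labels_in_query_order, num_classes):
    """Number of label requests until `num_classes` distinct classes are seen.

    Returns None if the query sequence ends before that happens.
    """
    seen = set()
    for t, label in enumerate(labels_in_query_order, start=1):
        seen.add(label)
        if len(seen) == num_classes:
            return t
    return None

def random_sampling_average(y, runs=100, seed=0):
    """Average number of requests random sampling needs to find all classes in y."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    m = len(np.unique(y))
    counts = [requests_to_discover(y[rng.permutation(len(y))].tolist(), m)
              for _ in range(runs)]
    return float(np.mean(counts))

# Toy skewed data set: 95 majority examples and 5 rare examples
y = [0] * 95 + [1] * 5
rs_avg = random_sampling_average(y)
```

A deterministic method such as SEDER yields a single query sequence, so `requests_to_discover` is applied once; randomized methods are averaged as in `random_sampling_average`.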
Here we would like to emphasize that only SEDER, RS and Kernel do not need any prior information about the data set, whereas NNDM and Interleave need extra information about the data set as inputs, such as the number of classes and the proportion of the different classes. When such prior information is not available, which is quite common in real applications, NNDM and Interleave are not applicable.
Figure 1(a) shows the underlying distribution of a 1-dimensional synthetic data set. The majority class with 2000 examples has a Gaussian distribution with a large variance, whereas the minority classes with 50 examples each correspond to the two lower-variance peaks. As can be seen from this figure, the first two examples selected by SEDER (red stars) are both from the regions where the density changes the most. Figure 1(b) shows a 2-dimensional synthetic data set. The majority class has 2000 examples (blue dots) with a Gaussian distribution. The four minority classes (black circles) all have different shapes, and contain 267, 280, 84 and 150 examples, respectively. This data set is similar to the one used in [11]. To discover all the classes, SEDER only needs to label 6 examples, which are represented by red stars in the figure, whereas random sampling needs to label more than 50 examples on average.
The properties of the real data sets used in our experiments are summarized in Table 2. Notice that all these data sets are skewed: the proportion of the smallest class is less than 5%. For the last three data sets (Page Blocks, Abalone and Shuttle), it is even less than 1%. We refer to these three data sets as 'extremely' skewed, whereas the remaining two data sets (Ecoli and Glass) are referred to as 'moderately' skewed. First, let us focus on the 'moderately' skewed data sets, which are shown in Figure 2.
On the Ecoli data set, to discover all the classes, NNDM needs 36 label requests, Interleave needs 41 label requests on average, RS needs 43 label requests on average, Kernel needs 78 label requests, and SEDER only needs 20 label requests. On the Glass data set, to discover all the classes, NNDM needs 18 label requests, Interleave needs 24 label requests on average, RS needs 31 label requests on average, Kernel needs 102 label requests, and SEDER needs 22 label requests. Therefore, if the data set is 'moderately' skewed, the performance of SEDER is better than or comparable with that of NNDM, which requires more prior information than SEDER, including the number of classes in the data set and the proportion of the different classes.

[Figure 1: Synthetic data sets.]
Next, let us look at the 'extremely' skewed data sets. For example, in the Shuttle data set, the largest class has 580 times more examples than the smallest class. On the Page Blocks data set (Figure 3(a)), to discover all the classes, SEDER needs 36 label requests, NNDM needs 23 label requests, Interleave needs 77 label requests on average, RS needs 199 label requests on average, and Kernel needs more than 1000 label requests. On the Abalone data set (Figure 3(b)), to discover all the classes, SEDER needs 316 label requests, NNDM needs 179 label requests, Interleave needs 333 label requests on average, RS needs 483 label requests on average, and Kernel needs more than 1000 label requests. On the Shuttle data set (Figure 3(c)), to discover all the classes, […]
Based on the above results, we have the following observations. First, SEDER, RS and Kernel require no prior information about the data set, and yet SEDER is significantly better than RS and Kernel in all the experiments. Second, if the data is not separable, the performance of Interleave is worse than that of SEDER (except in Figure 3(c)), even though it is given the additional information about the number of classes in the data set. Finally, although NNDM is better than SEDER on the 'extremely' skewed data sets, in real applications it is very difficult to estimate the number of classes in the data set, not to mention the proportion of the different classes. If the information provided to NNDM is not accurate enough, the performance of NNDM may be negatively affected. Moreover, when such information is not available, NNDM is not applicable at all.

Conclusion.
In this paper, we have proposed a new method for rare category detection named SEDER, which requires no prior information about the data set. It implicitly estimates the density using specially designed exponential families, and selects for labeling the examples where the local density changes the most. To the best of our knowledge, SEDER is the first method tailored for the very challenging case where no prior information about the data set is available. Therefore, we expect it to be more suitable for many real applications. The proposed method is based on sound theoretical analysis, and its effectiveness is demonstrated by extensive experimental evaluations.