A large-scale clustering scheme for kernel K-Means

Kernel functions can be viewed as a non-linear transformation that increases the separability of the input data by mapping them to a new high-dimensional space. Incorporating kernel functions enables the K-Means algorithm to explore the inherent structure of the data in the new space. However, previous applications of the kernel K-Means algorithm have been confined to small corpora due to its expensive computation and storage costs. To overcome these obstacles, we propose a new clustering scheme that changes the clustering order from the sequence of samples to the sequence of kernels, and employs a disk-based strategy to manage the data. Our experiments on handwritten digit recognition demonstrate that the new scheme is very efficient on a large corpus, saving more than 90% of the running time.


Introduction
A number of kernel-based learning methods have been proposed in recent years [1,2,3]. All of these methods employ kernel functions to increase the separability of the data. Generally speaking, a kernel function implicitly defines a non-linear transformation that maps the data from their original space to a high-dimensional space where they are expected to be more separable. Consequently, kernel methods may achieve better performance by working in the new space.
K-Means is an unsupervised learning algorithm that partitions the data set into a selected number of clusters under some optimisation measure. For example, we often want to minimise the sum of squared Euclidean distances between the samples and the centroids. The assumption behind this measure is that the data space consists of isolated elliptical regions. However, such an assumption does not always hold in specific applications. To tackle this problem, one idea is to investigate other measures, e.g., the cosine similarity used in information retrieval. An alternative idea is to map the data to a new space that satisfies the requirement of the optimisation measure. In this case the kernel function is a good choice.
Usually the extension from K-Means to kernel K-Means is realised simply by expressing the distance in terms of the kernel function [4,5]. However, such an implementation suffers from serious problems, such as the high clustering cost due to repeated calculations of kernel values, or insufficient memory to store the kernel matrix, which make it unsuitable for large corpora. We propose an efficient large-scale clustering scheme to break through this limitation. Unlike the previous algorithm, in which the clustering iterates over the sequence of samples, our scheme iterates over the sequence of kernels. This enables us to use a disk-based data management strategy that theoretically extends the storage space to the entire disk, and minimises the number of I/O operations as well.

Kernel Function
Sometimes it is not sufficient for a given learning machine to work in the input space, because the assumption behind the machine does not match the real pattern of the data. For example, SVM and the Perceptron require that the data be linearly separable, while K-Means with Euclidean distance expects the data to be distributed in elliptical regions. When the assumption does not hold, we may apply some transformation to the data, mapping them to a new space where the learning machine can be used. The kernel function provides us a means to define such a transformation.
Suppose we are given a set of samples x_1, x_2, ..., x_N, and a mapping function φ that maps x_i from the input space R^D to a new space Q. The kernel function is defined as the dot product in the new space Q:

K(x_i, x_j) = φ(x_i) · φ(x_j)   (2.1)

An important fact about the kernel function is that it is constructed without knowing the concrete form of φ [6]. Namely, the transformation is defined implicitly. Three commonly used kernel functions are listed below:

Polynomial:  K(x_i, x_j) = (x_i · x_j + 1)^d   (2.2)
Gaussian:    K(x_i, x_j) = exp(−||x_i − x_j||² / (2σ²))   (2.3)
Neural:      K(x_i, x_j) = tanh(a (x_i · x_j) + b)   (2.4)

The main weaknesses of kernel functions include the following. First, some properties of the new space, such as its dimensionality and value range, are lost due to the lack of an explicit form for φ. Second, the appropriate kernel form for a given data set has to be determined through experiments. In addition, the computation and storage costs increase by a wide margin.
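As a concrete illustration, all three kernels can be evaluated from quantities in the input space alone, without ever forming φ. The sketch below is ours, not the paper's; the parameter defaults are arbitrary placeholders.

```python
import math

def dot(x, y):
    # Dot product in the input space; each kernel reduces to this
    # or to the squared Euclidean distance.
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, d=2):
    # K(x, y) = (x . y + 1)^d
    return (dot(x, y) + 1.0) ** d

def gaussian_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def neural_kernel(x, y, a=1.0, b=0.0):
    # K(x, y) = tanh(a (x . y) + b)
    return math.tanh(a * dot(x, y) + b)
```

Note that φ never appears: each kernel is computed from x · y or ||x − y||² alone, which is exactly why the transformation can remain implicit.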

From K-Means to Kernel K-Means
Suppose the data set has N samples x_1, x_2, ..., x_N. The K-Means algorithm aims to partition the N samples into K clusters, C_1, C_2, ..., C_K, and then returns the centre of each cluster, m_1, m_2, ..., m_K, as the representatives of the data set. Thus an N-point data set is compressed to a K-point "code book". The batch-mode K-Means clustering algorithm using Euclidean distance works as follows:

1. Initialise the K centres m_1, m_2, ..., m_K.
2. Assign each sample x_i to the closest centre, forming K clusters. Namely, compute the value of the indicator function f(x_i, C_k), which is 1 if m_k is the closest centre to x_i and 0 otherwise.
3. Recompute each centre m_k as the mean of the samples assigned to cluster C_k.
4. Repeat steps 2 and 3 until convergence.
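The batch algorithm above can be sketched in plain Python. This is a minimal illustration with random initialisation, not the paper's implementation:

```python
import random

def kmeans(data, k, iters=20, seed=0):
    """Batch K-Means with Euclidean distance (minimal sketch)."""
    rng = random.Random(seed)
    centres = rng.sample(data, k)        # step 1: initialise the K centres
    for _ in range(iters):
        # step 2: assign each sample to its closest centre
        clusters = [[] for _ in range(k)]
        for x in data:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(x, centres[c])))
            clusters[j].append(x)
        # step 3: recompute each centre as the mean of its cluster
        # (an empty cluster keeps its previous centre)
        for c, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                centres[c] = [sum(x[d] for x in members) / len(members)
                              for d in range(dim)]
    return centres
```

The returned centres form the K-point "code book" that represents the data set.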
The key issue in extending traditional K-Means to kernel K-Means is the computation of distance in the new space. Let u_i = φ(x_i). The Euclidean distance between u_i and u_j is written as:

||u_i − u_j||² = K(x_i, x_i) − 2 K(x_i, x_j) + K(x_j, x_j)

and, since the centre m_k is the mean of the mapped samples in C_k, the distance between u_i and m_k becomes:

||u_i − m_k||² = K(x_i, x_i) − (2/|C_k|) Σ_{x_j ∈ C_k} K(x_i, x_j) + (1/|C_k|²) Σ_{x_j, x_l ∈ C_k} K(x_j, x_l)   (3.7)

By applying (3.7) to the traditional K-Means, we obtain the kernel-based K-Means algorithm: the assignment step uses (3.7) as the distance, and each centre is represented implicitly by its member set rather than explicitly in the new space.
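In code, the centroid distance (3.7) needs nothing but kernel values. A sketch, assuming the full kernel matrix is available as a nested list K:

```python
def kernel_distance_sq(K, i, cluster):
    """Squared feature-space distance from sample i to the centroid of
    `cluster` (a list of sample indices), using only kernel values K[u][v].
    Implements formula (3.7)."""
    n = len(cluster)
    # sum_{j in C_k} K(x_i, x_j)
    cross = sum(K[i][j] for j in cluster)
    # sum_{j,l in C_k} K(x_j, x_l)
    within = sum(K[j][l] for j in cluster for l in cluster)
    return K[i][i] - 2.0 * cross / n + within / (n * n)
```

With a linear kernel K(x, y) = x · y this reduces to the ordinary squared Euclidean distance to the cluster mean, which is a convenient sanity check.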

Large Scale Clustering Scheme
A careful analysis finds that the kernel K-Means algorithm described in the last section has a serious problem on a large corpus. Note that in the assignment step, the kernel values K(x_i, x_j) are computed again and again, which is especially costly when the data set is large or the dimensionality is high. Therefore, we should separate the computation of kernels from the clustering. That is, compute the kernel matrix H, which contains all of the kernel values, before step 1. Nevertheless, this causes another problem: the kernel matrix H may be too large to be stored in memory.
Say the data set has 30,000 samples. The number of kernel values that need to be determined is about 450,000,000 (even exploiting the symmetry of H), which would exceed the memory capacity of a common PC. Moreover, the training corpus for a practical application is usually much larger. Some solutions have been proposed to speed up clustering on large corpora [7,8]. All of these methods rely on knowledge of the concrete form of the samples and some properties of the data set. For example, some techniques need to know the dimensionality and value range of the data, which are unavailable for kernel K-Means.
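The storage figure is easy to check. Assuming each kernel value is stored as an 8-byte double (our assumption; the paper does not state the precision):

```python
N = 30_000
entries = N * (N + 1) // 2   # kernels on or below the diagonal of H
bytes_needed = entries * 8   # 8 bytes per double-precision value
print(entries)               # 450_015_000: the "about 450,000,000" above
print(bytes_needed / 2**30)  # roughly 3.4 GiB, far beyond the RAM of a
                             # common PC of that era
```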
We solve the large-scale clustering problem in a different fashion. First, we change the clustering order from the sequence of samples to the sequence of kernels, which enables us to handle the kernel matrix H efficiently. Moreover, we use the disk space to make up for the insufficiency of memory, so the size of H can theoretically be extended to the capacity of the entire disk. We split H into blocks whose size is determined according to the I/O capability and the affordable memory. For example, assuming H has N² kernels and the memory can store S² of them, H is split into ⌈N/S⌉² blocks of size S by S. The new clustering scheme is described below.
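The block layout can be sketched as follows. Given the symmetry of H, only the blocks on or below the diagonal need be stored (function names are ours):

```python
import math

def block_layout(N, S):
    """Split an N-by-N kernel matrix into S-by-S blocks and return the
    (row, col) index pairs of the blocks that must be stored, i.e. those
    on or below the diagonal; the rest follow by symmetry."""
    nb = math.ceil(N / S)    # blocks per side
    return [(r, c) for r in range(nb) for c in range(nb) if c <= r]

# The setting of Section 5: 40,000 samples, 4,000-by-4,000 blocks.
blocks = block_layout(40_000, 4_000)
print(len(blocks))   # 55 blocks, matching the count reported in the paper
```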
1. Compute the kernel matrix H and store every necessary block B to the disk.
2. Initialise the indicator variables f(x_i, C_k) with a starting partition.
3. Reset the accumulators that hold the per-sample and per-cluster kernel sums needed by (3.7).
4. Load the next block B of H from the disk.
5. For every kernel K(x_u, x_v) in B:
• Check the clusters that x_u and x_v belong to, denoted by the variables θ_u and θ_v.
• If u < v, ignore this kernel. (Only the kernels below or on the diagonal of H need be processed, given the symmetry of H.)
• Add K(x_u, x_v) to the accumulators for (x_u, θ_v) and (x_v, θ_u), and to the within-cluster sum when θ_u = θ_v.
6. Repeat steps 4 and 5 until every block has been processed.
7. For each training sample x_i and cluster C_k, compute the feature-space distance from the accumulated sums and update the indicator:

f(x_i, C_k) = 1 if C_k is the closest cluster to x_i, and 0 otherwise.   (4.9)

8. Repeat steps 3 to 7 until convergence.

This clustering scheme assumes that all of the variables f(x_i, C_k) can be maintained in memory. If this assumption does not hold because the corpus is too large, one can also split them into blocks as we did for the kernel matrix. A strong point of this scheme is that the computations in steps 1, 4, 5 and 6 depend only on the size of H. When we double the cluster number, the clustering time does not increase much.
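The per-kernel accumulation can be sketched as follows. This is our reconstruction of one iteration; for brevity it keeps H as a full nested list in memory, whereas the paper streams the blocks of H from disk:

```python
def kernel_kmeans_iteration(K, assign, k):
    """One iteration of kernel K-Means in 'kernel order': sweep the kernel
    values once, accumulating per-sample/per-cluster sums, then update the
    assignments. K is the kernel matrix; assign[i] is sample i's cluster."""
    n = len(K)
    size = [0] * k                            # |C_k|
    cross = [[0.0] * k for _ in range(n)]     # sum_{j in C_k} K(x_i, x_j)
    within = [0.0] * k                        # sum_{j,l in C_k} K(x_j, x_l)
    for i in range(n):
        size[assign[i]] += 1
    for u in range(n):
        for v in range(u + 1):                # on or below the diagonal only
            huv = K[u][v]
            cross[u][assign[v]] += huv
            if u != v:
                cross[v][assign[u]] += huv
            if assign[u] == assign[v]:
                within[assign[u]] += huv if u == v else 2.0 * huv
    # Distance (3.7) is K(x_i,x_i) - 2*cross/|C_k| + within/|C_k|^2;
    # K(x_i,x_i) is the same for every cluster, so drop it from the argmin.
    new_assign = []
    for i in range(n):
        d = [(-2.0 * cross[i][c] / size[c] + within[c] / size[c] ** 2)
             if size[c] else float('inf') for c in range(k)]
        new_assign.append(d.index(min(d)))
    return new_assign
```

Because the sweep touches each stored kernel value exactly once, its cost depends only on the size of H, which is the property the scheme relies on.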

K-Means Classifier
In order to compare the performance of kernel K-Means with traditional K-Means, we choose a pattern recognition problem for our experiments, because the classification error rate gives us a straightforward metric for evaluating performance.
The K-Means based classifier works as follows. In the training phase, a K-point "code book" is built for each class using the K-Means algorithm. In the test phase, the distance between a new sample x and class Γ_γ is computed as:

dist(x, Γ_γ) = (1/L) Σ_{l=1}^{L} ||x − m_l^γ||²   (5.1)

where m_l^γ is x's l-th nearest neighbour in the "code book" of class Γ_γ, and L is a smoothing factor that controls how many code words contribute to the distance. The sample is assigned to the class with the minimal distance. The K-Means classifier is similar to the K-Nearest-Neighbour classifier, except that it uses the small "code book" for generalisation instead of the entire data set.
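A sketch of the test-phase rule, assuming each class's code book is a list of centre vectors (the function names are ours):

```python
def class_distance(x, codebook, L):
    """Distance from sample x to a class: the average of the squared
    Euclidean distances to x's L nearest code words, as in (5.1)."""
    d = sorted(sum((a - b) ** 2 for a, b in zip(x, m)) for m in codebook)
    return sum(d[:L]) / L

def classify(x, codebooks, L):
    """Assign x to the class whose code book is closest;
    codebooks maps a class label to its list of centres."""
    return min(codebooks, key=lambda g: class_distance(x, codebooks[g], L))
```

With L = 1 this is nearest-code-word classification; larger L smooths the decision over several code words.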

Data Set
Our experiments use MNIST as the training and test set [9]. MNIST is a 10-class database designed for handwritten digit recognition. The training set has 60,000 samples, in which the size of each class varies from 5,421 (digit '5') to 6,742 (digit '1'). The test set has 10,000 samples, with exactly 1,000 samples per class. Every sample consists of 28 by 28 pixels whose values lie in the interval [0, 255]. In our experiments, the values were normalised to [0, 1].
We further divided the training set into two parts, the pure training set and the cross-validation set, in a ratio of 2:1. The form of the kernel function and its parameters were determined on the validation set. Through a series of preliminary experiments, we found that the neural kernel function (see formula 2.4) with parameter values a = 0.0045 and b = 0.11 works well on MNIST. We used these values in the following experiments.
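The preprocessing and the chosen parameters are easy to sanity-check; the bound below is our observation, not the paper's:

```python
def normalise(image):
    # Scale raw MNIST pixel values from [0, 255] to [0, 1].
    return [p / 255.0 for p in image]

# With pixels in [0, 1], the dot product of two 28x28 images is at most 784,
# so the neural kernel's argument a*(x.y) + b stays bounded; typical digit
# images have far fewer bright pixels, so arguments are usually much smaller.
a, b = 0.0045, 0.11
max_arg = a * 784 + b
print(max_arg)   # about 3.64, so tanh never overflows
```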

Experimental Results
Our first experiment compared the performance of kernel K-Means with traditional K-Means as a classifier. Table 5.1 presents their misclassifications on MNIST.
In this experiment the number of clusters K is set to 512, which was also determined by preliminary experiments on the cross-validation set. The parameter L is the smoothing factor in (5.1). The results show that kernel K-Means works consistently better than its traditional version. When L = 3, the best case for both, kernel K-Means achieves a 10.8% reduction in misclassifications. This supports the observation that a good kernel function can make data more separable by mapping them to a new space.

  L     K-Means   Kernel K-Means
  1     497       483
  3     471       420
  5     476       433
  8     511       471
  16    630       560
  32    784       695
  64    1003      889
  128   1401      1206

Table 5.1 Misclassifications of K-Means and kernel K-Means on MNIST

The second experiment demonstrated the efficiency of the new clustering scheme. We ran traditional K-Means and the new clustering scheme on the 40,000-sample pure training set and compared the time they used. Both algorithms were compelled to finish 20 iterations. The block size in the new scheme was set to 4,000 by 4,000, which resulted in 55 blocks. The experiment was run on a Pentium III PC. Table 5.2 presents the running time in minutes; it shows that most of the time was spent on the computation of kernels. If we used the standard kernel K-Means for clustering, in which the kernel matrix is recomputed in every iteration, the total running time would be more than 184 × 20 = 3680 minutes. So our new scheme saved at least 90% of the running time.
The time spent on I/O operations also takes a large portion, but it is already the minimum one can achieve. As we analysed before, a change in the number of clusters does not affect the new scheme much: when K was changed from 512 to 1024, the time for traditional K-Means doubled, while it increased only about 11.7% for the new scheme.

Summary
We have provided a comprehensive description of the kernel K-Means algorithm, as well as a new large-scale clustering scheme. All previous solutions for speeding up training on a large corpus depend on properties that are unavailable for kernel methods. Our scheme solves the problem by clustering over the kernels instead of the samples, and by using a disk-based data management strategy. The effectiveness and efficiency of the new scheme are demonstrated by our experiments on handwritten digit recognition.