Density-based clustering for data containing two types of points

When only one type of point is distributed in a region, clustered points can be seen as an anomaly. When two different types of points coexist in a region, they overlap at different places with various densities. In such cases, the meaning of a cluster of one type of point may be altered if points of the other type show different densities within the same cluster. If we consider the origins and destinations (OD) of taxicab trips, the clustering of both in the morning may indicate a transportation hub, whereas clustered origins and sparse destinations (a hot spot where taxis are in short supply) could suggest a densely populated residential area. This cannot be identified by previous clustering methods, so it is worthwhile studying a clustering method for two types of points. The concept of two-component clustering is first defined in this paper as a group containing two types of points, at least one of which exhibits clustering. We then propose a density-based method for identifying two-component clusters. The method is divided into four steps. The first estimates the clustering scale of the point data. The second transforms the point data into the 2D density domain, where the x and y axes represent the local density of each type of point around each point, respectively. The third determines the thresholds for extracting the clusters, and the fourth generates two-component clusters using a density-connectivity mechanism. The method is applied to taxicab trip data in Beijing. Three types of two-component clusters are identified: high-density origins and destinations, high-density origins and low-density destinations, and low-density origins and high-density destinations. The clustering results are verified by the spatial relationship between the cluster locations and their land-use types over different periods of the day.


Introduction
When one type of point is distributed over a region, an area containing points of high density can be seen as a cluster, indicating a meaningful anomaly. A number of densitybased clustering methods have been developed to identify clusters from data containing only one type of point (Ester et al. 1996, Ankerst et al. 1999, Pei et al. 2006, Wan et al. 2012). However, when two different types of points coexist and overlap with various density combinations (say, origins and destinations (OD) of taxicab trips, crime venues and traffic accidents, restaurants and shops in a city), the meaning of a cluster of one type of point may change if points of the other type covered by this cluster have a different density (i.e., aggregation, random, or sparse). Here, we define a two-component cluster as a group containing two types of points, at least one of which exhibits clustering. We then give some examples to illustrate the significance of identifying two-component clusters.
(1) ODs of taxicab trips: A cluster with a high density of origins and destinations may indicate a transportation hub. One with a high density of origins and a low density of destinations in the morning (a hot spot where taxis are in short supply) may indicate a residential area with a dense population.
(2) Traffic accidents and crime venues in a city: High levels of traffic accidents with sparse crime venues may indicate a flaw in the design of the local transportation system; those with higher densities of crime venues may indicate the mutual inducement of traffic accidents and crimes.
(3) High-temperature events and precipitation events: Clustered high-temperature events aggregated with precipitation events may indicate a flood; those with a low density of precipitation events may indicate a drought.
From these examples, we can appreciate the importance of identifying two-component clusters. Regarding (1), taxi drivers may find more opportunities in the cluster of highdensity origins and low-density destinations, whereas passengers wishing to find a taxi should move to the cluster of low-density origins and high-density destinations. Of course, a cluster with a high density of origins and destinations is also a good choice for both taxi drivers and passengers. In addition, the city government can locate hot spots where taxis are in short supply and those where they exceed demand. This information may shed light on the city dynamics and thereby assist urban planning. Regarding (2), the transportation system may require reconstruction around the cluster of high-density traffic accidents and sparse crime venues, while more policemen and surveillance monitors should be deployed to the area of aggregated crime venues and traffic accidents. For Point (3), different relief measures should be taken at clusters where temperature and precipitation events have different combinations. Nevertheless, most previous clustering methods can only deal with one type of point and are difficult to identify two-component clusters. To this end, it is important to develop a method for two-component clusters.
In recent years, significant progress has been made in density-based clustering methods. Ester et al. (1996) proposed Density Based Spatial Clustering of Applications with Noise (DBSCAN) to identify high-density point groups from point data. However, in DBSCAN, the parameters (e.g., eps, the distance used to separate clusters from noise) must be defined in an interactive way. To adjust the parameters visually, Ankerst et al. (1999) constructed an enhanced density-connected algorithm, referred to as Ordering Points To Identify the Clustering Structure (OPTICS). Many modifications of DBSCAN were subsequently proposed to enhance the cluster identification ability (Sander et al. 1998, Daszykowski et al. 2001, Roy et al. 2005, Pei et al. 2009, Wang and Huang 2009, Ashour et al. 2011, Tran et al. 2013. However, existing clustering models are still limited to one type of point and are unsuitable for clusters containing two (or more) types of points.
The spatial scan statistic relates to two different types of points (i.e., case and control). The cases are seen as a single-point process occurring with low probability, whereas the controls are generally distributed across the research area (Kulldorff 1997, Kulldorff et al. 2006, Neill 2006. The spatial scan method looks for the most likely cluster by sweeping over all zones circumscribed by circles of varying radii centered at each polygon in the area. Later, modifications were made to adapt the method to arbitrarily shaped clusters (Duczmal and Assunção 2004, Janeja and Atluri 2008, Pei et al. 2011, Wan et al. 2012. The spatial scan method finds clusters with a high ratio between cases and controls. Nevertheless, a high-ratio area identified by the spatial scan method may not signify a real anomaly of cases (namely, false positive), because the number of controls could be quite low. To better visualize a high-risk area, Guo et al. (2012) used the shared nearest distance to form the OD points of taxicab trips into a network, the nodes of which are actually similar-size clusters. The distribution of the net flow ratio can then be determined at each node. Although this method reveals anomalies of high ratios between inflow and outflow, smoothed by a window covering the same number of points instead of the same size, the anomalies identified are not those in which points are aggregated.
A literature review shows that few methods are able to identify clusters containing two types of points. To solve this problem, we construct a method for identifying twocomponent clusters. Inspired by the 2D Fourier transform, we first determine eps for constructing the density domain of point data. Second, we transform the point data into the 2D density domain, in which the x and y axes represent the local density of each type of point around each point, respectively. The thresholds for identifying two-component clusters are then determined. Points located between the thresholds are extracted to finally form clusters based on their density and connectivity. The rest of the paper is arranged as follows. Section 2 introduces some basic concepts of two-component clusters. In Section 3, the clustering algorithm is described in detail, and Section 4 presents clustering results with synthetic data. A case study using Beijing taxicab data is given in Section 5. Section 6 provides a summary of this paper, as well as directions for future research.

Basic concepts
As our method uses density-based clusters, some of the basic concepts relating to two types of points are modified and extended from those defined in DBSCAN.
(1) Mixed eps-neighborhood of Point p : n points, including n1 of Type A (o i i ¼ 1; 2; . . . n1 ð Þ ) and n2 of Type B (q j j ¼ 1; 2; . . . n2 ð Þ ), in the circle centered at p with radius eps, where n ¼ n 1 þ n 2 : Here, Point p can be a point of A or B (Figure 1  of B. In Figure 2, p 2 and p 3 are the mixed core points wrt eps = e, MinPtA = 2, and MinPtB = 2. Because no point of A is in the mixed eps-neighborhood of Point p 1 , it cannot be a mixed core point. Note that we abbreviate 'high density of A and high density of B' as HA-HB hereafter. Similarly, 'high density of A and low density of B' is denoted by HA-LB, and 'low density of A and high density of B' is shortened to LA-HB. (3) Mixed core point for HA-LB: Point p is a mixed core point wrt eps, MinPtA, and MaxPtB if its mixed eps-neighborhood N m eps p ð Þ contains at least MinPtA points of A and at most MaxPtB points of B. The definition of the mixed core point for LA-HA is similar. In this paper, we only consider three types of mixed core points: HA-HB, HA-LB, and LA-HB. Other types of core points can be defined in the same way.
(4) Mixed density-connected: Point o is considered to be density-connected to Point p if there is a collection of points q 1 ; q 2 . . . q n (with o ¼ q 1 ; p ¼ q n ) such that q iÀ1 2 N m eps q i ð Þ i ¼ 2; 3; . . . ; n ð Þmust be a mixed core point and belongs to the same type.
(5) HA-HB cluster: Cluster C is an HA-HB cluster if it satisfies following three conditions. (i) C is composed of mixed core points for HA-HB; (ii) "p; q , if p 2 C and q is mixed density-connected to p, then q 2 C ; (iii) "p; q 2 C , p is mixed density-connected to q. Other types of clusters (namely, HA-LB clusters and LA-HB clusters) can be defined in the same way. In this paper, HA-HB, HA-LB, and LA-HB clusters are all two-component clusters.

Two-component clustering method
The method is divided into four steps. First, we determine eps , which is used to construct the density domain of point data. Second, the point data are transformed into the density domain, which is displayed as a 2D image ( Figure 3). Third, we estimate four thresholds for separating two-component clusters from noise. Specifically, one is for high-density points of Type A (X h ) (i.e., MinPtA in Section 2), one is for low-density points of A (X l ) (MaxPtA), and similarly for points of Type B (Y h and Y l ) (MinPtB and MaxPtB). Using these thresholds, the density domain is divided into nine regions ( Figure 3). Fourth, we extract mixed core points (e.g.,, mixed core points for HA-HB clusters are contained in the area between X h and Y h ) and identify different types of clusters from the core points based on the mixed density-connectivity mechanism. We now describe each step in detail.

Choosing eps by determining the clustering scale
Given a point, its location in the density domain is determined by the local density of Type A (x-axis) and that of Type B (y-axis) ( Figure 3). Specifically, the local density of A can be represented by the number of points of Type A in its eps-neighborhood, and similarly for the local density of B. Differing from the estimation of eps in the previous literature (Ester et al. 1996, Ashour et al. 2011, Jiang et al. 2011, which is used to separate the clustered points from noise after fixing MinPts, in this context, eps is the only parameter to transform spatial points into the density domain. In practice, the value of eps is related to the clustering scale in the data and should be in a certain range. On the one hand, if eps is too large (say, larger than the scale of a cluster), the local density of a point in the cluster will be underestimated because more noisy points are included in its epsneighborhood (Pei et al. 2007). On the other hand, if eps is too small, the estimation of the local density of a point could be unstable because of the inhomogeneous distribution of points over space (Pei et al. 2006). To determine the relationship between eps and the clustering scale, which may result in a good clustering result, we first estimate the clustering scale using H-function. Then, we estimate the clustering scale factor between eps and clustering scale (note that eps is the product of the estimated clustering scale and the clustering scale factor).
Previous studies have shown that the clustering scale can be estimated by Ripley's Kfunction and H-function (Ripley 1976, Besag 1977. The K-function is defined as where & h ij is the distance between Point i and Point j, n is the number of points, and λ is the density of points in the research area. The H-function is defined as Nevertheless, because the estimation of the clustering scale is significantly affected by the distance between clusters (i.e., domains in Kiskowski et al. (2009)), Kiskowski et al. (2009) used the derivative of the H-function to estimate the clustering scale (in detail, the clustering scale is the value yielding the minimum of the derivative function). In this paper, we use the derivative of Ripley's H-function to estimate the clustering scale.
Simulations under different scenarios also support this theory (see the supplemental document for details).
To determine the appropriate values of clustering scale factor, we generated different synthetic data sets. Experiments with synthetic data have indicated that when eps is around a quarter of the clustering scale (i.e., the scale of cluster(s) in a point data set), the classification error rate is minimized (for details of the experiments, please see the supplemental document). For this reason, we set the clustering scale factor to 1/4 when performing the clustering method.

Density domain of two types of points
With eps, we can transform data containing two types of points into the density domain. If we use two thresholds for each type of point (one is the low-density threshold and the other is the high-density threshold), then there are four thresholds (specifically, X h , X l , Y h ; and Y l ). The density domain is divided into nine regions (Figure 3). Thus, we call Region 1 the LA-LB region and name each region accordingly up to Region 9, which is the HA-HB region.

Determination of the thresholds and two-component clusters
As discussed in Section 3.3, four thresholds (i.e., X h , X l , Y h , and Y l ) must be estimated to divide the density domain. Next, we take X h and X l as examples to illustrate the determination of the thresholds. Threshold X h is used to define high-density clusters of A, whereas X l is used to define low-density groups of A. These two thresholds are determined under the assumption that all points of A are randomly distributed over the region (namely, the points obey a spatial Poisson process). As a result, before determining the thresholds, we must estimate the density (λ ) of the points of A as follows: where d i;k is the distance of the kth closest point to Point i, and n is the number of points (Pei et al. 2007). Here, we set k = 1. The thresholds are then determined from the probability density function of the Poisson process (i.e., P x ¼ k ð Þ¼ e Àλπeps 2 λπeps 2 ð Þ k k! ) at a certain significance level, where eps has already been determined based on the estimation of the clustering scale, and λ is the density of the points of A. We set the significance level to 0.2 and then estimate X h and X l such that 10% of the points are greater than X h and 10% are less than X l . As we know, the significant level is traditionally set to 0.1, 0.05, and 0.01. In the context, the larger value indicates that more clusters will be found. Here, we set the significant level to a larger value (i.e., 0.2) to ensure that all possible clusters will be identified even if some false positives are generated. The other parameters (i.e., Y h and Y l ) are estimated in the same way. Once all thresholds have been estimated, they are used to divide the density domain into nine regions.

Generating density-connected clusters
With the estimated parameters, different types of clusters can be generated from corresponding regions of the density domain. Note that only three types of clusters (HA-HB, HA-LB, and LA-HB) are considered in this paper. HA-HB clusters can be generated from the points in HA-HB region of the density domain based on the mixed density-connected mechanism. Similarly, HA-LB (LA-HB) clusters can be produced from points in HA-LB (LA-HB) region. Here, we only describe how an HA-HB cluster is generated, and other types of clusters can be identified in the same way. We first randomly choose a mixed core point (it can be either a point of Type A or that of Type B) from HA-HB region and assign it a cluster ID. Then, we find the next mixed core point for HA-HB in its neighborhood and assign the same cluster ID. This process proceeds until no unclassified mixed core point for HA-HB is found in the neighborhoods of the classified points of the same ID. All points with the same ID thus form a cluster. Another cluster begins with an unclassified mixed core point for HA-HB. All HA-HB clusters are identified until no unclassified mixed core point for HA-HB can be found in the whole data set. Similar to the relationship between 1D Fourier and 2D Fourier transform (specifically, 2D Fourier transform can be realized by performing 1D Fourier transform horizontally, followed by the same transform vertically), our method can be accomplished by overlaying the results of clustering two types of points separately. We omit the details for short. With the same idea, our method can be extended to those multi-component data, which contain more than two types of points.

Complexity of the algorithm
The complexity of the clustering algorithm is the sum of the complexities of the following procedures: the determination of eps, estimation of the density thresholds, and the connection process of mixed core points. Because the major step in determining eps is the calculation of the kth nearest neighbor, the complexity of this step is O(nlog(n)), where n is the number of points (Pei et al. 2006). The major step in estimating the thresholds is the same as that for eps. Therefore, its complexity is O(nlog(n)). The complexity of the density-connected process is again O(nlog(n)) (Pei et al. 2006). Hence, the overall complexity of the algorithm is O(nlog(n)).

Experiments using synthetic data
To validate the method, we use the synthetic data shown in Figure 4a, where squares symbolize points of Type A and triangles symbolize points of Type B. We generate five clusters: the inverse 'T' (top, HA-LB), 'cross' (near middle, LA-HB), 'square' (right-hand side, HA-HB),  To determine the clusters, we need to estimate eps in light of the clustering scale. The derivative of the H-function is shown in Figure 5, indicating that the clustering scale is 13.01. As indicated in Section 3.2, we set the clustering scale factor to 1/4 (namely, eps is a quarter of the clustering scale, or 3.25). The density domain is shown in Figure 6. According to the method described in Section 3.4, the thresholds of X h , X l , Y h , Y l are estimated to be 14, 6, 14, and 6, respectively. The regions of different clusters are then generated using the thresholds. With these thresholds, we can determine the HA-HB, HA-LB, and LA-HB clusters, as displayed in Figure 4b. The error rate of classification is 2.25%, indicating that our method has successfully identified the twocomponent clusters in the data set.
To further validate our method, we calculate the error rates at different values of the clustering scale factor, ranging from 1/16 to 1 (the first line in Table 1). The result shows that the minimum error rate is acquired when the clustering scale factor is 1/4, which confirms our strategy of estimating eps. We then generate 199 further data sets for further verification. The mean error rates at different clustering scale factors are shown in Table 1 (the second line). The results demonstrate that low error rates are found when eps is between 3/16 and 3/8 of the clustering scale. Specifically, the lowest error rate is found at 1/4. Experiments with four different types of synthetic data in the supplemental document are also in agreement with the result.

Case study
Although the spatial distribution of taxi trip ODs has often been considered (Lee et al. 2008, Andrienko et al. 2009, Yue et al. 2011, Guo et al. 2012, Liu et al. 2012 Pan et al. 2013), few studies have sought to identify two-component clusters. In this section, we apply our method to Beijing taxicab data. The data include the ODs of more than 18,000 taxis in Beijing on Monday 18 April 2011. We consider two periods (7am-10am and 4pm-7pm), representing the morning and afternoon rush hours, to examine the distribution of two-component clusters. The ODs of each period are displayed in Figure 7.

Results
The value of eps for generating the density domain was estimated at 150 m. The thresholds of X h , X l , Y h , Y l for the morning period were determined to be 24, 13, 21, and 10, and those for the afternoon were 26, 14, 19, and 9, respectively. Note that λ is calculated using Equation (3). The traditional method calculates λ by dividing the number of ODs (n) by the area of the whole region (S). This may underestimate λ , because taxis cannot reach some regions (e.g., buildings and lakes) that are still included in the area (i.e., S). Pei et al. (2007) suggest using a method based on the nearest distance to reduce the estimation bias. This is Figure 6. Density domain of two types of points. Vertical blue lines indicate X l and X h (from left to right), and the horizontal blue lines indicate Y l and Y h (from bottom to top). The color in the figure represents the number of points sharing the same densities (the deeper the shade, the more points that share the same density of A and B). because, as indicated in Equation (3), λ can be thought of as the ratio between the number of points (if k = 1) and the sum of the areas of the circles, each of which is centered at a data point and has a radius of the distance to its nearest neighbor. Thus, areas that are not available for taxis' arrival can be excluded when calculating λ (because no such circle can be generated in those areas). For details, refer to Pei et al. (2007). Figures 8 and 9 show the clustering results of the morning and afternoon periods, respectively.

Discussion
Previous studies have shown that the distribution of taxi trip ODs is closely related to land-use types (Guo et al. 2012, Liu et al. 2012, Zhang et al. 2012, Pan et al. 2013. We seek to explain the clustering results based on the relationship between the cluster locations and their land-use types. The land-use map of Beijing is shown in Figure 10. The clusters are assigned to different land-use types through the following method. In each cluster, every point is assigned to the specific land-use type of its nearest cell. The nearest cell is determined based on the distance between the point and the cell centers. The cluster is assigned to the land-use type of the most points in the cluster. Note that in Figures 8 and 9, the color of each cluster represents its land-use type. The numbers and proportions (i.e., the proportion of the total assigned to each cluster type) of different clusters are listed in Tables 2 and 3. From Table 2, we see that in the morning, the greatest number of 'Residential' clusters is of type HO-LD, followed by LO-HD and HO-HD. The proportion of HO-LD is also larger than that of LO-HD. In the afternoon (Table 3), the numbers of clusters labeled as 'Residential' are, in descending order, of types LO-HD, HO-HD, and HO-LD. At this time of day, the proportion of LO-HD is larger than that of HO-LD. The spatiotemporal variation of clusters in the residential areas indicates that, in the morning, hot spots that are short of taxis are dominant, whereas in the afternoon, hot spots generating a surplus of taxis are more common.
Unlike 'Residential' zones, the numbers and proportions of HO-LD clusters are smaller than those of LO-HD around 'Public facilities' and 'Educational institutions' in the morning, while in the afternoon the numbers and proportions of HO-LD clusters are greater than those of LO-HD. These spatiotemporal variations indicate the inverse pattern from that in 'Residential' areas. The above phenomena agree with the common-sense notion that more residents go to work (i.e., to public facilities and educational institutions) in the morning and leave for home in the afternoon. In addition, the number of HO-HD clusters is the highest for 'Educational institutions' in both morning and afternoon. The reason for 'Educational institutions' generating many HO-HD clusters is that in China many 'Educational institutions' also include residential areas mainly for staffs and their families, so they are potentially hot places for both origins and destinations no matter what the period is. As a result, it is not strange to see many HO-HD clusters appearing in 'Educational institutions'.
Note that in commercial areas, besides HO-HD clusters, the number of LO-HD clusters is larger than that of HO-LD in both periods. The existence of a high number of HO-HD clusters is caused by the high mobility of residents in commercial areas. Because of the high density of buildings and population in the 'Commercial' zone, taxis are not usually allowed to wait for passengers. This reduces the number of HO-LD clusters to some extent. A similar phenomenon is found in the 'Industrial' areas. This is because most industrial areas are located in the suburbs of Beijing, and taxis tend to leave soon after dropping off (because it is difficult to find a fare in areas far from the city center), which causes the number of HO-LD clusters to be consistently less than that of LO-HD.
To sum up, in the morning, the proportion (i.e., more HO-LD and fewer LO-HD and HO-HD) of three types of cluster for 'Commercial,' 'Public facilities,' 'Industrial,' and 'Educational institutions' shows inverse pattern (i.e., fewer HO-LD and more LO-HD and HO-HD) to 'Residential', which indicates the interaction between 'Residential' and other land-use types. Nevertheless, in the afternoon, due to 'more HO-HD clusters for "Educational institutions" ' and 'fewer HO-LD cluster for "Commercial" and "Industrial" ' (see explanations above), the inverse pattern is not significant.
We can also study the spatial relationship between transportation centers and HO-HD clusters to verify the clustering result. If the distance between the transportation center and its nearest cluster is less than 500 m, we state that the cluster is related to    Table 4. In the morning, 11 HO-HD clusters (a proportion of 0.5) are found to be related to transport centers, and in the afternoon 14 HO-HD clusters (a proportion of 0.583) are the closest to a transportation center. Both the number and proportion of HO-HD clusters are higher than those of HO-LD and LO-HD. This indicates that in a metropolis like Beijing, to save time over a long journey, passengers commonly take a taxi to their nearest transport center, and then transfer to the subway to reach their destination, or take a taxi to their destination after transferring from a transportation center. The cluster results have been verified by analyzing the relationship between the temporal variation of hot spots, land-use types, and transport centers. Our method effectively locates two-component clusters that may have a completely different meaning from 'traditional' ones. For example, an HO-HD cluster does not indicate a shortage of taxis, whereas an LO-HD cluster implies an urgent demand for taxis, although both would be seen as a cluster of destinations using traditional methods (in other words, they are treated as the same type of cluster if using the traditional methods).

Conclusion
In this paper, we first defined the concept of a two-component cluster. In such clusters, there are two types of points, at least one of which exhibits clustering. We proposed a method of identifying three types of two-component clusters (i.e., HA-HB, HA-LB, and LA-HB). Our method requires five parameters, namely eps, X h , X l , Y h , and Y l . Experiments on synthetic data showed that eps should be around 1/4 of the clustering scale of the data. The other thresholds (i.e., X h , X l , Y h and Y l ) for separating clusters from noise were determined by setting the significance level under the assumption that points are distributed randomly across the whole domain. Two-component clusters can then be found in the HA-HB, HA-LB, and LA-HB regions of the density domain, which are generated from these thresholds. The method was applied to the identification of clusters of taxicab trip ODs. The significant relationship between the locations of the clusters and their land-use types over different periods verifies the efficacy of the twocomponent cluster method.
Two-component clustering provides a new insight into point data anomalies, in which the ground truth may be related to two types of points. Compared to traditional clustering methods for only one type of point, the two-component method provides more information about the clusters. In the case study of taxi data, our method was  195 able to locate places with an urgent demand for taxis, as well as those with a surplus of taxis, by analyzing the information of origins and destinations. This result, although very useful to urban planning, would be difficult to determine based on previous conceptions and methods. Besides trip OD data, this method can also be applied to other two-component datafor example, crime venues and traffic accidents, extreme precipitation and temperature events, and bank and restaurant locations. Moreover, this method can be extended to analyze multi-component data, say, a data set of OD points and Point Of Interests (POIs).