OddBall : Spotting Anomalies in Weighted Graphs

. Given a large, weighted graph, how can we ﬁnd anomalies? Which rules should be violated, before we label a node as an anomaly? We propose the OddBall algorithm, to ﬁnd such nodes. The contributions are the following: (a) we discover several new rules (power laws) in density, weights, ranks and eigenvalues that seem to govern the so-called “neighborhood sub-graphs” and we show how to use these rules for anomaly detection; (b) we carefully choose features, and design OddBall , so that it is scalable and it can work un-supervised (no user-deﬁned constants) and (c) we report experiments on many real graphs with up to 1.6 million nodes, where OddBall indeed spots unusual nodes that agree with intuition.


Introduction
Given a real graph, with weighted edges, which nodes should we consider as "strange"? Applications of this setting abound: For example, in network intrusion detection, we have computers sending packets to each other, and we want to know which nodes misbehave (e.g., spammers, port-scanners). In a who-callswhom network, strange behavior may indicate defecting customers, or telemarketers, or even faulty equipment dropping connections too often. In a social network, like FaceBook and LinkedIn, again we want to spot users whose behavior deviates from the usual behavior, such as people adding friends indiscriminately, in "popularity contests".
The list of applications continues: Anomalous behavior could signify irregularities, like credit card fraud, calling card fraud, campaign donation irregularities, accounting inefficiencies or fraud [6], extremely cross-disciplinary authors in an author-paper graph [29], network intrusion detection [28], electronic auction fraud [10], and many others.
In addition to revealing suspicious, illegal and/or dangerous behavior, anomaly detection is useful for spotting rare events, as well as for the thankless, but absolutely vital task of data cleansing [12]. Moreover, anomaly detection is intimately related with the pattern and law discovery: unless the majority of our nodes closely obey a pattern (say, a power law), only then can we confidently consider as outliers the few nodes that deviate.
Most anomaly detection algorithms focus on clouds of multi-dimensional points, as we describe in the survey section. Our goal, on the other hand, is to spot strange nodes in a graph, with weighted edges. What patterns and laws do such graphs obey? What features should we extract from each node ? We propose to focus on neighborhoods, that is, a sphere, or a ball (hence the name OddBall) around each node(the ego): that is, for each node, we consider the induced sub-graph of its neighboring nodes, which is referred to as the egonet. Out of the huge number of numerical features one could extract from the egonet of a given node, we give a carefully chosen list, with features that are effective in revealing outliers. Thus, every node becomes a point in a low-dimensional feature space.
Main contributions of this work are: Of course, there are numerous types of anomalies -we discuss several of them in our technical report [2], but, for brevity, we focus on only the following major types (see Fig.1 for examples and Section 2 for the dataset description): 1. Near-cliques and stars: Those nodes whose neighbors are very well connected (near-cliques) or not connected (stars) turn out to be "strange": in most social networks, friends of friends are often friends, but either extreme (clique/star) is suspicious. 2. Heavy vicinities: If person i has contacted n distinct people in a who-callswhom network, we would expect that the number of phone calls (weight) would be a function of n. Extreme total weight for a given number of contacts n would be suspicious, indicating, e.g., faulty equipment that forces redialing. 3. Dominant heavy links: In the who-calls-whom scenario above, a very heavy single link in the 1-step neighborhood of person i is also suspicious, indicating, e.g., a stalker that keeps on calling only one of his/her contacts an excessive count of times.
The upcoming sections are as follows: We describe the datasets; the proposed method and observed patterns; the experimental results; prior work; and finally the conclusions.

Data Description
We studied several unipartite/bipartite, weighted/unweighted large real-world graphs in a variety of domains, described in detail in Table 1. Particularly, unipartite networks include the following: Postnet contains post-to-post links in a  set of blogs [21], Enron contains emails at Enron collected from about 1998 to 2002 (made public by the Federal Energy Regulatory Commission during its investigation), and Oregon contains AS peering information inferred from Oregon route-views BGP data. Bipartite networks include the following: Auth2Conf contains the publication records of authors to conferences from DBLP, and Don2Com and Com2Cand are from the U.S. Federal Election Commission in 2004 2 , a public record of donations between donors and committees and between committees and political candidates, respectively.
For Don2Com and Com2Cand, the weights on the edges are actual weights representing donation amounts in dollars. For the remaining weighted datasets, the edge weights are simply the number of occurrences of the edges. For instance, if post i contains k links to another post j, the weight of the edge e i,j is set to k.
In our study, we specifically focused on undirected graphs, but the ideas can easily be generalized to directed graphs.

Proposed Method
Borrowing terminology from social network analysis (SNA), "ego" is an individual node.
Informally, an ego (=node) of a given network is anomalous if its neighborhood significantly differs from those of others. The basic research questions are: (a) what features should we use to characterize a neighborhood? and (b) what does a 'normal' neighborhood look like?
Both questions are open-ended, but we give some answers below. First, let's define terminology: the "k -step neighborhood" of node i is the collection of node i, all its k-step-away nodes, and all the connections among all of these nodesformally, this is the "induced sub-graph". In SNA, the 1-step neighborhood of a node is specifically known as its "egonet".
How should we choose the value of k steps to study neighborhoods? Given that real-world graphs have small diameter [3], we need to stay with small values of k, and specifically, we recommend k=1. We report our findings only for k=1, because using k > 1 does not provide any more intuitive or revealing information, while it has heavy computational overhead, possibly intractable for very large graphs.

Feature Extraction
The first of our two inter-twined questions is which statistics/features to extract from a neighborhood.
There is an infinite set of functions/features that we could use to characterize a neighborhood (number of nodes, one or more eigenvalues, number of triangles, effective radius of the central node, number of neighbors of degree 1, etc etc). Which of all should we use?
Intuitively, we want to select features that (a) are fast-to-compute and (b) will lead us to patterns/laws that most nodes obey, except for a few anomalous nodes. We spend a lot of time experimenting with about a dozen features, trying to see whether the nodes of real graphs obey any patterns with respect to those features (see our technical report [2]). The majority of features lead to no obvious patterns, and thus we do not present them.
The trimmed-down set of features that are very successful in spotting patterns, are the following: The next question is how to look for outliers, in such an n-dimensional feature space, with one point for each node of the graph. In our case, n=4, but one might have more features depending on the application and types of anomalies one wants to detect. A quick answer to this would be to use traditional outlier detection methods for clouds of points using all the features.
In our setting, we can do better. As we show next, we group features into carefully chosen pairs, where we show that there are patterns of normal behavior (typically, power-laws). We flag those points that significantly deviate from the discovered patterns as anomalous. Among the numerous pairs of features we studied, the successful pairs and the corresponding type of anomaly are the following: -E vs N : CliqueStar : detects near-cliques and stars -W vs E: HeavyVicinity: detects many recurrences of interactions -λ w vs W : DominantPair : detects single dominating heavy edge (strongly connected pair)

Laws and Observations
The second of our research questions is what do normal neighborhoods look like. Thus, it is important to find patterns ("laws") for neighborhoods of real graphs, and then report the deviations, if any. In this work, we report some new, surprising patterns: For a given graph G, node i ∈ V(G), and the egonet G i of node i; Observation 1 (EDPL: Egonet Density Power Law) the number of nodes N i and the number of edges E i of G i follow a power law.
In our experiments the EDPL exponent α ranged from 1.10 to 1.66. Fig. 2 illustrates this observation, for several of our datasets. Plots show E i versus N i for every node (green points); the black circles are the median values for each bucket of points (separated by vertical dotted lines) after applying logarithmic binning on the x-axis as in [23]; the red line is the least squares(LS) fit on the median points. The plots also show a blue line of slope 2, that corresponds to cliques, and a black line of slope 1, that corresponds to stars. All the plots are in log-log scales.
Observation 2 (EWPL: Egonet Weight Power Law) the total weight W i and the number of edges E i of G i follow a power law. Fig. 3 shows the EWPL for (only a subset of) our datasets (due to space limit). In our experiments the EWPL exponent β ranged up to 1.29. Values of β > 1 indicate super-linear growth in the total weight with respect to increasing total edge count in the egonet.
Observation 3 (ELWPL: Egonet λ w Power Law) the principal eigenvalue λ w,i of the weighted adjacency matrix and the total weight W i of G i follow a power law. Fig. 4 shows the ELWPL for a subset of our datasets. In our experiments the ELWPL exponent γ ranged from 0.53 to 0.98. γ=0.5 indicates uniform weight distribution whereas γ˜1 indicates a dominant heavy edge in the egonet, in which case the weighted eigenvalue closely follows the maximum edge weight. γ=1 if the egonet contains only one edge.
Observation 4 (ERPL: Egonet Rank Power Law) the rank R i,j and the weight W i,j of edge j in G i follow a power law.
Here, R i,j is the rank of edge j in the sorted list of edge weights. ERPL suggests that edge weights in the egonet have a skewed distribution. This is intuitive; for example in a friendship network, a person could have many not-so-close friends (light links), but only a few close friends (heavy links).
Next we show that if the ERPL holds, then the EWPL also holds. Given an egonet G i , the total weight W i and the number of edges E i of G i , let W i denote the ordered set of weights of the edges, W i,j denote the weight of edge j, and R i,j denote the rank of weight W i,j in set W i . Then, Proof. For brevity, we give the proof for θ < −1 -other cases are similar. If W i,j = cR θ i,j , then W min = cE θ i -the least heavy edge l with weight W min is ranked the last, i.e. R i,l = E i . Thus we can write W i as For sufficiently large E i and given θ < −1, the second term in parenthesis goes to 0. Therefore;

Anomaly Detection
We can easily use the observations given in part 3.2 in anomaly detection since anomalous nodes would behave away from the normal pattern. Let us define the y-value of a node i as y i and similarly, let x i denote the x-value of node i for a particular feature pair f (x, y). Given the power law equation y = Cx θ for f (x, y), we define the outlierness score of node i to be Intuitively, the above measure is the "distance to fitting line". Here we penalize each node with both the number of times that y i deviates from its expected value Cx θ i given x i , and with the logarithm of the amount of deviation. This way, the minimum outlierness score becomes 0 -for which the actual value y i is equal to the expected value Cx θ i . This simple and easy-to-compute method not only helps in detecting outliers, but also provides a way to sort the nodes according to their outlierness scores. However, this method is prone to miss some outliers and therefore could yield false negatives for the following reason: Assume that there exist some points that are far away from the remaining points but that are still located close to the fitting line. In our experiments with real data, we observe that this usually happens for high values of x and y. For example, in Fig. 2(a), the points marked with left-triangles ( ) are almost on the fitting line even though they are far away from the rest of the points.
We want to flag both types of points as outliers, and thus we propose to combine our heuristic with a density-based outlier detection technique. We used LOF [7], which also assigns outlierness scores out-lof(i) to data points; but any other outlier detection method would do, as long as it gives such a score. To obtain the final outlierness score of a data point i, one might use several methods such as taking a linear function of both scores and ranking the nodes according to the new score, or merging the two ranked lists of nodes, each sorted on a different score. In our work, we simply used the sum of the two normalized(by dividing by the maximum) scores, that is, out-score(i) = out-line(i)+out-lof(i).

Experimental Results
CliqueStar Here, we are interested in the communities that the neighbors of a node form. In particular, CliqueStar detects anomalies having to do with nearcliques and near-stars. Using CliqueStar, we were successful in detecting many anomalies over the unipartite datasets (although it is irrelevant for bipartite graphs since by nature the egonet forms a "star").
In social media data Postnet, we detected posts or blogs that had either all their neighbors connected (cliques) or mostly disconnected (stars). We show some illustrative examples along with descriptions from Postnet in Fig. 1. See  Fig.2a for the detected outliers on the scatter-plot from the same dataset.
In Enron (Fig.2b), the node with the highest anomaly score turns out to be "Kenneth Lay", who was the CEO and is best known for his role in the Enron scandal in 2001. Our method reveals that none of his over 1K contacts ever sent emails to each other.
In Oregon (Fig.2c), the top outliers are the three large ISPs ("Verizon", "Sprint" and "AT&T"). HeavyVicinity In our datasets, HeavyVicinity detected "heavy egonets", with considerably high total edge weight compared to the number of edges. We mark the anomalies in Fig.3 for several of our datasets. See [2] for results on all the datasets and further discussions. In Com2Cand (Fig.3a), we see that "Democratic National Committee" gave away a lot of money compared to the number of candidates that it donated to. In addition, "(John) Kerry Victory 2004" donated a large amount to a single candidate, whereas "Liberty Congressional Political Action Committee" donated a very small amount ($5), again to a single candidate. Looking at the Candidates plot for the same bipartite graph (Fig.3b), we also flagged "Aaron Russo", the lone recipient of that PAC. (In fact, Aaron Russo is the founder of the Constitution Party which never ran any candidates, and Russo shut it down after 18 months.) In Don2Com (Fig.3c), we see that "Bush-Cheney '04 Inc." received a lot of money from a single donor. On the other hand, we notice that the "Kerry Committee" received less money than would be expected looking at the number of checks it received in total. Further analysis shows that most of the edges in its egonet are of weight 0, showing that most of the donations to that committee have actually been returned.  DominantPair Here, we find out whether there is a single dominant heavy edge in the egonet. In other words, this method detected "bursty" if not exclusive edges.
In Postnet (Fig.4a) nodes such as "ThinkProgress"'s post on a leak scandal 3 and "A Freethinker's Paradise" post 4 linking several times to the "ThinkProgress" post were both flagged. On another note, the slope of the fitting line is close to 0.5, pointing to uniform weight distribution in egonets overall. This is expected as most posts link to other posts only once.
In Com2Cand (Fig.4b), "Democratic National Committee" is one of the top outliers. We would guess that the single large amount of donation was made to "John F. Kerry". Counterintuitively, however, we see that that amount was spent for an opposing advertisement against "George W. Bush". DominantPair flagged extremely focused authors (those publish heavily to one conference) in the DBLP data, shown in Fig.3c. For instance, "Toshio Fukuda" has 115 papers in 17 conferences (at the time of data collection), with more than half (87) of his papers in one particular conference (ICRA). In addition, "Averill M. Law" has 40 papers published to the "Winter Simulation Conference" and nowhere else. On the other extreme, another interesting point is "Wei Li", with many papers, who gets them published to as many distinct conferences, probably once or twice to each conference (uniform rather than 'bursty' distribution).
See [2] for results on all the datasets and further discussions.

Outlier Detection
Outlier detection has attracted wide interest, being a difficult problem, despite its apparent simplicity. Even the definition of the outlier is hard to give: For instance, Hawkins [16] defines an outlier as "an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism." Similar, but not identical, definitions have been given by Barnett and Lewis [5], and Johnson [19]. Outlier detection methods form two classes, parametric (statistical) and nonparametric (model-free). The former includes statistical methods that assume prior knowledge of the underlying data distribution [5,16]. The latter class includes distance-based and density-based data mining methods. These methods typically define as an outlier the (n-D) point that is too far away from the rest, and thus lives in a low-density area [20]. Typical methods include LOF [7] and LOCI [27]. These methods not only flag a point as an outlier but they also give outlierness scores; thus, they can sort the points according to their "strangeness". Many other density-based methods especially for large high-dimensional data sets are proposed in [1,4,11,15]. Finally, most clustering algorithms [9,17,25] reveal outliers as a by-product.

Anomaly Detection in Graph Data
Noble and Cook [26] detect anomalous sub-graphs using variants of the Minimum Description Length (MDL) principle. Eberle and Holder [13] use MDL as well as other probabilistic measures to detect several types of anomalies (e.g. unexpected/missing nodes/edges). Frequent subgraph mining [18,30] is used to detect non-crashing bugs in software flow graphs [22]. Chakrabarti [8] uses MDL to spot anomalous edges. Sun et al. [29] use proximity and random walks, to assess the normality of nodes in bipartite graphs. OutRank and LOADED [14,24] use similarity graphs of objects to detect outliers.
In contrast to the above, we work with unlabeled graphs. We explicitly focus on nodes, while interactions are also considered implicitly as we study neighborhood sub-graphs. Finally, we consider both bipartite and unipartite graphs as well as edge weights.

Conclusion
This is one of the few papers that focus on anomaly detection in graph data, including weighted graphs. We propose to use "egonets", that is, the induced subgraph of the node of interest and its neighbors; and we give a small, carefully designed list of numerical features for egonets. The major contributions are the following: 1. Discovery of new patterns that egonets follow, such as patterns in density (Obs.1: EDPL), weights (Obs.2: EWPL), principal eigenvalues (Obs.3: EL-WPL), and ranks (Obs.4: ERPL). Proof of Lemma 1, linking the ERPL to the EWPL. 2. OddBall, a fast, un-supervised method to detect abnormal nodes in weighted graphs. Our method does not require any user-defined constants. It also assigns an "outlierness" score to each node. 3. Experiments on real graphs of over 1M nodes, where OddBall reveals nodes that indeed have strange or extreme behavior.
Future work could generalize OddBall to time-evolving graphs, where the challenge is to find patterns that neighborhood sub-graphs follow and to extract features incrementally over time.