A Random-Surfer Web-Graph Model

In this paper we provide theoretical and experimental results on a random-surfer model for construction of a random graph. In this model, a new node connects to the existing graph by choosing a start node at random and then performing a short random walk. We show that in certain formulations, this results in the same distribution as the preferential-attachment random-graph model, and in others we give a direct analysis of power-law distribution of degrees or “virtual degrees” of the resulting graphs. We also present experimental results for a number of settings of parameters that we are not able to analyze mathematically.


Introduction
There has been substantial work in recent years on the preferential attachment random graph model. In this model, a graph is constructed in the following manner. Nodes arrive one at a time, and each new node makes k connections to the existing graph. However, unlike classic random graph models, these connections are not made uniformly at random, but rather with probability proportional to the degree of existing nodes in the graph. This process is known to produce graphs with a power law degree distribution [2] and that have high conductance [15], and has been proposed as a model for graphs such as the graph of links between pages on the World Wide Web.
A natural question that arises when considering the preferential attachment model is why: why should a new node connect to existing nodes with probability proportional to their degree? Is it because we imagine that high degree nodes are "better" (and the degree of a node is an indicator of its quality) or is it for some other reason?
The starting point for this paper is the observation that a simple "random surfer" model provides a natural explanation for preferential attachment. In particular, imagine that each new node (a person setting up their web page) places k links into the existing graph by picking a random start node and then randomly surfing the web until it finds k interesting pages to connect to. Imagine also that each page is equally likely to be interesting to the surfer and each link is bidirectional (so we have an undirected graph). Then, if the probability p of a page being "interesting" is sufficiently small, these connections will be made (approximately) according to the stationary distribution of the walk, which is exactly the preferential attachment distribution. Furthermore, since such graphs have high conductance [15], one should not need an extremely low value of p for this to hold. Thus, preferential attachment may arise even if all nodes are in a sense "equally good", and differences between degrees may not necessarily be an indicator of differences in inherent quality.
* Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213. {avrim, hubert, rweba} at cs.cmu.edu.
With this as motivation, in this paper we propose and analyze several "random surfer" models for graph construction. We also give a number of experimental results, both for models we know how to analyze and for several that we do not. Interestingly, the models we are best able to analyze in this setting are all directed graph models, rather than undirected models such as the one described above. In addition, some of these models can be thought of as making a bridge between the preferential attachment model and the copying model of [13].

Random Surfer Models
In this section, we describe several random surfer models that we will examine in the rest of the paper. In each model, nodes arrive one at a time, making $k$ connections to the existing graph. In some models these connections will be viewed as directed edges, and in some as undirected edges. All our models begin with a single start node $v_0$ having $k$ self-loops. In general, we use $v_t$ to denote the vertex added in the $t$-th step, and $n$ for the total number of vertices.
To motivate our first model, note that if the connections to the existing graph are made uniformly at random, then we have an online version of the standard Erdős-Rényi graph model, and with high probability the maximum degree will be $O(\log n)$. On the other hand, suppose we make each connection by first picking a random start node in the existing graph, and then taking a random walk of exactly one step. Then, in the directed case, this will just produce a star (all edges will point to the root $v_0$), and in the undirected case, it is not hard to show that there is a good chance this produces something star-like of maximum degree $\Omega(n)$. However, if we flip a coin and with probability $p \in (0, 1)$ connect to the random start and with probability $1 - p$ take a 1-step walk, then we get something much more natural.
MODEL 1. (1-STEP WALK WITH SELF-LOOP) In this model, we are given parameters $k$ and $p$. At time $t$, vertex $v_t$ makes $k$ connections to the existing graph by repeating the following process $k$ times:
1. Pick an existing node $v$ uniformly at random from $\{v_0, \ldots, v_{t-1}\}$.
2. With probability $p$ stay at $v$; with probability $1 - p$ take a 1-step walk to a random neighbor of $v$.
3. Add an edge from $v_t$ to the current node.
In the directed version, the edges added are directed from v t into the existing graph. In the undirected version, edges are undirected.
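The three steps of Model 1 can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the experiments: the function name `model1`, its return values, and the convention that each self-loop of $v_0$ appears once in its neighbor list are all assumptions made here.

```python
import random

def model1(n, k, p, directed, rng=None):
    """Grow an n-node graph by Model 1; return (edges, neighbors)."""
    rng = rng or random.Random()
    # neighbors[v] lists the nodes a walk standing at v may step to.
    # Assumption: each of v_0's k self-loops appears once in its list.
    neighbors = [[0] * k]
    edges = [(0, 0)] * k
    for t in range(1, n):
        chosen = []
        for _ in range(k):
            v = rng.randrange(t)              # step 1: uniform start node
            if rng.random() >= p:             # step 2: with probability 1 - p ...
                v = rng.choice(neighbors[v])  # ... take one step to a neighbor
            chosen.append(v)                  # step 3: v_t will connect to v
        # v_t joins the graph only after all k targets have been chosen.
        neighbors.append(list(chosen))
        for v in chosen:
            edges.append((t, v))
            if not directed:
                neighbors[v].append(t)        # undirected: v can step to v_t
    return edges, neighbors
```

In the directed case a walk only follows out-edges, so `neighbors[v]` always has exactly `k` entries; in the undirected case the lists grow as later nodes attach.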
Our next model is a walk of the form given in the Introduction: instead of taking one step, we keep walking until we find a node of interest and then connect there. In order to make the model easier to think about, for the case k > 1 we imagine after each connection we re-start at a new random start node when performing the next walk.

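MODEL 2. (RANDOM WALK WITH COIN FLIPPING) In this model, we are given parameters $k$ and $p$. At time $t$, vertex $v_t$ makes $k$ connections to the existing graph by repeating the following process $k$ times:
1. Pick an existing node uniformly at random from $\{v_0, \ldots, v_{t-1}\}$ as the current node.
2. Flip a coin of bias $p$.
3. If the coin comes up heads, add an edge from $v_t$ to the current node and stop.
4. If the coin comes up tails, move to a random neighbor of the current node and go back to (2).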
In the directed version, the edges added are directed from v t into the existing graph. In the undirected version, edges are undirected.
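A minimal sketch of the coin-flipping walk as described in this section (uniform random start, stop with probability $p$, otherwise move to a random neighbor, fresh start node for each of the $k$ connections). The name `model2` and the representation of $v_0$'s self-loops as entries in its own neighbor list are assumptions of this sketch.

```python
import random

def model2(n, k, p, directed, rng=None):
    """Grow an n-node graph by the coin-flipping walk; return (edges, neighbors)."""
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1] so that every walk terminates")
    rng = rng or random.Random()
    neighbors = [[0] * k]          # assumption: v_0's self-loops listed once each
    edges = [(0, 0)] * k
    for t in range(1, n):
        chosen = []
        for _ in range(k):
            v = rng.randrange(t)               # new uniform start for each walk
            while rng.random() >= p:           # tails: keep walking
                v = rng.choice(neighbors[v])
            chosen.append(v)                   # heads: connect v_t here
        neighbors.append(list(chosen))
        for v in chosen:
            edges.append((t, v))
            if not directed:
                neighbors[v].append(t)
    return edges, neighbors
```

The expected walk length is $(1 - p)/p$ steps, so small values of $p$ make each connection more expensive to simulate.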

Directed Walk with Self-Loop.
Our first (simple) result, Theorem 3.1, is that the directed version of Model 1 with $p = 1/2$ is exactly the preferential attachment model.
Proof. First, notice that the graph is necessarily a DAG, with all edges pointing backwards in time, and each vertex has an out-degree of $k$. Now, consider some vertex $u$ in the existing graph with in-degree $d_u$. An edge from the new vertex $v_t$ will connect to $u$ if either the process chooses $u$ as the start node of its walk and does not take a step, or else it chooses one of $u$'s in-neighbors $u'$ as the start node and does take a step, selecting the edge from $u'$ to $u$. The first case has probability $p/t$, and the second case has probability $(1 - p)d_u/(kt)$, since each of the $d_u$ in-edges of $u$ is followed with probability $(1/t)(1 - p)(1/k)$. For $p = 1/2$, the sum of these two quantities is $(k + d_u)/(2kt)$, which is exactly proportional to the total degree $k + d_u$ of $u$.
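The calculation in the proof can be checked exactly on a small example. The sketch below grows a directed Model 1 graph and compares the attachment probability of every node with $(k + d_u)/(2kt)$, using rational arithmetic. The function names are illustrative, and the sketch assumes that $v_0$'s $k$ self-loops count as both out-edges and in-edges of $v_0$, which is what makes the probabilities sum to 1.

```python
import random
from fractions import Fraction

def grow_directed_model1(n, k, p, rng):
    """Out-neighbor lists of a directed Model 1 graph on n nodes."""
    out = [[0] * k]                      # v_0 starts with k self-loops
    for t in range(1, n):
        targets = []
        for _ in range(k):
            v = rng.randrange(t)
            if rng.random() >= p:
                v = rng.choice(out[v])
            targets.append(v)
        out.append(targets)
    return out

def attachment_probabilities(out, p):
    """Exact probability that one new connection lands on each node."""
    t, k = len(out), len(out[0])
    prob = [Fraction(0)] * t
    for s in range(t):                   # s is the uniform start node
        prob[s] += Fraction(1, t) * p    # stay at s
        for w in out[s]:                 # step along one of the k out-edges
            prob[w] += Fraction(1, t) * (1 - p) * Fraction(1, k)
    return prob

def preferential_probabilities(out):
    """(k + d_u) / (2kt) for every node u, with d_u the in-degree."""
    t, k = len(out), len(out[0])
    indeg = [0] * t
    for s in range(t):
        for w in out[s]:
            indeg[w] += 1
    return [Fraction(k + d, 2 * k * t) for d in indeg]
```

For $p = 1/2$ the two lists agree entry by entry on any graph the generator produces; for other values of $p$ they generally differ.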
One implication of Theorem 3.1 is that for $p > 1/2$, the model is a mixture of preferential-attachment and uniform-random connections. That is, the case $p > 1/2$ can be viewed as: with probability $2p - 1$ choose a neighbor for the new node uniformly at random, and with the remaining probability $2(1 - p)$ choose a neighbor with probability proportional to degree. (The second branch is Model 1 with $p = 1/2$, so the overall probability of staying at the start node is $(2p - 1) + 2(1 - p) \cdot 1/2 = p$, as required.) This process is known to produce power-law degree distributions. For general $p \in (0, 1)$, we now give an argument for power-law degree distributions from first principles.
Let $d_i(t)$ be the number of nodes with in-degree $i$ at step $t$, and $D_i(t)$ be the expectation of $d_i(t)$. We now analyze $D_i(t)$ via the following equation.
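The displayed recurrence is not reproduced in this text. A form consistent with the term-by-term description that follows, and with the connection probabilities $p/t$ and $(1 - p)d_u/(kt)$ from the proof of Theorem 3.1, would be the following sketch for $i \ge 1$. The placement of the labels (3.1)-(3.3) is an assumption made to match that description.

```latex
\begin{align}
D_i(t+1) = {} & D_i(t) \tag{3.1}\\
 & + k\,p\,\frac{D_{i-1}(t) - D_i(t)}{t} \tag{3.2}\\
 & + k\,(1-p)\,\frac{1}{k}\,\frac{(i-1)\,D_{i-1}(t) - i\,D_i(t)}{t} \tag{3.3}
\end{align}
```

For $i = 0$ the $D_{i-1}$ terms are absent and a $+1$ is added for the newly arrived node, which has in-degree 0.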
Observe that the number of nodes with in-degree $i$ increases if the new node connects to an existing node of in-degree $i - 1$ and decreases if the new node connects to one of in-degree $i$. The term in (3.2) is due to the fact that with probability $p$ the new node is connected to an existing node picked uniformly at random. The term in (3.3) corresponds to the case when, with probability $1 - p$, the new node connects to a random out-going neighbor of a randomly picked node. The factor $k$ appears in both (3.2) and (3.3) because each new node makes $k$ connections to the existing nodes. The factor $1/k$ appears only in (3.3) because in the case where a random out-going neighbor is chosen, there are $k$ possible choices. We assume that, for large enough $t$, a new node does not make more than one connection to any one existing node.
THEOREM 3.2. There exists a constant $C > 0$ such that as $t$ tends to infinity,
Proof. Using the above equations, the proof follows directly from the techniques of Kumar et al. [13], Cooper and Frieze [10], and Mitzenmacher [16], which allow one to determine the asymptotic behavior of $D_i(t)$.
In particular, for each $i$, we make the substitution $D_i(t) = c_i t$ in (3.1)-(3.3) to obtain the following equation.
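As a numerical illustration, suppose the recurrence reads $D_i(t+1) = D_i(t) + \frac{kp}{t}(D_{i-1}(t) - D_i(t)) + \frac{1-p}{t}((i-1)D_{i-1}(t) - iD_i(t))$ for $i \ge 1$, with an extra $+1$ in the $i = 0$ equation for the arriving node. This is an assumed reading of (3.1)-(3.3), not a formula quoted from the text. Substituting $D_i(t) = c_i t$ then gives $c_0 = 1/(1 + kp)$ and $c_i = c_{i-1}\,\frac{kp + (1-p)(i-1)}{1 + kp + (1-p)i}$, whose tail would decay like $i^{-(2-p)/(1-p)}$. The sketch below (function names are illustrative) iterates this relation and estimates the log-log slope.

```python
import math

def stationary_fractions(p, k, max_degree):
    """Return [c_0, ..., c_max_degree] under the assumed recurrence."""
    c = [1.0 / (1.0 + k * p)]
    for i in range(1, max_degree + 1):
        ratio = (k * p + (1 - p) * (i - 1)) / (1 + k * p + (1 - p) * i)
        c.append(c[-1] * ratio)
    return c

def tail_exponent(c, lo, hi):
    """Slope of log c_i against log i between indices lo and hi."""
    return (math.log(c[hi]) - math.log(c[lo])) / (math.log(hi) - math.log(lo))

if __name__ == "__main__":
    for p in (0.25, 0.5, 0.75):
        c = stationary_fractions(p, k=1, max_degree=4000)
        # estimated slope next to the predicted exponent -(2 - p)/(1 - p)
        print(p, tail_exponent(c, 2000, 4000), -(2 - p) / (1 - p))
```

For $p = 1/2$ the predicted exponent is $-3$, which matches the known exponent for preferential attachment and is consistent with Theorem 3.1.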
Moreover, using Theorem 4 of [10], one can also show that $d_i(t)$ is concentrated around its mean, as stated in the following theorem.

Directed Walk with Coin Flipping.
We now consider the directed case of Model 2, for the case k = 1. That is, we connect a new node to the existing graph by picking a start node u uniformly at random, and then performing a random walk, where at each step we halt the walk with probability p. Since k = 1, we can view the random graph constructed as a tree, in which the initial node is the root and every other node has an edge directed to its parent.
To analyze this walk, we define a notion of the virtual degree of a node that is related to the node's actual degree, but also contains terms for the local neighborhood of the node as well. We then prove that for this definition, at each step the expected increase in virtual degree of any given node is proportional to the virtual degree itself. (The virtual degree itself is a fractional quantity, and at each step will change by at most some constant.) Using this, we can show that the expected virtual degrees follow a power-law, and we can also give some bounds on their concentration about their means. Moreover, we can give a crude lower bound on the expected real degree of a given node, which is comparable to its expected virtual degree.
However, our concentration bounds are not sharp enough to give a true proof that the virtual degrees, or the real degrees, follow the power law.
DEFINITION 1. Suppose $u$ is a node in the tree. For $i \ge 0$, denote by $L_i(u)$ the set of level-$i$ descendants of $u$, and let $l_i(u) = |L_i(u)|$. For instance, $L_0(u)$ is the set of children, $L_1(u)$ is the set of grandchildren, and so on. Let $\beta = \{\beta_i\}_{i \ge 0}$ be a sequence of real numbers such that $\beta_0 = 1$. The virtual degree of $u$ with respect to $\beta$ is
$$\nu(u) = 1 + \sum_{i \ge 0} \beta_i \, l_i(u).$$
In the definition of virtual degree $\nu(u)$, the leading term 1 corresponds to the parent of $u$. We require $\beta_0 = 1$, for each child of $u$ should contribute 1 towards the degree of $u$. We would like the virtual degree to reflect the actual degree of a node, and hence ideally, for $i \ge 1$, we would like $\beta_i$ to be small. On the other hand, we also want the expected increase in the virtual degree $\nu(u)$ of node $u$ in each step to be proportional to its current virtual degree. The following theorem states that we can satisfy these requirements simultaneously.
Proof. We fix the coin flipping probability p and find some sequence β that satisfies the requirements.
For convenience, we denote $q = 1 - p$ and $L_{-1}(u) = \{u\}$. Then, for $i \ge 0$, if a new connection is made to a node in $L_{i-1}(u)$, then the increase in $\nu(u)$ is $\beta_i$.
Fix $i \ge 0$. We first calculate the probability that a new connection is made to a node in $L_{i-1}(u)$. Recall that we first pick a node uniformly at random to start the directed random walk. If we end up making a new connection to a node in $L_{i-1}(u)$, we must have begun the random walk at some node in $L_{i-1+j}(u)$, for some $j \ge 0$.
We fix some $j \ge 0$ and calculate the probability that the random walk starts at some node in $L_{i-1+j}(u)$ and ends up at some node in $L_{i-1}(u)$. Note that there are $l_{i-1+j}(u)$ nodes to start from and there are $j$ hops to be made. Hence, the probability is $\frac{l_{i-1+j}(u)}{t} \cdot q^j \cdot p$.
It follows that the probability that a new connection is made to some node in $L_{i-1}(u)$ is $\frac{p}{t} \sum_{j \ge 0} q^j \, l_{i-1+j}(u)$. Hence, the expected increase in $\nu(u)$ from step $t$ to step $t + 1$ is
$$\sum_{i \ge 0} \beta_i \cdot \frac{p}{t} \sum_{j \ge 0} q^j \, l_{i-1+j}(u).$$
Recall that we wish this quantity to be proportional to the current virtual degree $\nu(u)$. Hence, it suffices to find a sequence $\beta$ such that the corresponding coefficients of $l_k(u)$ are equal.
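To make the coefficient matching concrete, assume the target is $\frac{p}{t}\nu(u)$; the constant $p$ is an assumption here, chosen to agree with the growth rate $\Theta((n/i)^p)$ quoted later in this section. Equating the coefficients of $l_k(u)$ then gives $\beta_k = \sum_{i=0}^{k+1} \beta_i q^{k+1-i}$, which with $\beta_0 = 1$ simplifies to $\beta_1 = p$ and $\beta_{k+1} = \beta_k - q\,\beta_{k-1}$. The sketch below checks this exactly on a small tree stored as a hypothetical parent array (`parent[0] == 0` for the root), for a non-root node $u$; all function names are illustrative.

```python
from fractions import Fraction

def beta_sequence(p, length):
    """beta_0 = 1, beta_1 = p, beta_{k+1} = beta_k - q * beta_{k-1}."""
    q = 1 - p
    beta = [Fraction(1), Fraction(p)]
    while len(beta) < length:
        beta.append(beta[-1] - q * beta[-2])
    return beta[:length]

def depth_below(parent, u, w):
    """Number of hops from w up to u, or None if u is not an ancestor of w."""
    hops = 0
    while w != u:
        if parent[w] == w:      # reached the root without meeting u
            return None
        w = parent[w]
        hops += 1
    return hops

def virtual_degree(parent, u, beta):
    """nu(u) = 1 + sum_i beta_i * l_i(u); a node h hops below u lies in L_{h-1}(u)."""
    nu = Fraction(1)
    for w in range(len(parent)):
        h = depth_below(parent, u, w)
        if h:                   # h >= 1: w is a proper descendant of u
            nu += beta[h - 1]
    return nu

def expected_increase(parent, u, p, beta):
    """Exact expected one-step change of nu(u) for the directed walk with k = 1."""
    t, q = len(parent), 1 - p
    total = Fraction(0)
    for s in range(t):          # s is the uniformly chosen start node
        h = depth_below(parent, u, s)
        if h is None:
            continue            # walks from outside u's subtree never enter it
        for m in range(h + 1):  # the walk stops m hops below u after h - m moves
            total += Fraction(1, t) * q ** (h - m) * p * beta[m]
    return total
```

With this choice of $\beta$, `expected_increase` equals `p * virtual_degree / t` on any tree and any non-root $u$. The root is excluded because a walk that reaches it keeps following the self-loop rather than leaving the subtree.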
For the rest of the discussion, we consider the virtual degree defined with respect to some sequence $\beta$ that satisfies Theorem 3.4. We next explore how the virtual degree of a particular node changes with time. Define $\nu_t(u)$ to be the virtual degree of node $u$ at step $t$ and $t_u$ to be the time when node $u$ first appears. Then, it follows that $\nu_{t_u}(u) = 1$, since each new node is a leaf when it first appears.

Proof. For any $t > t_u$, we have from Theorem 3.4 that
$$E[\nu_t(u)] = \left(1 + \frac{p}{t-1}\right) E[\nu_{t-1}(u)].$$
Hence,
$$E[\nu_t(u)] = \prod_{i=t_u}^{t-1} \left(1 + \frac{p}{i}\right) = \Theta\left((t/t_u)^p\right).$$
We next give an intuition, similar in spirit to [3], of how Theorem 3.5 suggests that the virtual degrees of the random graph should follow a power law. Suppose the random process is run for $n$ steps to form a random graph with $n$ nodes. Then, from Theorem 3.5, the expected virtual degree of the $i$th node joining the graph is $\Theta((n/i)^p)$. If we let $\kappa \approx \Theta((n/i)^p)$, we would have $i \approx \Theta(n\kappa^{-1/p})$. Observing that nodes joining later should probably have smaller virtual degrees, one might expect the proportion of nodes having virtual degree smaller than $\kappa$ to be $1 - \Theta(\kappa^{-1/p})$. Differentiating this quantity with respect to $\kappa$, we conjecture that the proportion of nodes having degree $\kappa$ should be proportional to $\kappa^{-(1/p+1)}$.
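The heuristic can be checked numerically under the assumption that the expected virtual degree of the node arriving at time $i$ is exactly $\prod_{j=i}^{n-1}(1 + p/j)$. This is an assumed form, consistent with the $\Theta((n/i)^p)$ rate above, and the function names are illustrative. The sketch evaluates the product and counts how many arrival times have expected virtual degree at least $\kappa$, which the argument predicts to be a fraction of about $\kappa^{-1/p}$.

```python
import math

def expected_virtual_degree(i, n, p):
    """prod_{j=i}^{n-1} (1 + p/j), computed via log-gamma for stability."""
    return math.exp(math.lgamma(n + p) - math.lgamma(n)
                    - math.lgamma(i + p) + math.lgamma(i))

def fraction_at_least(kappa, n, p):
    """Fraction of arrival times i in 1..n-1 with expected virtual degree >= kappa."""
    count = sum(1 for i in range(1, n)
                if expected_virtual_degree(i, n, p) >= kappa)
    return count / (n - 1)

if __name__ == "__main__":
    n, p = 20000, 0.5
    for kappa in (2.0, 4.0, 8.0):
        # measured fraction next to the heuristic prediction kappa^(-1/p)
        print(kappa, fraction_at_least(kappa, n, p), kappa ** (-1.0 / p))
```

This only concerns expected values; it says nothing about how tightly the actual virtual degrees concentrate around them.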
Unfortunately, we do not have a strong enough concentration bound that would allow us to make the above intuition rigorous. However, using martingale techniques, we can show that the virtual degree cannot be too much larger than its mean for the case when the coin flipping probability $p > 1/2$.
THEOREM 3.6. There exists a constant $C > 0$ such that for coin flipping probability $p > 1/2$ and any $\rho \ge 1$,
Proof. Consider a node $u$ and recall that $t_u$ is the time when it first appears. Define $a_i = 1 + p/i$. Recall from the proof of Theorem 3 Recall that the sequence $\{\beta_k\}$ tends to zero. Hence, it follows that $|\nu_i(u) - \nu_{i-1}(u)| = \Theta(1)$, and we have , and so $|D_i| \le K_i$. By the Azuma-Hoeffding martingale inequality, we have for any $x > 0$, Observe that for $p > 1/2$, we have Hence, for some large enough Observing that where $C > 0$ is a constant large enough to absorb the 1.

A Crude lower bound for the expected real degree.
Recall that for a given node $u$ in the tree and $i \ge 0$, $L_i(u)$ is the set of level-$i$ descendants of $u$ and $l_i(u) = |L_i(u)|$. In particular, $l_0(u)$ is the number of children node $u$ has. We can give a crude lower bound for $l_0(u)$ for any given node $u$.
Proof. Let the number of level-$i$ descendants of node $u$ at time step $t$ be $l_i^t(u)$. It follows that Suppose that for some constant $A > 0$, for some $t > 0$, and $\alpha$, we have $E[l_0^t(u)] \ge At^\alpha$. Observing that for $t \ge 1$, Note that for $t = t_u + 1$, $E[l_0^t(u)] = \Theta(1)$. Hence, it follows that $E[l_0^t(u)] \ge \Omega((t/t_u)^{p(1-p)})$.
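One way to see where the exponent $p(1-p)$ can come from is the following sketch. It is a reading of the argument under the conventions $q = 1 - p$ and $l^t_{-1}(u) = 1$, for a non-root node $u$, and is not necessarily the original derivation. Keeping only the $j = 0$ and $j = 1$ terms of the attachment probability, and iterating from the first step at which $u$ is present and has no children, gives:

```latex
\begin{align*}
\Pr[\text{new node attaches to } u]
  &= \frac{p}{t}\sum_{j \ge 0} q^{j}\, l^{t}_{j-1}(u)
   \;\ge\; \frac{p}{t}\bigl(1 + q\, l^{t}_{0}(u)\bigr),\\
E\bigl[l^{t+1}_{0}(u)\bigr]
  &\ge \Bigl(1 + \frac{pq}{t}\Bigr) E\bigl[l^{t}_{0}(u)\bigr] + \frac{p}{t},\\
E\bigl[l^{t}_{0}(u)\bigr] + \frac{1}{q}
  &\ge \frac{1}{q}\prod_{i=t_u+1}^{t-1}\Bigl(1 + \frac{pq}{i}\Bigr)
   = \frac{1}{q}\,\Theta\bigl((t/t_u)^{pq}\bigr).
\end{align*}
```

The last line uses the identity $\bigl(1 + \frac{pq}{t}\bigr)x + \frac{p}{t} + \frac{1}{q} = \bigl(1 + \frac{pq}{t}\bigr)\bigl(x + \frac{1}{q}\bigr)$, so the shifted quantity $E[l^t_0(u)] + 1/q$ grows at least multiplicatively.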

Undirected Walk without Self-loop.
We now consider the model mentioned when motivating Model 1 in which a new connection is made to a random neighbor of a randomly selected node. We show that there is a node, namely the initial node, that in expectation has degree linear in the size of the random tree produced. Thus, the self-loop in Model 1 is crucial for producing natural graphs.
THEOREM 3.8. Under the undirected walk without self-loop model, the expected number of leaves connected to the initial node in the random tree produced is $\Omega(n)$, where $n$ is the number of nodes.
Proof. Let $L_n$ be the number of leaves connected to the initial node $v_0$ at step $n$ and $D_n$ be the degree of the initial node $v_0$ at time $n$.
Suppose we are at step $n$. With probability at least $L_n/n$, a leaf of $v_0$ would be picked and after one jump, a new connection would be made to $v_0$, causing the number of leaves connected to $v_0$ to increase by 1. On the other hand, with probability $\frac{1}{n} \cdot \frac{L_n}{D_n}$, the initial node $v_0$ is picked and after one jump a new connection is made to an existing leaf, causing the number of leaves connected to $v_0$ to decrease by 1. Hence
Hence, $E[Z_n] \ge \prod_{i=3}^{n-1} (1 + 1/i)\, E[Z_3] = \Omega(n)$ and so $E[L_n] \ge \Omega(n)$.
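The linear growth of the number of leaves at $v_0$ is easy to observe in simulation. The sketch below is an illustration only: it takes $k = 1$, assumes that $v_0$'s single self-loop appears once in its neighbor list, and uses the hypothetical function name `root_leaf_fraction`. Because the fraction varies considerably from run to run, it is averaged over several runs.

```python
import random

def root_leaf_fraction(n, rng):
    """Grow an n-node tree by the undirected 1-step walk; return L_n / n."""
    neighbors = [[0]]                  # v_0 with one self-loop, listed once
    for t in range(1, n):
        start = rng.randrange(t)                 # uniform start node
        target = rng.choice(neighbors[start])    # always take exactly one step
        neighbors.append([target])
        neighbors[target].append(t)
    # a leaf of v_0 is a node whose only neighbor is v_0
    leaves = sum(1 for v in range(1, n) if neighbors[v] == [0])
    return leaves / n

if __name__ == "__main__":
    rng = random.Random()
    for n in (100, 1000, 10000):
        runs = [root_leaf_fraction(n, rng) for _ in range(50)]
        print(n, sum(runs) / len(runs))
```

If the theorem's conclusion holds, the printed averages should stay bounded away from 0 as $n$ grows, rather than shrinking.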

Experimental results
All experiments were the average of 100 runs with a size of $n = 100{,}000$ nodes and $k = 1$, i.e., the random graph produced is a tree. In each case, we investigate how the average proportion $P_d$ of nodes having degree $d$ varies with $d$. Since we wish to observe whether the degree distribution follows a power law, we plot $\log_{10} P_d$ against $\log_{10} d$, for $d$ up to 40. All four models exhibit power-law-like behavior. Figure 5 shows the degree distribution for the four models and they behave similarly, although the maximum degree seen is much larger for the directed models than for the undirected ones.
Figure 1 shows experimentally that the power-law phenomenon exhibited by the degree distribution becomes more apparent as the probability $p$ decreases and the degree $d$ increases. Notice that for $p = 1$, this is just the Erdős-Rényi random graph model, which does not obey the power law. Moreover, the maximum degree seen for $p = 1$ is only about 20. As $p$ gets smaller the graph can be fitted better with a straight line. On the other hand, the portion of the graph corresponding to large degrees can be fitted well with a straight line. Note that even
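The quantities plotted in the figures can be computed as in the sketch below. It is not the original experiment code. The helper names are illustrative, the caller supplies one list of node degrees per simulated graph, and the least-squares slope is an added convenience for judging how close the points lie to a straight line.

```python
import math
from collections import Counter

def average_degree_proportions(runs):
    """runs: list of degree lists, one per graph. Returns {d: average P_d}."""
    totals = Counter()
    for degrees in runs:
        counts = Counter(degrees)
        for d, c in counts.items():
            totals[d] += c / len(degrees)   # proportion of nodes with degree d
    return {d: s / len(runs) for d, s in totals.items()}

def loglog_points(proportions, max_degree=40):
    """(log10 d, log10 P_d) for 1 <= d <= max_degree with P_d > 0."""
    return [(math.log10(d), math.log10(p))
            for d, p in sorted(proportions.items())
            if 1 <= d <= max_degree and p > 0]

def fitted_slope(points):
    """Least-squares slope of the second coordinate against the first."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    return sxy / sxx
```

A power law $P_d \propto d^{-\gamma}$ appears as a straight line of slope $-\gamma$ in these coordinates, which is why the plots are read by eye for linearity.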

Directed walk with coin flips.
We do not have a proof, but Figure 2 is very similar to Figure 1, which indicates that in this case the degrees may be following a power law.

Undirected walk with self-loops.
We do not know how to analyze this model yet. As seen in Figure 3, there are indications that a power-law phenomenon is exhibited by large degrees. On the other hand, the distribution of degrees may follow some other nice distribution that is not very far from a power law (e.g., a log-normal distribution).

Undirected walk with coin flips.
Like the previous model, this model is not easy to analyze. But Figure 4 shows that the degree sequence does not look too different from that of the undirected walk with self-loops model. We know theoretically that if p is very small the degree sequence will tend closer to a power law. Figure 4 indeed shows that for p = 0.05, the graph can be better fitted with a straight line.

Conclusions and Open Questions
In this paper we present some initial analysis and experimental results for several simple random-surfer models for web-graph construction. The models are similar in spirit to the copying model of [13], and in fact the directed case of Model 1, for k = 1, is identical to both the copying model and preferential attachment. There are many open questions, including:
1. In the case of the directed walk with coin flipping, we can analyze the expected virtual degrees and provide some concentration bounds, but do not have a formal proof that the virtual degrees necessarily follow a power law. Furthermore, even assuming this is the case, we do not have a proof that this implies that the actual degrees must follow a power law, though our experimental results suggest that they do. Thus, can one give a formal proof that the degrees indeed follow a power law for this model?
2. For the case of the undirected walk with self-loops, we know that as p goes to 0, this walk approaches the preferential-attachment distribution. However, experimentally, even for p = 1/2 the degrees follow some heavy-tailed distribution. Can one give a formal analysis of the degree distribution in this case?
3. Finally, another issue brought out by this work is that differences between degrees of nodes in the (real) web graph may not necessarily be due to a distinction in quality, but rather just the result of a random walk process. Thus, if one is using degree as a measure of quality, one may just be picking out nodes that have been around the longest. Instead, some measure that examines the degree of a node relative to what one would expect given the time the node has been in the system might be more appropriate.