Main

The existence of a power law in the growth of the web not only implies the lack of any length scale for the web, but also allows the expected number of sites of any given size to be determined without exhaustively crawling the web. The distribution of site sizes for crawls by Alexa and Infoseek is shown in Fig. 1. Both data sets display a power law over several orders of magnitude, so on a log–log scale the distribution of the number of pages per site appears as a straight line. This distribution should not be confused with Zipf-like distributions1,2, in which a power law arises from rank-ordering the variables3.

Figure 1: Log–log plot of the distribution of pages in sites for Alexa and Infoseek crawls, which covered 259,794 and 525,882 sites, respectively.

There is a drop-off at approximately 10^5 pages because server limitations mean that search engines do not systematically collect more pages per site than this. A linear regression on the variables log(number of sites) and log(number of pages) yielded [1.647, 1.853] as the 95% confidence interval for the exponent β in the Alexa crawl, and [1.775, 1.909] for the Infoseek crawl. These estimates for the power-law slope are consistent across the two data sets and with the model, which predicts that β is greater than 1.
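As a rough illustration of how such an exponent estimate can be obtained, the sketch below fits a straight line to logarithmically binned site-size counts on a log–log scale. The data, binning and random seed are synthetic placeholders, not the Alexa or Infoseek crawl data.

```python
# Sketch: estimating a power-law exponent beta by linear regression of
# log(number of sites) on log(number of pages per site). Synthetic data only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic site sizes drawn from a classical Pareto distribution, P(n) ~ n^-2.
sizes = np.floor(rng.pareto(1.0, size=200_000) + 1.0)

# Logarithmic bins; dividing counts by bin width recovers counts per unit size,
# which is what the per-size regression needs.
edges = np.logspace(0, np.log10(sizes.max()), 30)
counts, _ = np.histogram(sizes, bins=edges)
density = counts / np.diff(edges)
centers = np.sqrt(edges[:-1] * edges[1:])       # geometric bin centres
mask = counts > 0

# Slope of the log-log fit gives -beta; stderr gives a rough 95% half-width.
fit = stats.linregress(np.log(centers[mask]), np.log(density[mask]))
beta, half_width = -fit.slope, 1.96 * fit.stderr
print(f"beta ≈ {beta:.2f}, 95% CI ≈ [{beta - half_width:.2f}, {beta + half_width:.2f}]")
```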

In order to describe the growth process underlying this distribution4, we assume that the day-to-day fluctuations in site size are proportional to the size of the site. One would not be surprised to find that a site with a million pages has lost or gained a few hundred pages on any given day. On the other hand, finding an additional hundred pages on a site with just ten pages within a day would be unusual. So we assume that the number of pages on the site, n, on a given day, is equal to the number of pages on that site on the previous day plus or minus a random fraction of n.

If a set of sites is allowed to grow with the same average growth rate but with individual random daily fluctuations in the number of pages added, their sizes will be distributed log-normally after a sufficiently long period of time5. A log-normal distribution gives high probability to small sizes and small, but significant, probability to very large sizes. But although it is skewed and has a long tail, the log-normal distribution is not a power-law one.
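The multiplicative growth assumption and its log-normal consequence can be illustrated with a short simulation; the mean growth rate and fluctuation size below are arbitrary illustrative choices, not measured values.

```python
# Minimal sketch of purely multiplicative growth: all sites start at the same
# time with the same expected growth rate, and each day a site's size changes
# by a random fraction of its current size.
import numpy as np

rng = np.random.default_rng(1)

n_sites, n_days = 50_000, 1_000
sizes = np.ones(n_sites)                     # every site starts with one page

for _ in range(n_days):
    # Daily change is a random fraction of the current size (multiplicative noise).
    fluctuation = rng.normal(loc=0.001, scale=0.05, size=n_sites)
    sizes *= 1.0 + fluctuation

# After many steps log(size) is approximately normal, i.e. sizes are log-normal.
log_sizes = np.log(sizes)
print(f"mean(log n) = {log_sizes.mean():.2f}, std(log n) = {log_sizes.std():.2f}")
```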

Two additional factors that determine the growth of the web need to be considered: sites appear at different times and grow at different rates. The number of web sites has been growing exponentially since the web's inception, which means that there are many more young sites than old ones. Once the age of the site is factored into the multiplicative growth process, P(n), the probability of finding a site of size n, is a power law, that is, it is proportional to n^−β. Similarly, considering sites with a wide distribution of growth rates yields the same result: a power-law distribution in site size. The simple assumption of stochastic multiplicative growth, combined with the fact that sites appear at different times and/or grow at different rates, therefore leads to an explanation of the observed power-law behaviour.
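A sketch of the effect of different starting times is given below: site ages are drawn from an exponential distribution (many young sites, few old ones, as implied by exponential growth in the number of sites), and each site then grows multiplicatively for its own age. The parameters are illustrative assumptions, not fitted values.

```python
# Sketch: multiplicative growth with exponentially distributed site ages.
# Mixing log-normal sizes over such ages produces a heavy, power-law-like tail.
import numpy as np

rng = np.random.default_rng(2)

n_sites = 50_000
# Ages in days: exponential distribution means most sites are young.
ages = rng.exponential(scale=500.0, size=n_sites).astype(int) + 1

sizes = np.ones(n_sites)
for day in range(ages.max()):
    active = ages > day                           # sites old enough to grow on this day
    noise = rng.normal(0.001, 0.05, size=active.sum())
    sizes[active] *= 1.0 + noise

# A log-log histogram of the resulting sizes is approximately a straight line.
counts, edges = np.histogram(sizes, bins=np.logspace(0, np.log10(sizes.max()), 20))
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    if c:
        print(f"{lo:12.1f} – {hi:12.1f}: {c}")
```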

The existence of this universal power law, which is yet another example of the strong regularities6,7 revealed by studies of the web, also has practical consequences. The expected number of sites of any given size can be estimated, even if a site of that size has not yet been observed. This can be achieved by extrapolating the power law to any large n; for example, P(n_2) = P(n_1) × (n_2/n_1)^−β. The expected number of sites of size n_2 in a crawl of N sites would be N × P(n_2). For instance, from the Alexa data we can infer that, if data were collected from 250,000 sites, the probability of finding a site with a million pages would be 10^−4. This information is not readily available from the crawl alone, as it stops at 10^5 pages per site.
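A minimal numeric sketch of this extrapolation follows; the exponent, the reference size n_1 and its probability P(n_1) are placeholder values chosen only to show the arithmetic, not figures taken from the crawls.

```python
# Sketch of extrapolating a fitted power law to a site size larger than any
# observed in the crawl. beta, n1 and p_n1 below are illustrative placeholders.
beta = 1.75                        # assumed exponent within the reported intervals
n1, p_n1 = 1.0e4, 5.0e-3           # hypothetical reference size and its probability
n2 = 1.0e6                         # target size, beyond the 10^5-page cutoff

p_n2 = p_n1 * (n2 / n1) ** (-beta)          # P(n2) = P(n1) * (n2/n1)^-beta
n_crawled = 250_000
print(f"P(n2) ≈ {p_n2:.2e}")
print(f"expected number of such sites in the crawl ≈ {n_crawled * p_n2:.2f}")
```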