**5. Insights from probability and statistics**

Probability and statistics can give insight into experiments or scientific studies whose outcomes are numbers. For example, if one tosses what one thinks is a fair die many times, for each toss one can count the number of dots that appears. One can add these numbers up and divide by the number of tosses, getting the *mean* or average number of dots that appear. The mean is an example of a *measure of central tendency*, a number which represents the "typical" outcome from rolling the die. (Other examples of measures of central tendency are the mode and median.) In addition to computing a number which gives a measure of central tendency for data, one can measure how far spread out the data are from, say, the mean. Such a number is often called a *measure of dispersion*. Measures of dispersion attempt to give insight into how spread out the data are about the number which expresses its central tendency. One very commonly used measure of dispersion is the *variance* of the data (or the related standard deviation). The variance for a data set from its mean is computed by finding the difference from the mean for each measurement, squaring these numbers and dividing the sum of these numbers by the number of measurements. Thus, the variance is the average "distance" that data values are from the mean. Using these two numbers one gets a surprising amount of information about the data involved.

One can also draw a graph of the frequency with which outcomes occur. These distributions have characteristic shapes depending on the processes involved in the way data are generated. For discrete situations the sum of the "probabilities" of the outcomes adds to one. For situations where there are a non-discretely infinite collection of outcomes, one requires a curve the area under which (i.e. between the positive valued curve and the axis) is 1. For a fair die, one would think that each outcome would occur about equally often and that the "distribution" of the outcomes is *uniform*. For other processes the distribution of the outcomes follows the so-called *Gauss curve, normal distribution, *or* bell-shaped curve *(scaled so that the area under the curve is 1):

Mathematicians have discovered many distributions other than the uniform distribution and the normal distribution. Among these are the Poisson distribution, the exponential distribution, and the Erlang distribution. Each of these distributions arises from a particular class of applied problems. For example, the Erlang distribution finds many applications in the field of telephony.

In a complex social network it is natural to consider the valences of the vertices in the network. We can write down the frequency with which each valence occurs in the graph and divide by the number of vertices in the graph. We thereby obtain the relative frequencies of the valences of the vertices. For example, one might look at the undirected graph which represents long distance telephone traffic for a single day. More specifically, the vertices represent phone numbers and two vertices are joined by an edge if a phone call was placed in either "direction" between these two numbers that day. In many complicated networks it has been empirically observed that the cumulative distribution of the degrees of the vertices obeys a power law. This means that the probability that in such a network there will be a vertex of degree greater than or equal to *x* is given by *x*^{-a} where *a* is a constant. M.E.J. Newman (U. of Michigan) reports that on data on graphs from power grids, protein interactions, citations, mathematical collaborations, and other very varied networks the cumulative degree distribution displays power law behavior. Sometimes networks which exhibit a power law distribution for some feature of the networks are also known as scale-free. The idea here is that a function is called *scale-free* if it obeys *f*(*ax*) = *bf*(*x*) where *a* and *b* are constants.

The following is an idea of a statistical kind that has been put to fruitful use. There are 13 different possible situations which can arise for a directed graph with three vertices which has a connected underlying graph (i.e. the graph obtained by disregarding the arrows is connected). These are illustrated in the diagram below:

Figure 1

One empirical approach to complex networks is to examine the connected underlying triples and compute the statistical distribution of the different types of triples; that is, the percentage of each type of (connected) triple for the different triples that appear in the graph. One now can try to see what networks which have similar distributions of triples have in common.

A similar approach can be given for connected 4-vertex subgraphs of large graphs. The six possible subgraphs of this kind are shown below :

Figure 2

(Note that in the bottom right graph in Figure 2 there are two edges that intersect at a point that is not a vertex of the graph.)

Using the data one collects on the basis of looking at the distributions of subgraphs of different kinds, one can try to cluster the networks that have different distributions and see what features might explain their similar characteristics.