The topology of large Open Connectome networks for the human brain

The structural human connectome (i.e. the network of fiber connections in the brain) can be analyzed at ever finer spatial resolution thanks to advances in neuroimaging. Here we analyze several large data sets for the human brain network made available by the Open Connectome Project. We apply statistical model selection to characterize the degree distributions of graphs containing up to nodes and edges. A three-parameter generalized Weibull (also known as a stretched exponential) distribution is a good fit to most of the observed degree distributions. For almost all networks, simple power laws cannot fit the data, but in some cases there is statistical support for power laws with an exponential cutoff. We also calculate the topological (graph) dimension D and the small-world coefficient σ of these networks. While σ suggests a small-world topology, we found that D < 4, showing that long-distance connections provide only a small correction to the topology of the embedding three-dimensional space.

where $L$ is the likelihood function, $\hat{v}$ the model parameters that maximize $L$, and $K$ the number of parameters. The difference between AIC and BIC is the prefactor $a$ in the last term,

$$a = \begin{cases} 2 & \text{for AIC}, \\ \ln(N) & \text{for BIC}, \end{cases} \tag{S2}$$

where $N$ is the number of data points. For $N \geq 8$, the BIC thus gives a larger penalty than the AIC for every additional parameter in the model. BIC-based model selection is known to be consistent in the sense that it almost surely selects the true model if it is among the candidates and $N$ is infinitely large [1]. AIC-based model selection is generally not consistent, but unlike the BIC it is, in statistical parlance, efficient: in the limit $N \to \infty$ the AIC selects a candidate model with the smallest Kullback-Leibler divergence from the true model, even if the true model is not among the candidates [2,3]. In most biological applications the true model is indeed unlikely to be included in the candidate set because it usually contains a large number of unknown parameters, which may furthermore increase with $N$. The consistency of the BIC and the efficiency of the AIC are asymptotic statements valid for $N \to \infty$. While it is insightful to know under which conditions the AIC or BIC are asymptotically optimal, these properties are ultimately of theoretical rather than practical relevance for finite data sets. The best guidance on which criterion to choose comes from simulation studies involving finite data.
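The relation between the two penalties can be checked with a few lines of Python. This is only a minimal sketch: it assumes that the criterion has the standard form $-2 \ln L(\hat{v}) + aK$, and the function name `information_criterion` and the numbers passed to it are hypothetical, chosen for illustration.

```python
import math


def information_criterion(log_likelihood, n_params, n_data, kind="AIC"):
    """Return -2 ln L + a K with a = 2 (AIC) or a = ln N (BIC).

    Assumes the standard form of the criterion; `log_likelihood` is the
    maximized log-likelihood ln L, `n_params` is K and `n_data` is N.
    """
    if kind == "AIC":
        a = 2.0
    elif kind == "BIC":
        a = math.log(n_data)
    else:
        raise ValueError("kind must be 'AIC' or 'BIC'")
    return -2.0 * log_likelihood + a * n_params


# Smallest integer N for which the BIC penalty per parameter, ln N,
# exceeds the AIC penalty per parameter, 2.
smallest_n = next(n for n in range(1, 100) if math.log(n) > 2.0)

# Hypothetical example: same fit, same K, scored by both criteria.
aic = information_criterion(-100.0, 3, 1000, kind="AIC")
bic = information_criterion(-100.0, 3, 1000, kind="BIC")
```

Because $\ln 7 < 2 < \ln 8$, `smallest_n` evaluates to 8, in line with the statement above that the BIC penalizes each additional parameter more strongly than the AIC once $N \geq 8$.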
With a random number generator we created finite data that pose a challenge to the model selection algorithm similar to that posed by the observed degree distributions. The data are $N = 10^6$ random numbers, independently sampled from the distribution of Eq. (S3), where $k_\text{obs}$ is the generated random integer and the prefactor on the second line is

$$C(k_c, \alpha, \beta) = (k_c + \alpha)^{\beta} \exp\left[-\beta (k_c + \alpha)^{-1} k_c\right].$$

The tail of this distribution is a power law. If $k$ were real-valued rather than an integer, the right-hand side of Eq. (S3) would be differentiable everywhere. We have chosen this distribution for two reasons. First, we want to test whether it is possible to determine with information criteria that the true model is a power law (POW in Table 1). Second, we want to find out which criterion is better at determining the precise crossover point $k_c$ in the presence of random fluctuations. The motivation is to verify that, if the brain were indeed scale-free, we would correctly identify the power-law model and its parameterization. We fixed $\alpha = 10$, $\beta = 2.5$ and allowed $k_c$ to vary.
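The role of the prefactor $C(k_c, \alpha, \beta)$ can be illustrated numerically. The sketch below assumes, purely for illustration, that the part of the distribution below the crossover is the exponential $\exp[-\beta k / (k_c + \alpha)]$; this form is an assumption, chosen because it joins the power-law tail $C\,(k + \alpha)^{-\beta}$ smoothly. The value $k_c = 100$ is the one fixed later in the text, and all function names are hypothetical.

```python
import math

ALPHA, BETA = 10.0, 2.5  # values fixed in the text
K_C = 100.0              # crossover point; example value used later in the text


def prefactor(k_c, alpha, beta):
    """C(k_c, alpha, beta) = (k_c + alpha)^beta * exp(-beta * k_c / (k_c + alpha))."""
    return (k_c + alpha) ** beta * math.exp(-beta * k_c / (k_c + alpha))


def tail(k):
    """Power-law tail C * (k + alpha)^(-beta), used for k > k_c."""
    return prefactor(K_C, ALPHA, BETA) * (k + ALPHA) ** (-BETA)


def tail_slope(k):
    """Derivative of the tail with respect to a real-valued k."""
    return -BETA * prefactor(K_C, ALPHA, BETA) * (k + ALPHA) ** (-BETA - 1.0)


def head(k):
    """ASSUMED form of the distribution for k <= k_c (not taken from the text)."""
    return math.exp(-BETA * k / (K_C + ALPHA))


def head_slope(k):
    """Derivative of the assumed head with respect to a real-valued k."""
    return -BETA / (K_C + ALPHA) * head(k)


# Both the values and the slopes agree at the crossover k = k_c.
values_match = math.isclose(head(K_C), tail(K_C), rel_tol=1e-12)
slopes_match = math.isclose(head_slope(K_C), tail_slope(K_C), rel_tol=1e-12)
```

Under this assumption, both flags are true: the prefactor is exactly what is needed for the two pieces to meet without a jump or a kink at $k_c$.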
For our first exploratory Monte Carlo simulations, we compared whether AIC- or BIC-based model selection is more successful at determining the true $k_c$ from the POW candidate models of Eq. (S4), where $\alpha$, $\beta$ and $A_1, \ldots, A_{k_c}$ are adjustable parameters. The crossover point $k_c$ is not itself a parameter in Eq. (S4), but rather an index for different candidate models. We summarize the results in Fig. S1. The AIC reconstructs the crossover point almost exactly over the entire range $0 \leq k_c \leq 150$. The BIC underestimates $k_c$, especially when $k_c$ is large. The reason lies in the BIC's steeper penalty for additional parameters in Eq. (S2). Another obstacle for the BIC is that the true model (i.e. the distribution of Eq. S3 with the three parameters $\alpha$, $\beta$ and $k_c$) is generally not one of the candidates of Eq. (S4) because those contain more parameters than necessary, in particular one parameter $A_i$ for every $0 < i \leq k_c$. The AIC, by contrast, correctly chooses the candidate that describes the input best among the wrong (i.e. overparameterized) models: for $k > k_c$ the distribution decays $\propto (k + \alpha)^{-\beta}$, whereas there is no power law in the region $0 < k \leq k_c$.
Having found that the AIC is better suited for model selection in the present context, we investigate next whether the AIC reliably identifies the correct model among those listed in Table 1. We generate 10 independent sets of $N = 10^6$ random numbers each. These are drawn from the distribution of Eq. (S3) with $\alpha = 10$, $\beta = 2.5$, and now we also fix $k_c = 100$. For 8 out of these 10 sets, the power law had the smallest AIC. In the remaining two, the power law's AIC differed from the minimum only by $\Delta_\text{POW} = 0.05$ in one case and $\Delta_\text{POW} = 0.30$ in the other case. As we have explained after Eq. (5), such small $\Delta_\text{POW}$ are still interpreted as substantial empirical support for the power law. We conclude that, if the connectome's degree distribution were scale-free, the AIC would have correctly identified a power law as a plausible candidate model.
We have tested whether this statement remains true even if the observed network is not the full scale-free network, but only a random sample of 20% of the edges. Such subsampling of the true network might mimic the fact that the data are collected at a finite resolution, so that only a fraction of the fiber tracts might be detected. For a fixed power-law degree distribution ($\alpha = 10$, $\beta = 2.5$, $N = 2 \cdot 10^6$, $k_c = 0$), we first construct a corresponding network with the configuration model [4]. We then randomly select 20% of the edges. The degrees of this subgraph have the distribution given in Ref. [5]. Here $k_\text{true}$ is the true degree of a node (which in our model follows the power-law distribution POW of Table 1 of the main text) and $k_\text{obs}$ is the degree observed in the sampled network. We removed all nodes with degree zero from the network to be consistent with the procedure used by the OCP.
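The effect of edge sampling on the degrees can be sketched in a few lines of Python. This is a simplified illustration rather than the procedure used here: it assumes that each edge is kept independently with probability 0.2, so that every edge end of a node survives with that probability and the observed degree is a binomially thinned version of the true degree. It neither builds a configuration-model network nor fixes the exact number of retained edges, and the function name `thin_degrees` and the input degrees are hypothetical.

```python
import random


def thin_degrees(true_degrees, keep_prob, rng):
    """Return observed degrees after keeping each edge end with probability keep_prob.

    Simplifying assumption: the edge ends of a node are thinned independently,
    so the observed degree is Binomial(k_true, keep_prob). Nodes whose observed
    degree is zero are dropped, mirroring the removal of isolated nodes.
    """
    observed = []
    for k_true in true_degrees:
        # Count how many of the k_true edge ends survive the sampling.
        k_obs = sum(rng.random() < keep_prob for _ in range(k_true))
        if k_obs > 0:
            observed.append(k_obs)
    return observed


rng = random.Random(0)     # fixed seed for reproducibility
true_degrees = [10] * 1000  # hypothetical input: 1000 nodes of true degree 10
observed_degrees = thin_degrees(true_degrees, 0.2, rng)
```

Dropping the degree-zero nodes matters for the fit, because it removes probability mass at $k_\text{obs} = 0$ and thereby changes the shape of the observed distribution at small degrees.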
Applying AIC-based model selection with the candidates of Table 1, we find that the differences between the heavy-tailed distributions POW, LGN, TPW and GWB are smaller for the sampled network than for the full network. In ten test runs, we could always rule out EXP and WBL, and in one case LGN, because $\Delta > 10$; for all other candidates $\Delta$ was below this threshold. Still, POW had the smallest AIC in eight out of ten cases and came in a close second in the remaining two. In summary, AIC-based model selection would have correctly included the power law in the set of plausible candidates even if the observed network had been a small sample of a network with a scale-free degree distribution.
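The screening step described above amounts to computing, for each candidate, the difference $\Delta$ between its AIC and the smallest AIC in the set, and discarding candidates with $\Delta > 10$. A minimal Python sketch follows; the AIC values are made up for illustration and are not results from the simulations, and the function names are hypothetical.

```python
def aic_differences(aic_by_model):
    """Return Delta_i = AIC_i - min_j AIC_j for every candidate model."""
    best = min(aic_by_model.values())
    return {name: aic - best for name, aic in aic_by_model.items()}


def plausible_models(aic_by_model, threshold=10.0):
    """Return the names of candidates whose Delta does not exceed the threshold."""
    deltas = aic_differences(aic_by_model)
    return sorted(name for name, delta in deltas.items() if delta <= threshold)


# Hypothetical AIC values for the six candidate families of Table 1.
example_aic = {
    "POW": 1000.0,
    "TPW": 1000.25,
    "GWB": 1002.5,
    "LGN": 1008.0,
    "WBL": 1015.0,
    "EXP": 1200.0,
}

deltas = aic_differences(example_aic)
kept = plausible_models(example_aic)
```

With these illustrative numbers, EXP and WBL are ruled out while the four heavy-tailed candidates remain in the plausible set.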
There is one fine detail to note when the dimension $K$ of the model approaches the sample size $N$. In such cases the AIC is a negatively biased estimate of the Kullback-Leibler information [6]. An approximate correction for this bias is the additional penalty term $\frac{2K(K+1)}{N-K-1}$ in Eq. (4). This modified version of the AIC is commonly abbreviated as