Introduction

The large volumes of data produced by recent technological developments, such as next-generation sequencing and expression profiling, are providing novel and complementary ways of studying biological systems. In order to extract meaningful, statistically significant information from such data, mathematical methods are being developed, implemented and tested in various contexts. For example, it is believed that most tumors are due to somatic mutations that lead to uncontrolled cell growth. Next-generation sequencing technologies produce hundreds of gigabases of genetic data, providing a way to identify genes responsible for the tumorigenic process by comparing the genome of the tumor with that of the normal tissue [1–7].

In this note, we point out some interesting properties of the ratios of natural numbers obtained in a biological/clinical setting. The ratios of interest can be seen as sampled from a distribution over the rational numbers in the unit interval. Consider pairs of positive integers, n and m, sampled from a distribution with probability f(n, m). The ratio q = n/(n + m) of one of these numbers to the sum of the two is a rational number in the unit interval. In this way the distribution f(n, m) gives rise to a distribution g(q) supported on the rational numbers in the unit interval. A case of particular interest is when the two integers are drawn independently from the same distribution h(n). As we are going to see, in this case and for h being one of certain common distributions, such as the exponential and the power-law, it is possible to obtain a closed-form expression for g. We will also see that the resulting distributions over the rational numbers possess certain self-similarity properties. Namely, the overall shape of those distributions is similar to Thomae's function (Figure 1, top left). Although it is not relevant to our discussion, we would like to point out that, like Thomae's function, the distributions which we study are rather interesting analytically because, viewed as functions over the reals, they are continuous on the irrational numbers but not on the rationals.

Figure 1

Thomae's function, a self-similar function over the rational numbers in the unit interval (top left).

The human genome is diploid with two strands per chromosome. The reads covering a position of the genome can originate from each of the four strands (top right). For every position, the ratio between the number of reads from one of the strands to the total number of reads from the chromosome and the ratio between the number of reads from the chromosome to the total number of reads covering the position are rational numbers. The distribution of each of these ratios follows a self-similar distribution (bottom).

We will illustrate the appearance of such distributions in real-life data with two examples: 1) a next-generation sequencing experiment aimed at identifying genomic variations in cancers and 2) diagnosis data collected at the New York Presbyterian Hospital in several consecutive years. Although the presence of irregular shapes and spikes in empirically occurring distributions of ratios of natural numbers was reported before as a statistical artifact [8], the authors of this previous work failed to acknowledge the interesting mathematical structure of the underlying distributions. In this work we propose the study of those naturally occurring distributions of rational numbers as an interesting mathematical topic with important clinical and biological applications.

Results

First example: identifying genomic alterations with next-generation sequencing

Our first example comes from a next-generation sequencing experiment of a diffuse large B-cell lymphoma (DLBCL) sample [6,7]. DLBCL is the most common B-cell non-Hodgkin lymphoma in adults, accounting for ≈40% of all new lymphoma diagnoses. Tumor DNA was extracted from a nodal tumor of a 63-year-old female patient. The coding part of the genome (the exome) was enriched using Roche NimbleGen Sequence Capture and the enriched product was sequenced using Roche 454 sequencing. The experiment produced 2 · 10^6 reads (sequences of DNA) of average length 250 nucleotides. The reads were aligned to the hg18/NCBI36.1 reference human genome. This resulted in a coverage of about 10x of the human exome and the alignment was used to identify genomic variants distinguishing normal and tumor cells. Figure 1 (top right) shows a diagram of the alignment algorithm and the fractal-like distributions obtained from the sequencing experiment (bottom).

Figure 2 (top, blue) shows the distribution of the depth (the number of reads covering a particular position), i.e. the coverage, after alignment of the reads. The figure also shows a negative binomial least-squares fit of the data. If the reads were obtained from the genome independently and uniformly at random, one would expect the coverage to follow a Poisson distribution. In practice, even though the coverage restricted to a small part of the genome might be Poisson, the means of the Poisson processes in different parts of the genome vary because of the way the sample was processed before sequencing. The result is an overdispersed depth distribution that is better fit by the negative binomial, which is known to be a mixture of Poisson distributions with Gamma-distributed means.
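The statement that the negative binomial is a Gamma mixture of Poissons can be checked numerically. The following is a minimal sketch and not the fitting procedure used for Figure 2: the parameterization NegBin(r, s)(k) = C(k + r − 1, k)(1 − s)^r s^k, the values r = 3 and s = 0.7 and all function names are assumptions made for illustration.

```python
import math

def negbin_pmf(k, r, s):
    """NegBin(r, s) mass at k: C(k+r-1, k) * (1-s)^r * s^k (assumed parameterization)."""
    return math.exp(math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
                    + r * math.log(1 - s) + k * math.log(s))

def gamma_poisson_pmf(k, r, s, upper=200.0, steps=20000):
    """Mass at k of a Poisson whose mean is Gamma(shape=r, scale=s/(1-s)) distributed.

    The integral over the mean is approximated with the midpoint rule on (0, upper].
    """
    theta = s / (1 - s)
    h = upper / steps
    total = 0.0
    for i in range(steps):
        lam = (i + 0.5) * h
        log_gamma_density = ((r - 1) * math.log(lam) - lam / theta
                             - math.lgamma(r) - r * math.log(theta))
        log_poisson = k * math.log(lam) - lam - math.lgamma(k + 1)
        total += math.exp(log_gamma_density + log_poisson)
    return total * h

r, s = 3.0, 0.7  # illustrative values, not fitted to the sequencing data
max_gap = max(abs(negbin_pmf(k, r, s) - gamma_poisson_pmf(k, r, s)) for k in range(20))
mean = r * s / (1 - s)
variance = r * s / (1 - s) ** 2
print(f"largest pmf gap: {max_gap:.2e}; mean {mean:.2f}, variance {variance:.2f}")
```

The variance of the mixture exceeds its mean, which is the overdispersion relative to a single Poisson described above.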

Figure 2

Coverage in the cancer sequencing experiment (top).

Coverage of the two copies of the cancer genome (bottom left). Coverage of the two strands of a fixed copy of the cancer genome (bottom right).

Each of the 46 chromosomes of the human genome has two strands and, with the exception of the sex chromosomes X and Y, the human genome is diploid, i.e. each chromosome has a homologous copy. Since the reference genome is given as entirely haploid, the information about which copy of the genome a sample read originates from is not recovered by the alignment. Nonetheless, assuming that a read can originate from each copy of the genome with equal probability and given the coverage of the reference, one can obtain a theoretical coverage of a fixed copy of the genome. Thus the fraction of positions on a fixed copy of the genome covered with k reads is

$$p(k) = \sum_{t \ge k} \binom{t}{k}\, 2^{-t}\, q(t),$$

where q(t) is the fraction of positions with coverage t, as given in Figure 2 (top, blue). After some simple algebra it can be shown that, if q is Poiss(λ), then p is Poiss(λ/2). Furthermore, since the negative binomial is a mixture of Poissons with Gamma-distributed means, we obtain that if q is NegBin(r, s), then p is NegBin(r, (s/2)/(1−s/2)). Figure 2 (top, green) shows the theoretical coverage of a fixed copy of the human genome obtained from these considerations. Similar reasoning leads us to a predicted coverage of a fixed strand of the human genome shown in Figure 2 (top, black).
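The two thinning claims can be verified numerically. The sketch below assumes that each of the t reads at a position comes from the fixed copy independently with probability 1/2, so that p(k) is the sum over t ≥ k of C(t, k) 2^(−t) q(t). It also assumes the parameterization NegBin(r, s)(k) = C(k + r − 1, k)(1 − s)^r s^k. The parameter values and function names are illustrative only.

```python
import math

def poisson_pmf(k, lam):
    """Poisson mass at k, computed in log space."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def negbin_pmf(k, r, s):
    """NegBin(r, s) mass at k: C(k+r-1, k) * (1-s)^r * s^k (assumed parameterization)."""
    return math.exp(math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
                    + r * math.log(1 - s) + k * math.log(s))

def copy_coverage(q, k, t_max=300):
    """p(k) = sum over t >= k of C(t, k) * 2^(-t) * q(t), truncated at t_max."""
    return sum(math.comb(t, k) * 0.5 ** t * q(t) for t in range(k, t_max + 1))

lam, r, s = 10.0, 3.0, 0.7          # illustrative values
s_half = (s / 2) / (1 - s / 2)      # predicted parameter after thinning

poisson_gap = max(
    abs(copy_coverage(lambda t: poisson_pmf(t, lam), k) - poisson_pmf(k, lam / 2))
    for k in range(30))
negbin_gap = max(
    abs(copy_coverage(lambda t: negbin_pmf(t, r, s), k) - negbin_pmf(k, r, s_half))
    for k in range(30))
print(f"Poisson gap: {poisson_gap:.2e}, negative binomial gap: {negbin_gap:.2e}")
```

Applying `copy_coverage` twice gives the corresponding prediction for a fixed strand of a fixed copy, under the same equal-probability assumption.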

Although the alignment to the reference does not provide exact information about the origin of a read in the sample, we can still test the prediction about the coverage of a fixed copy of the cancer genome in the following way: take sufficiently many heterozygous positions, i.e. positions at which the two copies of the genome differ and then consider the number of reads covering such a position and containing one of the variants at that position and the number of reads containing the other variant. Those two depth distributions should be close to the predicted distribution of the coverage of a fixed copy of the genome. Figure 2 (bottom left, blue and red) shows the result of these considerations. Here we took only the positions of exonic single nucleotide polymorphisms documented in the NCBI's dbSNP database, which are covered sufficiently well in the experiment (total of ≈3000 heterozygous positions). Figure 2 (bottom left, green) contains the predicted coverage of the two copies of the human genome as obtained earlier. Furthermore, Figure 2 (bottom right) shows similar plots for a fixed strand of the genome. Since the information about the strand from which a sample read originates is also lost in the sequencing, here we used the orientation of a read when aligned to the reference as a surrogate for its strand. As can be seen, the predictions closely follow the data, confirming our intuition that the reads come from the four strands of the genome independently.

Our main observation is concerned with the heterozygous positions we used to obtain the data for Figure 2 (bottom). This time we consider the distribution of the ratios of the number of reads covering one of the variants at a particular position in the cancer genome to the total number of reads covering this position and the ratio of the number of reads covering one of the strands to the total number of reads covering the variant. The resulting distributions of ratios are given in Figure 1 (bottom, blue). There are two apparent features of the distributions which drew us to studying them: first, their fractal-like self-similar structure and second, the spikes they contain. We consider the topic of the self-similarity of the distributions in the Methods section and quantify it by computing the fractal dimension of related functions. From a biological point of view the spikes are interesting because at first sight one might decide that they show overrepresentation of certain ratios. For example, for the distribution of variant depth over the total depth, the spike at 0.5 is expected, since we are looking at heterozygous positions, but the spikes at 0.33 and 0.66 are harder to explain biologically since they would mean the significant presence of variants with ploidy other than 2. While such phenomena can occur in cancers because they can present genome aberrations known as copy number alterations, the scale at which the phenomenon is represented here is unusual. We will see that the spikes are due to the discreteness of the data and can be explained by a simple stochastic model. Hence, regarding the biological conclusions one can draw from next-generation sequencing experiments, the message of our note is that the stochastic effects due to the discreteness of biological data can be large and care should be taken when drawing conclusions lest one confuse such effects with real biological phenomena. A similar conclusion was drawn in [8]. In this note we further study the mathematical properties of the resulting distributions.

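To formalize the situation we first define the convolution over the rational numbers of two functions defined over the natural numbers. Let

$$\mathbb{Q}_{[0,1]} = \mathbb{Q} \cap [0, 1]$$

be the set of rational numbers in the unit interval. For any two functions $f, g : \mathbb{N} \to \mathbb{R}$ define their convolution $c_{f,g} : \mathbb{Q}_{[0,1]} \to \mathbb{R}$ to be

$$c_{f,g}(q) = \sum_{\substack{(n, m) \ne (0, 0) \\ n/(n+m) = q}} f(n)\, g(m).$$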

In Figure 1 (bottom left, red) we have also plotted the convolution $c_{p,p}$ of the negative-binomially distributed predicted coverage p of the two copies of the cancer genome as given in Figure 2 (bottom left, green). In Figure 1 (bottom right, red) we have done the same for the coverage of a fixed strand. As can be seen, the convolutions follow closely the empirical distributions of ratios. This observation is consistent with the null hypothesis of reads originating from the four strands of the human genome independently and covering a particular position on the genome with a negative-binomial distribution. No further assumption seems to be necessary to explain the irregular shapes of the ratio distributions.
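To make the origin of the spikes concrete, the convolution can be computed directly by accumulating f(n) g(m) at the exact rational n/(n + m). The sketch below is an illustration under assumptions, not the code behind Figure 1: the negative binomial parameters (r = 3, s = 0.625, mean 5) are hypothetical, the sums are truncated at `n_max`, and the function names are ours.

```python
import math
from collections import defaultdict
from fractions import Fraction

def negbin_pmf(k, r, s):
    """NegBin(r, s) mass at k: C(k+r-1, k) * (1-s)^r * s^k (assumed parameterization)."""
    return math.exp(math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
                    + r * math.log(1 - s) + k * math.log(s))

def ratio_convolution(f, g, n_max):
    """Sum f(n) * g(m) over pairs (n, m) != (0, 0) with n, m <= n_max, keyed by n/(n+m)."""
    c = defaultdict(float)
    for n in range(n_max + 1):
        for m in range(n_max + 1):
            if n + m > 0:  # the pair (0, 0) does not define a ratio
                c[Fraction(n, n + m)] += f(n) * g(m)
    return dict(c)

def p(k):
    """Hypothetical per-copy coverage distribution (not fitted to the data)."""
    return negbin_pmf(k, 3.0, 0.625)

c_pp = ratio_convolution(p, p, n_max=60)
for q in (Fraction(1, 2), Fraction(1, 3), Fraction(2, 3), Fraction(29, 59)):
    print(q, f"{c_pp[q]:.3e}")
```

Ratios with small denominators collect contributions from many pairs (n, m), while a ratio such as 29/59 receives almost none, so spikes at 1/2, 1/3 and 2/3 appear without any biological signal.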

Figure 3

Comparing the co-morbidity of various conditions with the 2009 H1N1 pandemic versus seasonal influenza.

We would like to finish the exposition in this section by noting that the observed structures are not particular to the Roche 454 sequencing technology and can be observed in sequencing experiments performed with other sequencing platforms, e.g. Illumina's Solexa and Life Technologies' SOLiD.

Second example: electronic clinical data

The development and implementation of electronic clinical records has made available large amounts of longitudinal clinical data. The primary application of electronic clinical data is to improve the quality of health care provided to individual patients [9]. Although using these data for uncovering large-scale correlations and trends is secondary to this, the impact such data mining will have on public health is indisputable [10]. Some specific areas which will be influenced by such analyses are the creation of alert systems for emerging infectious diseases, the identification of populations at risk and the measurement of the efficacy and efficiency of public health measures. A recent example of this is provided by the 2009 H1N1 influenza pandemic. The first wave of the new influenza strain infected a considerable part of the world population at the end of spring and the beginning of the summer of 2009 [11,12]. Evaluating the impact of the new pandemic strain on public health involved analyzing large clinical datasets [13,14,15].

The New York Presbyterian Hospital has an electronic repository with the longitudinal clinical records of more than 2 million patients. An example of the large-scale analysis enabled by these data is the identification of populations that are at higher risk of morbidity/mortality from the new pandemic influenza virus versus seasonal influenza, for instance, people with asthma, children, pregnant women, etc. [15]. The approach we took for this analysis was to compare the number of people with a given condition who were affected by seasonal or pandemic influenza at different time points. Towards this goal, for every two diseases identified by their ICD9 codes, we can obtain from the electronic health records the number of people who have been affected by both diseases. Although this might differ from the established terminology, for the purpose of this note we will call this number the co-morbidity of the two diseases. In this way for a fixed disease we can obtain its co-morbidity with all other possible diseases. If we do this for two diseases, which in our analysis we take to be seasonal and pandemic influenza, we can then compare the sets of co-morbidities and look for conditions enriched with respect to one of the diseases but not the other. Figure 3 (top left) shows the distribution of co-morbidities with seasonal and pandemic influenza. As can be seen, these distributions are long-tailed and can be modeled with power-law distributions. The results of the power-law fits are also shown in Figure 3 (top left).

For a particular health condition, an important measure of the risk of being infected by seasonal versus pandemic influenza for people who have had this condition is the ratio of the number of people who have had both that condition and seasonal influenza, i.e. the co-morbidity with seasonal flu, to the total number of people who have had the condition and either kind of influenza, i.e. the sum of the co-morbidities with seasonal and pandemic flu. We have plotted the distribution of these ratios in Figure 3 (top right, blue). As can be seen, its shape has the self-similar structure of interest to us. From the discussion so far one might be tempted to model this distribution as the convolution of the power-law distributions modeling the two sets of co-morbidities. The result of this attempt is shown in Figure 3 (top right, green). The graph shows that in this case the convolution is not a good model because the empirical ratios are shifted to the left, whereas the convolution is not. In Figure 3 (bottom) we have plotted the pairs of co-morbidities for all conditions. The Spearman correlation coefficient for the two sets is 0.83 and linear regression shows that the co-morbidities for pandemic influenza are 1.3 times the corresponding co-morbidities for seasonal influenza. Hence one might suppose that the discrepancy is due to the fact that the pairs of co-morbidities are not independent, whereas the convolution defined above assumes that the two distributions are independent.

To avoid this obstacle we reconsidered our model for the distribution of co-morbidities and asked the following question: what is the source of the long tail of this distribution? Our stipulation is that 1) for a fixed pair of diseases the co-morbidity, observed at different time points, is Poisson distributed; 2) the means of these Poissons vary from one pair of diseases to another; and 3) the distribution of these means is long-tailed. The first two stipulations are trivial if one accepts the simplifying assumption that for every disease (or pair of diseases) there is a fixed probability that a particular person will be afflicted with this disease at a particular moment. The third stipulation is supported by our experience with the electronic health records and is akin to the informal observation that there is no universal scale at which diseases happen in the human population. To model the long-tailed distribution of the two sets of co-morbidities we use the fact that a mixture of Poissons with power-law distributed means has a power-law distributed tail (see the Methods section). In Figure 3 (top left, black) we have plotted the result of a mixture of Poissons with power-law distributed means.

Next we claim that the observed distribution of ratios is a mixture of convolutions of pairs of Poissons where the mixing is with the same power-law distribution used for the distribution of co-morbidities. More precisely, suppose that the co-morbidity of a fixed condition with seasonal influenza is Poisson with mean λs and its co-morbidity with the pandemic strain is Poisson with mean λp. From our observation on the dependence between the two sets of co-morbidities, we can say that λp = γλs for some γ. Hence the risk ratio of this condition with the two kinds of influenza will be distributed according to the convolution of the two Poissons, which we denote by $c_{\lambda_s,\lambda_p}$. Since the mean of $c_{\lambda_s,\lambda_p}$ is λs/(λs + λp) = 1/(1 + γ) (see the Methods section), for γ ≠ 1 this mean will be shifted away from 1/2 depending on γ. Our model of the distribution for pairs of co-morbidities is a power-law mixture of distributions choosing the two co-morbidities independently according to two Poissons, i.e.

$$f(n, m) = \int g_\alpha(\lambda)\, \mathrm{Poiss}(\lambda)(n)\, \mathrm{Poiss}(\gamma\lambda)(m)\, d\lambda,$$

where $g_\alpha(\lambda) \propto \lambda^{-\alpha}$ and $\mathrm{Poiss}(\lambda)(n) = e^{-\lambda}\lambda^n/n!$. Note that although f(n, m) is not a product distribution, i.e. its marginals are not independent, it is a mixture of such distributions. Finally, the distribution of risk ratios is given by

$$g(q) = \sum_{n/(n+m) = q} f(n, m).$$

Figure 3 (top right, green) shows the result of these considerations. We observe a good fit between the empirical distribution to the right of 1/2 and the new model and the predicted overall shift of the model to the left. The apparent discrepancy between the empirical and the mixture model for ratios less than 1/2 can be attributed to the discrepancy at low co-morbidities between the mixture and empirical co-morbidity distributions observed in Figure 3 (top left). Since the goal of this note is to give examples of and draw attention to the interesting self-similar distributions appearing in empirical data, rather than to explore one particular example in detail, we leave the further analysis of the distribution of co-morbidities and the risk ratios derived from them to a future work.
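The leftward shift predicted by the mixture model can be reproduced with a small numerical sketch. Everything specific below is an assumption made for illustration: the integral over the Poisson means is replaced by a coarse grid of four values weighted by λ^(−α), the exponent α = 2 is arbitrary, and γ = 1.3 is borrowed from the regression slope quoted earlier. Counts are truncated at `n_max`.

```python
import math
from collections import defaultdict
from fractions import Fraction

def poisson_pmf(k, lam):
    """Poisson mass at k, computed in log space."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def mixture_ratio_distribution(alpha, gamma, lambdas, n_max):
    """Distribution of n/(n+m) when n ~ Poisson(lam), m ~ Poisson(gamma*lam) and
    lam is drawn from the grid `lambdas` with weights proportional to lam^(-alpha)."""
    weights = [lam ** (-alpha) for lam in lambdas]
    g = defaultdict(float)
    for lam, w in zip(lambdas, weights):
        seasonal = [poisson_pmf(n, lam) for n in range(n_max + 1)]
        pandemic = [poisson_pmf(m, gamma * lam) for m in range(n_max + 1)]
        for n in range(n_max + 1):
            for m in range(n_max + 1):
                if n + m > 0:  # a ratio needs at least one affected person
                    g[Fraction(n, n + m)] += w * seasonal[n] * pandemic[m]
    norm = sum(g.values())
    return {q: v / norm for q, v in g.items()}

gamma = 1.3                                   # slope reported by the regression
g_model = mixture_ratio_distribution(alpha=2.0, gamma=gamma,
                                     lambdas=[1.0, 2.0, 4.0, 8.0], n_max=50)
mean_ratio = sum(float(q) * v for q, v in g_model.items())
print(f"mean ratio {mean_ratio:.4f}, predicted 1/(1+gamma) = {1 / (1 + gamma):.4f}")
```

Because every component of the mixture has the same conditional mean 1/(1 + γ), the mean of the mixture does not depend on the mixing weights; only the shape of the spikes does.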

Closed form for the convolution

As a step towards understanding the mathematical properties of functions over the rational numbers in the unit interval obtained as the convolution of functions over the natural numbers, we attempted to obtain a closed form, i.e. an expression in terms of known functions, for some of them. Ideally, given the considerations above, it would be interesting to obtain a closed form for the convolution of two negative binomials or two Poissons. Although we were not able to obtain a closed form in those cases, in the Methods section we present a general method for computing arbitrary moments of the convolution when moment generating functions are available. The most general class of distributions for which we were able to obtain a closed form is power-laws with geometric cut-off. Note that the power-law and the geometric distributions belong to this class and that a negative binomial variable with integer parameter r is a sum of r independent geometric variables.

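Let g be the probability mass function of a variable distributed according to a power-law with geometric cut-off with parameters α, β ≥ 0 such that β > 0 or α > 1, i.e.

$$g(n) = \frac{n^{-\alpha} e^{-\beta n}}{\mathrm{Li}_\alpha(e^{-\beta})}, \qquad n = 1, 2, \ldots,$$

where $\mathrm{Li}_s(z) = \sum_{n \ge 1} z^n / n^s$ is the polylogarithm function. In particular

$$\mathrm{Li}_s(1) = \zeta(s) \quad \text{and} \quad \mathrm{Li}_0(z) = \frac{z}{1 - z}.$$

Then, for coprime a and b,

$$c_{g,g}\left(\frac{a}{a+b}\right) = \frac{(ab)^{-\alpha}\, \mathrm{Li}_{2\alpha}\left(e^{-\beta(a+b)}\right)}{\mathrm{Li}_\alpha\left(e^{-\beta}\right)^2}.$$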

Power-law

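Take β = 0 and α > 1. Then, for coprime a and b,

$$c_{g,g}\left(\frac{a}{a+b}\right) = \frac{\zeta(2\alpha)}{\zeta(\alpha)^2}\, (ab)^{-\alpha}.$$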

Geometric

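Take α = 0, β > 0. Then $g(n) = (e^{\beta} - 1)\, e^{-\beta n}$ and, for coprime a and b,

$$c_{g,g}\left(\frac{a}{a+b}\right) = \frac{(e^{\beta} - 1)^2}{e^{\beta(a+b)} - 1}.$$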
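A closed form of this kind can be checked against a brute-force sum. The sketch below assumes that g(n) is proportional to n^(−α) e^(−βn) on n ≥ 1, so that for coprime a and b the convolution at a/(a + b) is the sum over k ≥ 1 of g(ka) g(kb), which evaluates to (ab)^(−α) Li_{2α}(e^(−β(a+b))) / Li_α(e^(−β))². The truncated polylogarithm series, the parameter values and the function names are ours and are only adequate for β > 0.

```python
import math

def polylog(s, z, terms=5000):
    """Truncated series Li_s(z) = sum over n >= 1 of z^n / n^s, adequate for 0 <= z < 1."""
    return sum(z ** n / n ** s for n in range(1, terms + 1))

def brute_force(a, b, alpha, beta, k_max=2000):
    """Sum g(k*a) * g(k*b) over k >= 1 for coprime a, b, with g(n) ~ n^(-alpha) e^(-beta n)."""
    norm = polylog(alpha, math.exp(-beta))
    def g(n):
        return n ** (-alpha) * math.exp(-beta * n) / norm
    return sum(g(k * a) * g(k * b) for k in range(1, k_max + 1))

def closed_form(a, b, alpha, beta):
    """(ab)^(-alpha) * Li_{2 alpha}(e^(-beta (a+b))) / Li_alpha(e^(-beta))^2."""
    norm = polylog(alpha, math.exp(-beta))
    return (a * b) ** (-alpha) * polylog(2 * alpha, math.exp(-beta * (a + b))) / norm ** 2

def geometric_closed_form(a, b, beta):
    """Special case alpha = 0: (e^beta - 1)^2 / (e^(beta (a+b)) - 1)."""
    return (math.exp(beta) - 1) ** 2 / (math.exp(beta * (a + b)) - 1)

for a, b in [(1, 1), (1, 2), (2, 3)]:
    print((a, b), brute_force(a, b, 1.5, 0.3), closed_form(a, b, 1.5, 0.3))
```

The pure power-law case β = 0 is left out of the sketch because the truncated series converges slowly there; the zeta function would be needed instead.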

Uniform

Although this example does not present a distribution appearing naturally in the discussion above, we believe it is fundamental enough to mention here. Furthermore, as discussed in the Methods section, this example is related to Thomae's function, because a certain infinite analogue of it has the same fractal dimension.

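For a natural number L let fL be the probability mass function which is uniform on the set {1, 2, …, L}, i.e.

$$f_L(n) = \begin{cases} 1/L, & 1 \le n \le L, \\ 0, & \text{otherwise.} \end{cases}$$

Then, for coprime a and b,

$$c_{f_L,f_L}\left(\frac{a}{a+b}\right) = \frac{1}{L^2} \left\lfloor \frac{L}{\max\{a, b\}} \right\rfloor.$$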
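For the uniform case the convolution reduces to counting: with coprime a and b, the pairs (n, m) in {1, …, L}² with n/(n + m) = a/(a + b) are exactly (ka, kb) with k·max{a, b} ≤ L, which gives ⌊L/max{a, b}⌋/L². The following sketch, with function names of our choosing, compares this count with a brute-force enumeration.

```python
from fractions import Fraction
from math import gcd

def uniform_convolution_bruteforce(a, b, L):
    """Count pairs (n, m) in {1..L}^2 with n/(n+m) = a/(a+b), i.e. n*b == m*a, over L^2."""
    hits = sum(1 for n in range(1, L + 1) for m in range(1, L + 1) if n * b == m * a)
    return Fraction(hits, L * L)

def uniform_convolution_closed(a, b, L):
    """floor(L / max(a, b)) / L^2 after reducing (a, b) to lowest terms."""
    d = gcd(a, b)
    a, b = a // d, b // d
    return Fraction(L // max(a, b), L * L)

L = 30
for a, b in [(1, 1), (1, 2), (2, 3), (7, 30)]:
    print((a, b), uniform_convolution_bruteforce(a, b, L), uniform_convolution_closed(a, b, L))
```

Multiplying the result by L and letting L grow gives 1/max{a, b}, which is within a factor of two of 1/(a + b).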

Thomae's function

This function, supported on the rational numbers in the unit interval, is not a distribution. It is a classic example of a function which is constant almost everywhere and yet discontinuous on a dense set. It can be beautifully interpreted as the view from the corner of Euclid's orchard – an imaginary orchard which contains a tree at every point with integer coordinates. Although it probably is not the convolution of functions over the natural numbers, the fact that versions of it appeared in our empirical data was a pleasant surprise to us and one of the main motivations for this study. In the Methods section we will show that the graph of this function has a fractal dimension 3/2.
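Thomae's function and the orchard picture are easy to state in code. The sketch below uses exact rationals, so a ratio a/(a + b) is reduced automatically; the function names are ours, and the visibility test is the standard fact that a lattice point is visible from the origin exactly when its coordinates are coprime.

```python
from fractions import Fraction
from math import gcd

def thomae(q):
    """Thomae's function at a rational q in [0, 1]: one over the reduced denominator."""
    return Fraction(1, Fraction(q).denominator)

def thomae_ratio(a, b):
    """Value at a/(a+b); the Fraction is reduced, so the result is gcd(a, b)/(a+b)."""
    return thomae(Fraction(a, a + b))

def visible_from_origin(a, b):
    """A tree at the lattice point (a, b) is visible from the corner iff gcd(a, b) = 1."""
    return gcd(a, b) == 1

# Peaks at ratios with small denominators, as in Figure 1 (top left).
for q in (Fraction(1, 2), Fraction(1, 3), Fraction(2, 3), Fraction(3, 7)):
    print(q, thomae(q))
```

Each visible tree (a, b) corresponds to one rational a/(a + b), and the value of the function there is 1/(a + b).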

Discussion

We have presented a set of self-similar distributions supported on the rational numbers in the unit interval. These functions appear pervasively in the analysis of large datasets when models for the distribution of ratios of natural numbers are required. The examples presented in this manuscript are drawn from next-generation sequencing data obtained as part of a study on the identification of somatic mutations, on one hand and understanding disease co-morbidity as it is reflected in electronic clinical data, on the other. One can envisage further applications in clinical and biological settings in which the estimation of a frequency or ratio is necessary. Such examples are provided by the detection of subclonal populations in tumor samples, e.g. as part of a study on resistance to chemotherapy; the study of quasi-species and intrahost viral populations, e.g. in HIV and influenza; and studies of drug effectiveness, populations at risk in a pandemic and other topics in clinical research approachable through the analysis of risk ratios. We hope that our presentation will stimulate further study of the functions presented here and provide a bridge between interesting theoretical work and important clinical applications.

Methods

Fractal dimensions

The distributions we considered in this note exhibit a self-similar fractal structure. We are interested in calculating the fractal dimension of those structures. More precisely, given a function f defined on the rational numbers in the unit interval, define G(f) to be the set of line segments in the plane from (q, 0) to (q, f(q)) for all rational q in the unit interval. The fractal dimension of the set G(f) is defined as

$$\dim G(f) = \lim_{\varepsilon \to 0} \frac{\log N(\varepsilon)}{\log (1/\varepsilon)},$$

where N(ε) is the number of squares of size ε needed to cover G(f). If f is summable, i.e. $\sum_q f(q) < \infty$, e.g. f is a probability distribution, then dim G(f) = 1. Hence, our attention will focus on the fractal dimension of more general non-normalizable functions defined on the rational numbers in the unit interval.

For a given α ≥ 0, let

$$f_\alpha\left(\frac{a}{a+b}\right) = (ab)^{-\alpha} \quad \text{for coprime } a, b.$$

From the discussion on the closed form for the convolution it follows that for α > 1, fα is normalizable and hence, in this case, dim G(fα) = 1. Also trivially dim G(f0) = 2. It would be interesting to obtain dim G(fα) for α ∈ (0, 1]. The following calculations from [16] should be helpful in obtaining this dimension.

Let fT be Thomae's function, fT(a/(a + b)) = 1/(a + b) for coprime a and b. We will show that dim G(fT) = 3/2. Since max{a, b} = Θ(a + b), one can think of Thomae's function as the infinite analogue of the convolution of the uniform distribution on {1,…, L} extended to L = ∞.

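Let Fn be the n-th Farey sequence, i.e. $F_n = (x_i)$ is the sequence of all rational numbers $x_i = a_i/c_i$ in lowest terms such that $0 \le a_i \le c_i$ and $c_i \le n$, sorted in increasing order. Let $T_i$ be the area of the trapezoid between the x-axis and the line segment with points $(x_{i-1}, f_T(x_{i-1}))$ and $(x_i, f_T(x_i))$. Then

$$T_i = \frac{(x_i - x_{i-1})\left(f_T(x_{i-1}) + f_T(x_i)\right)}{2} = \frac{c_{i-1} + c_i}{2\, c_{i-1}^2\, c_i^2},$$

where we use that $x_i - x_{i-1} = 1/(c_{i-1} c_i)$.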
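The neighbour identity used here is a standard property of Farey sequences and can be checked with the usual next-term recurrence. The sketch below, whose function names are ours, generates the n-th Farey sequence with exact rationals, verifies that consecutive terms with denominators c and c′ differ by 1/(c c′), and evaluates the area of the trapezoid between two neighbours under Thomae's function.

```python
from fractions import Fraction

def farey(n):
    """Return the n-th Farey sequence as a list of Fractions in increasing order."""
    a, b, c, d = 0, 1, 1, n
    seq = [Fraction(a, b)]
    while c <= n:
        k = (n + b) // d
        a, b, c, d = c, d, k * c - a, k * d - b   # standard next-term recurrence
        seq.append(Fraction(a, b))
    return seq

def neighbour_gaps_ok(n):
    """Check x_i - x_{i-1} == 1 / (c_{i-1} * c_i) for all consecutive terms."""
    seq = farey(n)
    return all(y - x == Fraction(1, x.denominator * y.denominator)
               for x, y in zip(seq, seq[1:]))

def trapezoid_area(x, y):
    """Area under the segment joining (x, 1/c) and (y, 1/c') for neighbours x < y."""
    return (y - x) * (Fraction(1, x.denominator) + Fraction(1, y.denominator)) / 2

print(farey(5))
print(neighbour_gaps_ok(12))
```

For neighbours with denominators c and c′ the value returned by `trapezoid_area` equals (c + c′)/(2c²c′²), which follows directly from the gap identity.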

Let $A_n$ be the area under the piece-wise linear curve with points $(x_i, f_T(x_i))$, $x_i \in F_n$. We will calculate $A_n - A_{n-1}$ for n ≥ 3. Consider two consecutive members $a_{i-1}/c_{i-1}$ and $a_i/c_i$ of $F_{n-1}$, which have an element $y_j = (a_{i-1} + a_i)/(c_{i-1} + c_i)$ of $F_n$ inserted between them. Then $c_{i-1} + c_i = n$ and

For every n > a > 0, if d = (a, n) there exist unique 0 < n′ < n and 0 ≤ a′ < a such that d = (a′, n′), $n'a - a'n = d^2$, a′ < n′ and $a'' = a - a' \le n - n' = n''$. If d = 1, then (a, n) = 1 and we have that $a'/n', a''/n'' \in F_{n-1}$ are consecutive and $a/n \in F_n$ is inserted between them. Hence

where we let .

Since A2 = 1 and lim k→∞ Ak = 0 we obtain that

Since $\sum_{b \mid n} b\, G_b = H_n$, where $H_n$ is the n-th harmonic number, it follows from Möbius inversion that

We are ready to obtain an asymptotic expression for Ak . Namely

Let $\varepsilon_k = \min_i \{x_i - x_{i-1}\} = 1/(k(k-1))$, where the minimum is over the elements of $F_k$. We need

$$\Theta(k^3) = \Theta\left(\varepsilon_k^{-3/2}\right)$$

squares of size $\varepsilon_k$ to cover the set G(fT). Hence dim G(fT) = 3/2.
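The value 3/2 can be probed with a crude grid count. The sketch below is a numerical illustration rather than the argument of the text: it divides the unit interval into columns of width ε = 1/N, finds the smallest denominator c of a rational in each closed column (the tallest segment there has height 1/c) and counts ⌈N/c⌉ boxes of side ε for that column. The function names and the two grid sizes are our choices, and the slope between two finite scales only approximates the limit.

```python
import math

def min_denominator(j, n_cols):
    """Smallest c such that some fraction a/c lies in [j/n_cols, (j+1)/n_cols]."""
    c = 1
    # An admissible numerator exists iff ceil(j*c/N) <= floor((j+1)*c/N).
    while -(-j * c // n_cols) > (j + 1) * c // n_cols:
        c += 1
    return c

def box_count(n_cols):
    """Number of grid boxes of side 1/n_cols needed for the tallest segment per column."""
    return sum(-(-n_cols // min_denominator(j, n_cols)) for j in range(n_cols))

def dimension_estimate(n_small=256, n_large=4096):
    """Slope of log N(eps) against log(1/eps) between two grid sizes."""
    return math.log(box_count(n_large) / box_count(n_small)) / math.log(n_large / n_small)

estimate = dimension_estimate()
print(f"box-counting slope between the two scales: {estimate:.3f}")
```

Columns containing a rational with a small denominator dominate the count, which is why the slope lies strictly between the values 1 and 2 of a curve and a filled square.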

Let $(y_i)$ be the sequence of rational numbers $a/(a+b)$ such that $a, b \le k$, sorted in increasing order. Using similar arguments as above we can show that the length $L_{\alpha,k}$ of the curve with points $(y_i, f_\alpha(y_i))$ satisfies

Let $A_{\alpha,k}$ be the area under the curve with points $(y_i, f_\alpha(y_i))$. Furthermore, let $\delta_k = \min_i \{y_i - y_{i-1}\} = \Theta(k^{-2})$ and $N_{\alpha,k}$ be the number of squares of size $\delta_k$ necessary to cover G(fα). Since we obtain that for α ∈ [0, 1]

We believe that this lower bound is an equality.

Moments of the convolution

In this section we derive an expression for the moments of the convolution of distributions on the natural numbers in terms of their moment generating functions. Using this expression we show that the mean of the convolution of any distribution with itself is 1/2. In the specific case of a convolution of two Poissons with means λ and µ we show that the mean is λ/(λ + µ) and the variance is

where

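Consider two distributions f and g on the natural numbers and define

$$m_s = \sum_{(n, m) \ne (0, 0)} \left(\frac{n}{n+m}\right)^{s} f(n)\, g(m).$$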

Note that the s-th moment of the convolution of f and g is ms /m0. We have that m0 = 1−f(0)g(0) and for s > 0

where χf and χg are the moment generating functions of f and g and integration is over the domain .
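The quantities m_s can also be evaluated by direct summation, which gives a quick check of the claims about the mean. The sketch below assumes that m_s is the sum of (n/(n + m))^s f(n) g(m) over all pairs (n, m) other than (0, 0); it truncates the sums at `n_max` and uses Poisson means 3 and 5 chosen purely for illustration. It does not use the moment generating function representation.

```python
import math

def poisson_pmf(k, lam):
    """Poisson mass at k, computed in log space."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def moment(f, g, s, n_max=80):
    """m_s = sum over (n, m) != (0, 0) of (n/(n+m))^s * f(n) * g(m), truncated at n_max."""
    total = 0.0
    for n in range(n_max + 1):
        for m in range(n_max + 1):
            if n + m > 0:
                total += (n / (n + m)) ** s * f(n) * g(m)
    return total

lam, mu = 3.0, 5.0   # illustrative Poisson means

def f(n):
    return poisson_pmf(n, lam)

def g(m):
    return poisson_pmf(m, mu)

m0 = moment(f, g, 0)
mean_fg = moment(f, g, 1) / m0                     # expected: lam / (lam + mu)
variance_fg = moment(f, g, 2) / m0 - mean_fg ** 2
mean_ff = moment(f, f, 1) / moment(f, f, 0)        # expected: 1/2
print(f"m0 = {m0:.6f}, mean = {mean_fg:.6f}, variance = {variance_fg:.6f}, "
      f"self-convolution mean = {mean_ff:.6f}")
```

The self-convolution mean is 1/2 by symmetry: exchanging n and m maps the ratio q to 1 − q without changing the weight f(n) f(m).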

If f = g, then

Assume that f and g are Poisson with means λ and µ. Let σ = λ + µ. Then

and

Mixing Poissons

For α > 1 let Mα be a mixture of Poissons whose means are power-law distributed with exponent α, i.e.

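For k ≫ α − 1 we have that

$$M_\alpha(k) \propto k^{-\alpha},$$

i.e. the mixture has a power-law tail with the same exponent α.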
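The power-law tail of the mixture can be checked numerically. The sketch below makes an assumption that is ours: the means follow the density (α − 1) λ^(−α) on [1, ∞), the lower cutoff 1 being needed for normalization. The integral over the means is approximated with the midpoint rule, and the local exponent of the tail is estimated from two values of k.

```python
import math

def mixture_pmf(k, alpha, steps=20000):
    """M_alpha(k) for Poisson means with density (alpha - 1) * lam^(-alpha) on [1, inf).

    The lower cutoff at 1 is an assumption made here for normalization.
    """
    upper = k + 40 * math.sqrt(k) + 50        # the integrand is negligible beyond this
    h = (upper - 1) / steps
    total = 0.0
    for i in range(steps):
        lam = 1 + (i + 0.5) * h
        total += math.exp((k - alpha) * math.log(lam) - lam - math.lgamma(k + 1))
    return (alpha - 1) * total * h

def local_exponent(k1, k2, alpha):
    """Slope of log M_alpha(k) against log k between k1 and k2."""
    return math.log(mixture_pmf(k2, alpha) / mixture_pmf(k1, alpha)) / math.log(k2 / k1)

alpha = 2.5                                   # illustrative exponent
slope = local_exponent(50, 100, alpha)
print(f"local exponent between k = 50 and k = 100: {slope:.3f} (compare with {-alpha})")
```

For large k the integrand concentrates near λ ≈ k, so the cutoff at 1 has no visible effect on the tail and the estimated exponent approaches −α as k grows.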