Defining Individual-Level Genetic Diversity and Similarity Profiles

Classic concepts of genetic (gene) diversity (heterozygosity), such as Nei & Li’s nucleotide diversity, were defined in a population context. Although variation is usually measured at the population level, the basic carriers of variation are individuals. Hence, measuring the variation (e.g., SNPs) of an individual against a reference genome, which has largely been overlooked, is worthwhile in its own right. Indeed, a similar practice is traditional in community ecology, where the basic unit of diversity measurement is the individual community sample. We propose to use Renyi’s-entropy-based Hill numbers to define individual-level genetic diversity and similarity, and we demonstrate the definitions with the SNP (single nucleotide polymorphism) datasets from the 1000-Genomes Project. Hill numbers, derived from Renyi’s entropy (of which Shannon’s entropy is a special case), have found wide application, including in measuring quantum information entanglement and ecological diversity. The demonstrated individual-level SNP diversity not only complements existing population-level genetic diversity concepts, but also offers building blocks for comparative genetic analysis at higher levels. The concept of an individual covers, but is not limited to, an individual chromosome, a region of a chromosome, gene cluster(s), or a whole genome. Similarly, SNPs can be replaced by other structural variants or mutation types such as indels.


Concepts and Definitions
Let us start with a brief review of species diversity (also known as community diversity, biodiversity or ecological diversity) to explain the two essential elements of the diversity concept in general, which should facilitate the introduction of our SNP diversity and similarity measures below. Species diversity refers to the ecological diversity of species in an ecological community, but the diversity concept is equally applicable to genetic diversity (e.g., Nei 1973, Wehenkel et al., Bergmann et al.) 13,23,24 or to other entities such as metagenome diversity (Ma and Li) 20 . Conceptually, diversity possesses two essential elements: the variety and the variability of varieties (Gaston; Chao et al.) 10,25 . For example, the two elements of species diversity are species (variety) and the variability of species abundances. To quantify species diversity, one surveys a community (usually by sampling), counts the abundance of each species, and obtains the relative abundance of species i, $p_i$ = (the number of individuals of species i)/(the total number of individuals of all species in the community), together with the number of species in the community (S). The dataset from such a survey is a vector of relative species abundances of the form ($p_1$, $p_2$, …, $p_i$, …, $p_S$). One approach to characterizing such a vector of relative abundances (frequencies) is to fit a statistical distribution, known as the species abundance distribution (SAD) in community ecology. The most widely used SADs include the log-series, log-normal, and power-law distributions. A common property of SADs is that they are highly skewed, long-tailed distributions that rarely follow the normal or uniform distribution. In other words, the SAD is highly aggregated (skewed or non-random), just like the non-random SNP distribution mentioned in the introduction section.
Although a SAD fully describes the species abundance frequencies and therefore captures the full characteristics of species diversity, it does not provide an intuitive measure that synthesizes the two elements of diversity (i.e., variety and variability). An alternative to fitting a SAD is to use diversity metrics (also known as measures or indexes). Numerous metrics for measuring species diversity have been proposed, with Shannon's entropy being the best known.
Diversity metrics belong to the so-called aggregation functions, which combine several values into a single value (Beliakov et al., James) 6,7 . The arithmetic mean (average) is the most commonly used aggregation function, but it is a rather poor metric for measuring diversity because of the highly non-random distribution of species abundances. Entropy-based aggregation functions are better suited to measuring diversity. The first, and still one of the most widely used, entropy-based diversity metrics is Shannon entropy, attributed to Claude Shannon, the co-founder of information theory (Shannon, Shannon & Weaver) 8,26 , although Shannon never studied biodiversity himself. Rather, ecologists borrowed the idea from Shannon's information theory, in which Shannon's entropy measures the information content or uncertainty in communication systems. Shannon's entropy is sufficiently general for measuring biodiversity because diversity is essentially heterogeneity, and both heterogeneity and uncertainty can be measured by the change of information, i.e., information lowers uncertainty. Using Shannon entropy as an example, species diversity (H), or more accurately species evenness, can be computed with the following formula:
$$H = -\sum_{i=1}^{S} p_i \ln p_i \qquad (1)$$
where S is the number of species in the community, and $p_i$ is the relative abundance of species i in the community. In terms of the variety-variability notion for defining diversity, the variety is the species and the variability is the species abundance. In fact, the variety-variability notion can be used to define diversity for any system (not limited to biological systems) that can be abstracted as the two elements of variety and variability, including SNP diversity, as explained below.
Definitions for SNP diversities.
By analogy, a chromosome with many loci is similar to an ecological community of many species, and each locus may have a different number of SNPs. Under the variety-variability notion of diversity, the locus is the variety (similar to a species in a community), and the number of SNPs at each locus is the variability (similar to species abundance in a community). Assuming S is the number of loci with any SNP, and $p_i$ is the relative abundance of SNPs at locus i (i.e., the number of SNPs at locus i divided by the total number of SNPs from all loci), SNP diversity can be measured with Shannon entropy (Eq. 1). Strictly speaking, SNP diversity may also be termed locus diversity, since the locus is essentially the 'habitat' where SNPs reside. Figure 1 conceptually illustrates the distribution of SNPs on a chromosome, specifically how $p_i$ is defined and computed.
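As a minimal sketch of the computation just described, the Python snippet below converts per-locus SNP counts into relative abundances $p_i$ and evaluates the Shannon entropy. The function names and the example counts are hypothetical illustrations and are not taken from the authors' R code.

```python
import math

def relative_abundances(snp_counts):
    """Return p_i for every locus that carries at least one SNP."""
    positive = [c for c in snp_counts if c > 0]
    total = sum(positive)
    if total == 0:
        raise ValueError("at least one locus must carry a SNP")
    return [c / total for c in positive]

def shannon_entropy(snp_counts):
    """Shannon entropy H = -sum(p_i * ln p_i), in nats."""
    return -sum(p * math.log(p) for p in relative_abundances(snp_counts))

# Hypothetical SNP counts at five loci of one chromosome (one locus has none).
counts = [4, 0, 2, 1, 1]
H = shannon_entropy(counts)
print(round(H, 4))
```

Loci without any SNP are dropped before normalization, which mirrors the definition of S as the number of loci with any SNP.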
Although Shannon's entropy has been widely used for measuring species diversity, a recent consensus among ecologists is that Hill numbers, which are based on Renyi's general entropy, offer the most appropriate metrics for measuring alpha-diversity and for multiplicatively partitioning beta-diversity (Chao et al. 2012, Ellison 2010, Kaplinsky & Arnaout) 9,10,12,19 . Given the advantages of Hill numbers over other existing diversity indexes, we believe that Hill numbers should also be the preferred choice for defining SNP diversity.
SNP alpha-diversity.
Hill numbers were derived by Hill (1973) based on Renyi's (1961) general entropy 15,16 . Here we propose to apply them to define the SNP alpha-diversity, i.e.,
$$^qD = \left(\sum_{i=1}^{G} p_i^{q}\right)^{1/(1-q)} \qquad (2)$$
where G is the number of gene loci with any SNP, $p_i$ is the relative abundance (i.e., the frequency of occurrence) of SNPs at locus i, q = 0, 1, 2, … is the order number of SNP diversity, and $^qD$ is the SNP alpha-diversity at diversity order q, i.e., the Hill number of the q-th order.
www.nature.com/scientificreports www.nature.com/scientificreports/
The Hill number is undefined for q = 1, but its limit as q approaches 1 exists in the following form:
$$^1D = \lim_{q \to 1} {}^qD = \exp\left(-\sum_{i=1}^{G} p_i \ln p_i\right) \qquad (3)$$
The diversity order (q) determines the sensitivity of the Hill number to the relative abundance (i.e., the frequency of occurrence) of SNPs. When q = 0, the SNP frequency does not count at all and $^0D = G$, i.e., the SNP richness, similar to species richness in the species diversity concept. When q = 1, $^1D$ equals the exponential of Shannon entropy and is interpreted as the number of SNPs with typical or common frequencies. Hence, the Shannon index is essentially a special case of Hill numbers at diversity order q = 1. When q = 2, $^2D$ equals the reciprocal of the Simpson index, i.e.,
$$^2D = 1 \Big/ \sum_{i=1}^{G} p_i^{2} \qquad (4)$$
which is interpreted as the number of dominant or very frequently occurring SNPs. Therefore, the two most widely used diversity indexes, the Shannon index and the Simpson index, are special cases of, or more accurately functions of, the Hill numbers.
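The Hill numbers and their q = 1 limit can be sketched in a few lines of Python. This is an illustrative implementation under the definitions above, not the authors' R code; `hill_number` and the example frequencies are hypothetical names and values.

```python
import math

def hill_number(p, q):
    """Hill number of order q for relative abundances p (which sum to 1)."""
    p = [x for x in p if x > 0]
    if abs(q - 1.0) < 1e-12:
        # Limit as q -> 1: exponential of Shannon entropy.
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))

# Hypothetical relative SNP abundances at four loci.
p = [0.5, 0.25, 0.125, 0.125]

# A diversity profile: Hill numbers at orders q = 0..4.
profile = {q: hill_number(p, q) for q in range(5)}
for q, d in profile.items():
    print(q, round(d, 4))
```

At q = 0 the value is the number of loci with any SNP, at q = 2 it is the reciprocal of the Simpson index, and the values do not increase as q grows. When all loci have equal SNP frequency, every order returns the number of loci.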
In general, we need to specify an entity (unit or scope) for defining and measuring SNP diversity. For demonstration purposes in this article, we choose the individual chromosome as the entity, similar to using the community for defining species diversity. The general interpretation of the diversity of order q is that a chromosome with $^qD = x$ is as diverse as a chromosome containing x loci with equal SNP frequencies. Note that the entity for defining SNP diversity can be another appropriate unit, such as the whole genome of an organism or a segment of a chromosome.
The above-defined SNP diversity measures the diversity of SNPs on an individual genetic entity (such as a chromosome or genome), similar to the concept of alpha-diversity in community species diversity, and we term it SNP alpha-diversity. In the following, we define the SNP counterparts of species beta-diversity and gamma-diversity in community ecology, i.e., SNP beta-diversity and SNP gamma-diversity.
SNP gamma diversity.
Whereas the SNP alpha-diversity defined above measures the SNP diversity within a genetic entity (such as a chromosome or genome), the SNP gamma-diversity measures the total SNP diversity of N pooled chromosomes from a population (cohort) of N different individuals, one chromosome from each individual and all with the same chromosome number.
Assuming there are N individuals in a population (cohort), we define the SNP gamma-diversity with the following formula, similar to the species gamma-diversity in ecology (e.g., Chao et al.):
$$^qD_\gamma = \left(\sum_{i=1}^{G} p_i^{q}\right)^{1/(1-q)} \qquad (5)$$
where $p_i$ is the SNP frequency at the i-th locus (i = 1, 2, …, G) in the pooled population of N individuals (termed the N-population). Comparing Eq. (5) for gamma-diversity with Eq. (2) for alpha-diversity reveals that the gamma-diversity is the Hill number based on the SNP frequency at the i-th locus in the N-population. Following the derivation of Chao et al. 9,10 and Chiu et al. 27 for species gamma-diversity in ecological communities, let $y_{ij}$ be the SNP abundance at the i-th locus of the j-th individual, $y_{i+}$ the total SNP abundance at the i-th locus across the N individuals, $y_{+j}$ the total SNP abundance of the j-th individual, $y_{++}$ the total SNP abundance across the N individuals, $p_{ij}$ the relative SNP frequency at the i-th locus of the j-th individual, and $w_j$ the weight of the j-th individual, so that
$$p_{ij} = \frac{y_{ij}}{y_{+j}}, \qquad w_j = \frac{y_{+j}}{y_{++}}, \qquad p_i = \sum_{j=1}^{N} w_j\, p_{ij} = \frac{y_{i+}}{y_{++}} \qquad (6)$$
We then obtain the following formulae for computing the SNP gamma-diversity of the N-population:
$$^qD_\gamma = \left[\sum_{i=1}^{G} \left(\frac{y_{i+}}{y_{++}}\right)^{q}\right]^{1/(1-q)}, \quad q \neq 1 \qquad (7)$$
$$^1D_\gamma = \exp\left[-\sum_{i=1}^{G} \frac{y_{i+}}{y_{++}} \ln \frac{y_{i+}}{y_{++}}\right] \qquad (8)$$
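A minimal Python sketch of the gamma-diversity computation follows, assuming the SNP data are arranged as a loci-by-individuals matrix of SNP abundances y[i][j] and that the pooled frequency at locus i is the locus total divided by the grand total. The function name and the example matrix are hypothetical.

```python
import math

def gamma_diversity(y, q):
    """SNP gamma-diversity of order q for a loci x individuals count matrix y."""
    grand_total = sum(sum(row) for row in y)
    # Pooled relative frequency of each locus across all individuals.
    pooled = [sum(row) / grand_total for row in y if sum(row) > 0]
    if abs(q - 1.0) < 1e-12:
        return math.exp(-sum(p * math.log(p) for p in pooled))
    return sum(p ** q for p in pooled) ** (1.0 / (1.0 - q))

# Hypothetical counts: 3 loci (rows) in 2 individuals (columns).
y = [
    [2, 0],
    [1, 1],
    [0, 2],
]
for q in (0, 1, 2):
    print(q, round(gamma_diversity(y, q), 4))
```

In this example every locus holds one third of the pooled SNPs, so the gamma-diversity is 3 at every order.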

SNP beta diversity.
In community ecology, there are two schemes for defining beta-diversity: the additive partition and the multiplicative partition of gamma-diversity into presumably independent alpha-diversity and beta-diversity. Assuming $^qD_\alpha$ and $^qD_\gamma$ are the alpha- and gamma-diversity measured with the Hill numbers, respectively, beta-diversity is defined as:
$$^qD_\beta = \frac{^qD_\gamma}{^qD_\alpha} \qquad (9)$$
We adopt exactly the same multiplicative partition of the Hill numbers used in species diversity for measuring SNP beta-diversity, except that both alpha- and gamma-diversities are computed with SNP frequencies (relative abundances) rather than with species abundances.
This SNP beta-diversity ($^qD_\beta$) derived from the above multiplicative partition takes the value of 1 if all individuals are identical, and the value of N (the number of individuals in the population) when all individuals are completely different from each other (i.e., no shared SNPs).
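Although Eq. (2) correctly defines the SNP alpha-diversity, it requires some adaptation before it can be used in the partition of gamma-diversity to obtain beta-diversity with Eq. (9). Similar to the derivation for species alpha-diversity demonstrated in Chiu et al. 27 , we can derive the following formulae for SNP alpha-diversity in the N-population setting, i.e.,
$$^qD_\alpha = \frac{1}{N}\left[\sum_{i=1}^{G}\sum_{j=1}^{N} \left(\frac{y_{ij}}{y_{++}}\right)^{q}\right]^{1/(1-q)}, \quad q \neq 1 \qquad (10)$$
$$^1D_\alpha = \exp\left[-\sum_{i=1}^{G}\sum_{j=1}^{N} \frac{y_{ij}}{y_{++}} \ln \frac{y_{ij}}{y_{++}} - \ln N\right] \qquad (11)$$
The computation of SNP beta-diversity can then be accomplished with Eqs. (7)-(11), i.e., Eqs. (7) and (8) for gamma-diversity, Eq. (9) for beta-diversity, and Eqs. (10) and (11) for alpha-diversity.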
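The multiplicative partition can be sketched as follows. The snippet assumes the Chiu et al. style formulation in which gamma-diversity uses the pooled per-locus totals, alpha-diversity uses every locus-by-individual cell divided by the grand total and is then divided by N, and beta-diversity is their ratio. All names and example matrices are hypothetical.

```python
import math

def _hill(weights, q):
    """Hill-type aggregation of positive weights (they need not sum to 1)."""
    w = [x for x in weights if x > 0]
    if abs(q - 1.0) < 1e-12:
        return math.exp(-sum(x * math.log(x) for x in w))
    return sum(x ** q for x in w) ** (1.0 / (1.0 - q))

def gamma_alpha_beta(y, q):
    """Return (gamma, alpha, beta) for a loci x individuals count matrix y."""
    n = len(y[0])                                 # number of individuals N
    total = sum(sum(row) for row in y)            # grand total of SNPs
    gamma = _hill([sum(row) / total for row in y], q)
    # Dividing by N covers both the q != 1 form and the q = 1 limit.
    alpha = _hill([v / total for row in y for v in row], q) / n
    return gamma, alpha, gamma / alpha

# Two hypothetical individuals with no shared loci: beta should equal N = 2.
disjoint = [[3, 0], [1, 0], [0, 2], [0, 2]]
# Two hypothetical identical individuals: beta should equal 1.
identical = [[3, 3], [1, 1]]

for q in (0, 1, 2):
    print(q, round(gamma_alpha_beta(disjoint, q)[2], 4),
          round(gamma_alpha_beta(identical, q)[2], 4))
```

The two toy matrices reproduce the boundary values of beta-diversity: N for completely different individuals and 1 for identical ones.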
We define the series of Hill numbers for SNP diversity at different diversity orders q = 0, 1, 2, … as the SNP diversity profile, that is, a series of Hill numbers that weight the SNP frequency distribution with increasing non-linearity as q increases.

The Definitions for SNP Similarities
Similar to the previous definition of SNP diversity based on the Hill numbers, we can also define Hill-numbers-based similarity measures for comparing SNP similarities. We adopt the same mathematical formulae previously used for defining the community similarity measures (profiles) by Chao et al. and Chiu et al. 9,10,27 . In community ecology, Chiu et al. 27 showed that the four existing similarity measures, Jaccard, Sørensen, Horn, and Morisita-Horn, are actually functions of the beta-diversity ($^qD_\beta$) measured in the Hill numbers.
(2020) 10:5805 | https://doi.org/10.1038/s41598-020-62362-8
Similar to the previously defined diversity profile, the four similarity indexes we define below form an SNP similarity profile. In the following, we define the four similarity measures in the context of N-populations of individuals. A major benefit of using these similarity measures, rather than the beta-diversity directly, is that the similarity indexes are 'normalized' to the range of [0, 1] by the number of individuals (N). If the beta-diversity is used directly to compare similarity, the beta-diversity of an N-population ranges from 1 to N, which makes comparisons dependent on the number of individuals (N).

Local SNP overlap ($C_{qN}$).
The local SNP overlap measure ($C_{qN}$) quantifies the effective average proportion of SNPs that are shared across all N individuals:
$$C_{qN} = \frac{\left(1/{}^qD_\beta\right)^{q-1} - (1/N)^{q-1}}{1 - (1/N)^{q-1}} \qquad (12)$$
where $^qD_\beta$ is the SNP beta-diversity at order q computed with Eq. (9), and N is the number of individuals in the population. When q = 0, $C_{qN}$ is the Sørensen similarity index; when q = 1, $C_{qN}$ is the Horn similarity index; when q = 2, $C_{qN}$ is the Morisita-Horn similarity index.
Regional SNP overlap ($U_{qN}$).
The regional SNP overlap measure ($U_{qN}$) quantifies the effective proportion of shared SNPs in the pooled N-population:
$$U_{qN} = \frac{\left(1/{}^qD_\beta\right)^{1-q} - (1/N)^{1-q}}{1 - (1/N)^{1-q}} \qquad (13)$$
When q = 0, this statistic is equivalent to the Jaccard similarity measure; when q = 2, it is equivalent to the Morisita-Horn similarity index.
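The two overlap measures can be sketched in Python as functions of the beta-diversity, the number of individuals N, and the order q. The formulas used here are the Chiu et al. forms as understood by this sketch, with the q = 1 case handled by the limit 1 - ln(beta)/ln(N); this limit and the function names are assumptions of the example rather than statements from the text above.

```python
import math

def local_overlap(beta, n, q):
    """Local overlap C_qN computed from beta-diversity of order q."""
    if abs(q - 1.0) < 1e-12:
        return 1.0 - math.log(beta) / math.log(n)   # assumed q -> 1 limit
    e = q - 1.0
    return ((1.0 / beta) ** e - (1.0 / n) ** e) / (1.0 - (1.0 / n) ** e)

def regional_overlap(beta, n, q):
    """Regional overlap U_qN computed from beta-diversity of order q."""
    if abs(q - 1.0) < 1e-12:
        return 1.0 - math.log(beta) / math.log(n)   # assumed q -> 1 limit
    e = 1.0 - q
    return ((1.0 / beta) ** e - (1.0 / n) ** e) / (1.0 - (1.0 / n) ** e)

# Hypothetical beta-diversity of 1.5 between N = 2 individuals.
for q in (0, 1, 2):
    print(q, round(local_overlap(1.5, 2, q), 4),
          round(regional_overlap(1.5, 2, q), 4))
```

Both measures map beta = 1 (identical individuals) to 1 and beta = N (no shared SNPs) to 0, which is the normalization to [0, 1] described above.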

Demonstration with 1000-Genomes Project
The datasets for the demonstration.
We used the SNP datasets obtained from the whole-genome sequencing data of the 1000-Genomes Project 3,22 . Through a series of bioinformatics analyses, the list of all loci with SNP mutations and the number of loci with SNP mutations on each chromosome were obtained from the raw sequence reads. Detailed information on the sequencing and bioinformatics procedures used to obtain the SNP datasets from whole-genome sequencing of the DNA samples can be found in the 1000-Genomes Project publications 3,22 . A total of 2504 individuals were sampled from 5 populations: Africa (AFR), Americas (AMR), Europe (EUR), East Asia (EAS) and South Asia (SAS). The project characterized over 88 million variants in total (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes 3,22 . The R code for computing the alpha-diversity and beta-diversity (including similarity) profiles is provided in the online supplementary information (OSI). Table S1 in the OSI (Excel file) lists the SNP alpha-diversity on each chromosome of each individual from each population (ethnic group) in the 1000-Genomes Project, i.e., the basic SNP alpha-diversity of every chromosome of every individual at each diversity order q = 0-4, computed with the formulae for alpha-diversity given above. Table S2 summarizes the average SNP diversity (per individual within a population or ethnic group) on each chromosome of each population from Table S1, i.e., averaged across all individuals within a population. The top section of Table 1 is excerpted from Table S2 in the OSI to facilitate illustration. The SNP diversity profile (the Hill numbers at different diversity orders) offers an effective tool for assessing the mutation profiles of different chromosomes, of different individuals in a population, or of different populations of a species.
Figure 2 illustrates the average SNP diversity on each chromosome for each population at diversity order q = 0. The graphs for diversity orders q = 1-4 are included in Figs. S1-S4 of the OSI. We further compared the SNP alpha-diversity at the chromosome level among the five populations with Wilcoxon tests (Table S3). The bottom section of Table 1 excerpts the summary test results from Table S3 in the OSI. Extensive differences (58.3-100%) exist among the five populations, and the level of difference depends on population and diversity order (see the bottom section of Table 1, and Table S3 for details). This demonstrates the power of the Hill numbers in detecting SNP variability at the chromosome level among different populations (ethnic groups).
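For readers who want to see what such a between-population comparison involves, here is a minimal Python sketch of a two-sided Wilcoxon rank-sum test using a normal approximation. It is an assumption of this sketch that the unpaired rank-sum variant is the relevant one; the approximation omits tie and continuity corrections, so p-values can differ slightly from those of standard statistical packages. The data are hypothetical.

```python
import math
from statistics import NormalDist

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def rank_sum_p_value(x, y):
    """Two-sided rank-sum p-value (normal approximation, no corrections)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2      # Mann-Whitney U of sample x
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    return 2 * NormalDist().cdf(-abs(z))

# Hypothetical alpha-diversity values for two populations on one chromosome.
pop_a = [1.0, 2.0, 3.0, 4.0, 5.0]
pop_b = [6.0, 7.0, 8.0, 9.0, 10.0]
p_separated = rank_sum_p_value(pop_a, pop_b)
p_interleaved = rank_sum_p_value([1, 3, 5, 7], [2, 4, 6, 8])
print(p_separated < 0.05, p_interleaved < 0.05)
```

Repeating such a test for every pair of populations, chromosome, and diversity order, and counting the fraction of comparisons below a chosen significance threshold, would give percentages of the kind summarized above.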

Demonstrations of the SNP Alpha-Diversity.
Chromosome level SNP alpha-diversity profile.
The previous results on SNP alpha-diversity at the chromosome level have at least three implications. First, they provide a series of diversity metrics (i.e., the diversity profile) that characterize the mutation profile of an individual's specific chromosome in comparison with the reference genome. This chromosome-level diversity profile is both individual-specific and chromosome-specific. If time-series data of the diversity profile are available for an individual (e.g., medical records including periodic sequencing of the individual's genome), then the dynamics of the diversity profile can provide potentially valuable information for personalized disease-risk assessment and prediction. Second, the diversity profile can also be applied to compare the variation patterns of SNPs between two populations (cohorts), as demonstrated previously. Third, the 'unit' for measuring diversity can be something other than a chromosome, for example, a segment of a chromosome, or even a gene cluster with specific function(s) (e.g., associated with specific diseases).
Genome level SNP alpha-diversity profile.
Table S4 in the OSI (Excel file) lists the SNP alpha-diversity of each individual (i.e., at the whole-genome level, computed by combining the SNPs from all chromosomes of an individual's genome) from each population (ethnic group) in the 1000-Genomes Project. The top section of Table 2 summarizes the average SNP alpha-diversity (per individual) for each population from Table S4 in the OSI. Figure 3 illustrates the average SNP alpha-diversity at the genome level for each population, for each diversity order from q = 0 to 4. We further compared the SNP alpha-diversity at the whole-genome level among the five populations with Wilcoxon tests (Table S5). The bottom section of Table 2 (summarized from Table S5) shows that extensive differences (70-90%) exist among the five populations, and the level of variability depends on population and/or diversity order. At lower diversity orders the percentages were higher (90% for q = 0 and 80% for q = 1), and at higher diversity orders they were lower (70% for q = 2-4). This result demonstrates the power of the Hill numbers in measuring SNP diversity and discerning variability at the whole-genome level. The genome-level SNP alpha-diversity profile has implications similar to those of the chromosome-level profile. For example, the dynamics of genome-level diversity profiles may offer personalized medicine insights for individuals, as well as epidemiological insights when multiple cohorts are compared. In addition, it offers a simple but powerful approach to comparing the mutation patterns of genomes from different individuals or from different ethnic groups.

Table 1. … Table S2, which, in turn, was summarized from Table S1 for the alpha-diversity of each individual on each chromosome in the 1000-Genomes Project with five ethnic groups: African (AFR), American (AMR), European (EUR), East Asian (EAS) and South Asian (SAS). *Summarized from Table S3: the p-values from Wilcoxon tests for the SNP alpha-diversity between different ethnic groups (populations).

Table 2. The mean SNP alpha-diversity at the genome level (including all chromosomes of an individual), averaged across all individuals in the same population (summarized from Table S4 for the alpha-diversity at genome level in the 1000-Genomes Project). Bottom-section row "Populations" (%): 90, 80, 70, 70, 70. *Summarized from …

Demonstrations of the SNP Beta-Diversity and similarities.
Chromosome level SNP beta-diversity profile.
We demonstrate the computation of SNP beta-diversity with a slightly different scheme from that used for SNP alpha-diversity. That is, we compute the pair-wise SNP beta-diversity and similarity for the same (numbered) chromosome between any two individuals in the 1000-Genomes cohort.
To reduce the computational load while still obtaining representative results, we randomly sampled 100 individuals from each of the five populations, and the SNP data of the resulting 500 individuals were used to compute the pair-wise SNP beta-diversity. We averaged the SNP beta-diversity and similarity values over all sampled pairs (a total of 10,000 pairs for each chromosome for each pair of populations), and report the mean beta-diversity and similarity on each chromosome for each pair of populations (Table S6). Interestingly, the sex chromosomes exhibited the highest beta-diversity values, or lowest similarity values, between different populations. Since beta-diversity is defined and computed in a pair-wise manner, further statistical significance tests comparing the pairs would have rather limited biomedical meaning and were omitted.
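The sampling scheme described above can be sketched in Python as follows: draw a fixed number of individuals from each of two populations, compute the beta-diversity for every cross-population pair, and average. The sketch assumes each individual is represented by a vector of per-locus SNP counts for the same chromosome, and it uses the N = 2 case of the multiplicative partition. Function names, the toy data, and the seed are hypothetical.

```python
import itertools
import math
import random

def pairwise_beta(a, b, q):
    """Beta-diversity (gamma / alpha) of two individuals from per-locus counts."""
    total = sum(a) + sum(b)
    pooled = [(x + y) / total for x, y in zip(a, b) if x + y > 0]
    cells = [v / total for v in itertools.chain(a, b) if v > 0]
    if abs(q - 1.0) < 1e-12:
        gamma = math.exp(-sum(p * math.log(p) for p in pooled))
        alpha = math.exp(-sum(c * math.log(c) for c in cells) - math.log(2))
    else:
        gamma = sum(p ** q for p in pooled) ** (1.0 / (1.0 - q))
        alpha = sum(c ** q for c in cells) ** (1.0 / (1.0 - q)) / 2
    return gamma / alpha

def mean_pairwise_beta(pop_a, pop_b, q, sample_size, seed=0):
    """Average beta-diversity over all pairs of sampled individuals."""
    rng = random.Random(seed)
    sample_a = rng.sample(pop_a, sample_size)
    sample_b = rng.sample(pop_b, sample_size)
    betas = [pairwise_beta(a, b, q)
             for a, b in itertools.product(sample_a, sample_b)]
    return sum(betas) / len(betas), len(betas)

# Toy populations: three individuals each, four loci per individual.
pop_a = [[2, 1, 0, 1]] * 3
pop_b = [[0, 1, 2, 1]] * 3
mean_beta, n_pairs = mean_pairwise_beta(pop_a, pop_b, q=0, sample_size=2)
print(round(mean_beta, 4), n_pairs)
```

With a sample size of 100 per population, the same cross product yields the 10,000 pairs mentioned in the text.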
Genome level SNP beta-diversity profile.
Similar to the genome-level SNP alpha-diversity, we also computed the genome-level SNP beta-diversity. We again randomly sampled 100 individuals from each of the five populations, and pooled the SNPs from all chromosomes of each individual to compute the pair-wise genome-level beta-diversity between two individuals from two respective populations. A total of 10,000 pair-wise beta-diversity values were computed for each pair of populations (e.g., AFR vs. AMR), and the average of the 10,000 values is displayed in Table 3. These values of beta-diversity and similarity are as expected, e.g., the beta-diversity of two individuals ranges between 1 and 2, and all the similarity values are normalized between 0 and 1. As with the chromosome-level beta-diversity, statistical tests comparing the differences among populations were omitted because of their limited biomedical meaning.
The differences between genetic (SNP) alpha-diversity and beta-diversity are similar to those in community ecology. The latter provides a means to quantify the differences between two or more individuals at the chromosome, genome, or even population level. The similarity profile is simply a more convenient recasting of beta-diversity for comparing different entities (chromosomes, genomes, or populations).
Summary.
SNPs may occur in the coding sequences of genes, in the non-coding regions of genes, or in inter-genic regions. Accordingly, the SNP diversity defined in this article can be applied separately to the three types of SNP occurrence regions. For demonstration purposes, we did not distinguish the three types in this article, but all the definitions and computational procedures presented in previous sections can be applied directly to measure the SNP diversities of each type separately. The only, and minor, difference would be in the data preparation step, i.e., the preparatory calculation of $p_i$ according to the region chosen: coding, non-coding, inter-genic, or the whole locus.
We demonstrated the SNP alpha-diversity with a single chromosome and with the whole genome as the basic genetic entity, corresponding to the chromosome-level and genome-level SNP alpha-diversity, respectively. For beta-diversity, we computed the pair-wise SNP beta-diversity between two individuals, at the chromosome level (for the same-numbered chromosomes of the two individuals) and at the genome level. In fact, SNP beta-diversity may be computed for multiple (N) individuals, as defined previously. Besides defining and demonstrating SNP diversities, we also defined and demonstrated four similarity measures, all of which are functions of beta-diversity. Unlike beta-diversity, the similarity measures are normalized to [0, 1], and their ranges are not influenced by the number of entities compared.
As argued previously, defining diversity requires two essential elements: the variety and the variability of varieties (Gaston, Chao et al.) 10,25 . In the individual-level genetic diversity defined in this study, the variety can be SNP, deletion, duplication, inversion, insertion, translocation, or other mutational types. The calculation of the variability of varieties is confined to an individual entity, demonstrated in this article with an individual chromosome or genome, but the entity can also be a region of a chromosome or a group of loci that is of particular interest to investigators. To calculate the variability of varieties, a reference genome is usually required, but the calculation does not require a population with more than two individuals. The latter is usually necessary for most existing definitions of genetic diversity, which might be termed population-level genetic diversity to emphasize the distinction. We believe both types of genetic diversity have their own application domains and can complement each other.
The idea of using Renyi's entropy 16 for measuring ecological diversity was proposed nearly half a century ago by Hill (1973), but his proposal received little attention until about a decade ago, when a handful of ecologists (including Chao, Ellison, Jost, etc.) reintroduced the Hill numbers and achieved wide success in community ecology 9-12,27,30 , demonstrating the effectiveness and advantages of Hill numbers in assessing and interpreting ecological diversity. Recently, Gaggiotti et al. 31 developed a unifying framework for measuring biodiversity from genes to ecosystems by standardizing on the Hill number at diversity order q = 1, which is a transformation of the Shannon diversity index. Their simplification is necessary to develop a more generalized framework, but it does not diminish the novelty of our work here. This is because, at a specific level (the genome level of an individual), Hill numbers at different orders (q = 0, 1, 2, …) are still necessary to present a comprehensive diversity profile, owing to the complexity of the issues involved, as demonstrated in previous sections. Furthermore, as elaborated and demonstrated previously, a unique aspect of the present use of Hill numbers for measuring genetic diversity is that our definitions are at the individual level rather than at the population level. To the best of our knowledge, this study introduces the first concept and definitions for genetic diversity at the individual level. The proposed concept and definitions should find important new applications in fields such as personalized precision medicine, since they can be readily applied to monitor changes in individual-level mutations. In addition, the concept and metrics should find novel applications in population genomics, because the individual-level genomic metrics provide solid basic units for population-level analysis, which we will demonstrate in a follow-up study.

Data availability
The SNP datasets from the "1000-Genomes Project" used in this study are publicly available: https://www.internationalgenome.org.

Table 3. The means of pair-wise genome-level SNP beta-diversity and similarity measures between any two individuals from their respective populations.