Hardy-Weinberg equilibrium revisited for inferences on genotypes featuring allele and copy-number variations

Copy number variations represent a substantial source of genetic variation and are associated with a plethora of physiological and pathophysiological conditions. Joint copy number and allelic variations (CNAVs) are difficult to analyze and require new strategies to unravel the properties of genotype distributions. We developed a Bayesian hidden Markov model (HMM) approach that allows dissecting intrinsic properties and metastructures of the distribution of CNAVs within populations, in particular haplotype phases of genes with varying copy numbers. As a key feature, this approach incorporates an extension of the Hardy-Weinberg equilibrium, allowing both a comprehensive and parsimonious model design. We demonstrate the quality of performance and applicability of the HMM approach with a real data set describing the Fcγ receptor (FcγR) gene region. Our concept, using a dynamic process to analyze a static distribution, establishes the basis for a novel understanding of complex genomic data sets.


Introduction to Hidden Markov Models (HMMs)
Hidden Markov models describe stochastic sequences in a comprehensive way. These models are widely used for BLAST sequence comparison and inference of haplotypes in genetic association studies, e.g., by HaploView 30 . We derive these HMMs to represent not only haplotypes with allelic variation but also those with copy-number variation.
Our approach allows the researcher to design the Markov model graph for which transition probabilities are inferred to match genotype data that are determined e.g., by MLPA.
An HMM λ is defined as a quintuple λ = (S, V, A, B, π), containing the following components: • the set of hidden states S = {s 1 , . . . , s n }; • the alphabet of emissions V = {v 1 , . . . , v m }, v 1 , . . . , v m ∈ G; • the transition probability matrix A = (a ij ) ∈ R n×n , with a ij being the probability of a transition from state s i to state s j ; • the probabilities of observed emissions B = (b ij ) ∈ R n×m , with b ij being the probability in state s i for an emission v j . By convention, for each i, the probability is 1 for exactly one b ij ; a hidden state is defined by the emission of the neutral element O; • and the starting probabilities π ∈ R n . By convention, the starting state of an HMM is always s 1 (source) and the final state is always s n (sink). Thus, for initial state probabilities, we set π = (1, 0, 0, . . . , 0). Each Markov path X = (x t ) t=1,...,x T is a sequence of T subsequent states, with x t ∈ S, emits a series of events Y = (y t ) t=1,...,x T , with y t ∈ V .
Let X be the set of all possible Markov paths through a given HMM λ.
A function f that maps a single Markov path X ∈ X to its associated haplotype H ∈ G: f : X → G, X → H = f (X) = T t=1 y t = y 1 + y 2 + . . . + y T Because the emission probabilities are all 1 by convention, the term hidden here applies to the fact that the exact sequence of states, i.e., the Markov path X, and emissions Y is hidden, and only the result of the association function f (X) is visible. Moreover, this function is not injective, allowing one haplotype H to be associated with multiple Markov paths. This possibility has interesting consequences for the distribution of genotypes, which are depicted in Suppl. Tabs. 2 and 3.
A complete genotype G consists of two haplotypes H 1 and H 2 . Thus, we associate a complete genotype with a pair (X 1 , X 2 ) ∈ X 2 = X × X of Markov paths: The current implementation of our approach uses the above definition of HMMs. The two haplotypes are assumed to be independent and are treated independently. Nevertheless, this definition may be developed to reflect dependency structures between haplotypes. However, we decided not to extend the above definition here because this would dilute the focus of the current technical report.

Probability of Markov paths
For a given Markov path X = (x t ) t=1,...,T of length T and defined starting state s 1 , the probability Prob(X|A) given the transition probability matrix A can be calculated as As the two Markov paths in a pair are independent, the probability of the pair (X 1 , X 2 ) is For Bayesian inference of transition matrices, we further seek to identify a matrix representation of Markov paths that can be handled easily. For this purpose, we define a counting matrix as the sufficient statistic C(X) for any Markov path X: Definition 1. The counting matrix C = (c ij ) = C(X) summarizes a Markov path X, with c ij being the count of one-step transitions from s i to s j .
Given a constant A, and Prob(x t = s j | x t−1 = s i , A) =: a ij , we can use the associative property and re-write eqn. 4 as Using exponentiation rules, the counting matrices of multiple independent Markov paths X 1 , X 2 , . . .
can be summarized into one counting matrix C = C(X 1 ) + C(X 2 ) + . . .. The motivation for this cumulative counting matrix is described below under Bayesian inference of the transition matrix.

Likelihood function for single genotypes
Let X 2 = X × X be the set of all possible Markov path pairs; then, the set X 2 G ⊆ X 2 is the subset of all Markov path pairs that would produce a genotype G as the sum of combined emissions.
as the association function defined in eq. 3.
The likelihood function Prob(G|A) for a single genotype G is given as the sum of probabilities of all 3 alternative Markov paths pairs (X 1 , X 2 ) ∈ X 2 : Depending on the Markov model design, X 2 G may become so large that this likelihood function becomes computationally intractable. Implementing an expectation maximization algorithm is not practically feasible. Therefore, and because it provides probability distributions for model parameters, we decided to use data-augmented Metropolis-within-Gibbs sampling.

Definition of a conjugate prior for the transition probability matrix
The presented algorithm for Bayesian inference of HMMs takes advantage of the possibility to define an exact distribution function for the transition probability matrix. This functionality allows for the performance of Gibbs sampling of the transition probability matrix A.
Given counts c of categorical events, which are distributed multinomially, Jeffrey's prior, a Dirichlet distribution Dirichlet(α) with vector parameter α = (0.5), is typically used to calculate the exact posterior distribution of associated categorical probabilities. In the present approach, a Dirichlet distribution is used to calculate the exact posterior distribution of probabilities for one-step transitions from one state of the Markov model to the subsequent states. Thus, we can combine the probability distribution for each state into a joint distribution of independent Dirichlet distributions.
Here, we define such a distribution as MatrixDirichlet, for which we use the matrix parameter Ξ (Def. 3). Each row vector of Ξ is the parameter vector used for one of the Dirichlet distributions.
In this way, the set of valid Markov model transitions is encoded by the distribution parameter Ξ.
Due to the Markov property, the probability distributions for separate states are independent from each other. Thus, we can define the probability density function of the complete transition probability matrix as the product of the probability density functions of each row. However, a standard Dirichlet distribution would not be defined by a parameter vector containing zeros. We hence re-define the Dirichlet density functions for each row A i of the transition probability matrix A (Def. 4).
Together, we define the probability density function for the MatrixDirichlet distribution as Def. 5.

with Ξ being the matrix parameter and
A a transition probability matrix.
This approach allows to specify the user input λ 0 for the HMM inference as in Def. 6.
Definition 6. The user input for HMM inference λ 0 = (S, V, Ξ, B, π), with Ξ as the matrix parameter for the conjugate MatrixDirichlet distribution.
Let C be the summarized counting matrix representing the family of Markov paths pair realizations compatible with a genotype data set F = (G i ) i=1,...,N of N individuals. Furthermore, let Ξ (0) be a matrix prior parameter for the MatrixDirichlet distribution of the transition probability matrix A.
Then, the posterior distribution Prob(A|C) can be determined by eqn. 13. This equation is equivalent to an update of the row-wise Dirichlet distributions.

Details on sampling of Markov paths
For the sampling of Markov paths with a given transition probability matrix A, we combine two proposal algorithms: a Gibbs-like sampling approach and a random-walk sampler (squirrel algorithm), which is based upon a recursive tree search. Both proposal algorithms have specific strengths and weaknesses that crucially depend on A. In both algorithms, Markov path pairs for individual genotypes G i in the data set F are sampled separately.
As prior distribution for Markov path pairs, we apply a uniform discrete distribution: with X 2 Gi as the set of Markov path pairs (X 1 , X 2 ) that are compatible with genotype G i .

Sampling and filtering of random Markov paths -drop-out or incomplete Gibbs sampling
Given a transition probability matrix A, the Markov path pair (X 1 , X 2 ) associated with each genotype G i in the observations can be simulated by a modified Gibbs sampling, which we denote here as incomplete Gibbs sampling. The idea is to produce (X 1 ,X 2 ) pairs of Markov paths by walking the Markov model repeatedly in a random fashion according to the transition probability matrix A until For practical purposes, this process is stopped after a fixed number n of repetitions, and an identity kernel is used for the proposal: (X 1 ,X 2 ) = (X 1 , X 2 ). By this mechanism, the general idea of Gibbs sampling is not sacrificed: every proposal including the identity proposal is accepted.
If the transitions probability matrix is nearly optimal, it will produce such a Markov path pair with acceptable efficiency. Otherwise, the algorithm might require an unacceptably large number of random samples to provide enough new proposals. To overcome this issue, this algorithm is combined with the below-described squirrel algorithm.
Balanced recursive tree search (squirrel algorithm) Whereas the above-described approach is very effective when transition probabilities are nearly optimal, it may fail to compute truly new proposals instead of identity proposals for all genotypes in the observations, as the number of random samples is restricted for practical reasons. This implementation is based on the property of pseudo-random number generators used to always deliver the same series of random numbers given the same random seed r 0 . The random search then tries to find a new pair of Markov paths (X 1 ,X 2 ) ∈ X 2 G that produce a genotype G (see Def. 2). Let r i be the ith random number that governs the order by which the possible succeeding states of x i are tested.
For implementation, the pair of Markov paths is concatenated, with the final state of X 1 followed by the first state of X 2 .
The elements (X 1 , X 2 ) are then ordered by the relation R r0 , as defined by Def. 7.
Definition 7. R r0 ⊂ X 2 G × X 2 G is an order of X 2 G , which is completely determined by the random seed r 0 .
We now define a Markov kernel K r0 ((X 1 , X 2 ), A) with A = {S (-1) ((X 1 , X 2 )), S (+1) ((X 1 , X 2 ))} for a given order relation R r0 . If with random variables Z t , Z t+1 ∈ X 2 G , this is in fact a palindromic kernel fulfilling the condition of reversibility because of the relation By choosing the random seed r 0 from a completely arbitrary set, this process defines a state-dependent mixture kernel 31 preserving complete reversibility. Using this kernel and a flat prior distribution for 6 (X 1 , X 2 ), we can calculate the acceptance probability p A as follows: Markov chain mixing with this algorithm can be further improved by randomly choosing an n, n ∼ discrete unif orm(1, N ), with n, N ∈ N and proposing the nth predecessor S (-n) = S (-1) (S (-(n-1)) ) or successor S (-n) = S (+1) (S (+(n-1)) ), with S (-1) and S (+1) as defined above.
The squirrel algorithm is unbiased, but it suffers from the combinatoric complexity of Markov paths.
This algorithm therefore may require a large number of cycles until a stationary distribution is achieved.
However, it always delivers a proposal for the Markov chain, a property that makes this algorithm ideal for combination with the incomplete Gibbs sampling algorithm described above.

Determination of the marginal likelihood of an HMM
To calculate the marginal likelihood Prob(F|λ 0 ) of an HMM setting λ 0 (compare 6), we use the method described by Chib and Jeliazkov a two-block Gibbs sampler 18 . Given the observed genotype data F = G 1 , . . . , G N of N individuals, with G i ∈ G, we choose a transition probability matrix A, preferably at high posterior density. The prior density Prob(A) is given by The posterior density Prob(A | F) is estimated as previously described 18 : C (1) , . . . , C (U ) is the sampled series of counting matrices (compare Bayesian inference of the transition matrix ). Because of the computational intractability of the likelihood function, we estimate its value using an approximation by simulating a large number n of Markov path pairs, resulting in the simulated genotype data set Υ * , with n = |Υ * |.
Let G F be the set of different genotypes of the data set: with |G F | being the cardinality of G F Then, taking 0.5 as initial pseudo-counts for each genotype category (compare Jeffrey's prior), the 7 probability Prob(G) of each genotype G ∈ G F in the observed data set is estimated as with |Υ * | → ∞.
The likelihood function f (F|A) is a probability mass function of the corresponding multinomial distribution (eq. 31), with x i being the number of occurrences of genotype G i in the observed data F, A value 1 is added here to the cardinality of G F to represent a category of elements G ∈ Υ * \F, that are found in the simulated data set but not in the observations. Then, the marginal likelihood is calculated as described 18 :

Determination of marginal likelihood of an NEM
The marginal likelihood of each HMM setting is routinely compared to the marginal likelihood of a NEM M (naive) . The posterior distribution of probabilities p i = Prob(G i |G) of each genotype G i can be described analytically using a conjugate Dirichlet prior (eq. 33).
The marginal likelihood Prob(G | M (naive) ) is then determined as in eq. 35. Suppl. Fig. 1 -Analysis of sampler output for basic HMM for FcγR IIIA. a, state numbering scheme of the HMM graph depicted in Fig. 3c. b, trace of non-normalized density values of 10 separate sampler runs. c, autocorrelation plot of a single sampler run for each varying transition probability. d, trace plot of transition probabilities from 10 separate sampler runs. The plots are oriented according to the transition matrix, with each row indicating the first state and each column the second state of a one-step transition. e, corresponding density plots of transition probabilities using a Gaussian kernel estimator for each marginalized parameter. Suppl. Fig. 2 -Analysis of sampler output for basic HMM for FcγR IIc. a, state numbering scheme of the HMM graph depicted in Fig. 3d. b, trace of non-normalized density values of 10 separate sampler runs. c, autocorrelation plot of a single sampler run for each varying transition probability. d, trace plot of transition probabilities from 10 separate sampler runs. The plots are oriented according to the transition matrix, with each row indicating the first state and each column the second state of a one-step transition. e, corresponding density plots of transition probabilities using a Gaussian kernel estimator for each marginalized parameter.
Suppl. Fig. 3 -HMM models for FcγR IIIa and IIc to represent loss of heterozygosity. For the corresponding basic models, compare Fig. 3c and d. a, HMM for FcγR IIIa including repetition loops that are dependent on the allele, reproducing the loss of heterozygosity signature predicted for the microhomology-mediated break-induced repair mechanism (compare Fig. 2d). The Bayes factor for this model was 1 20.4 compared to the basic model shown in Fig. 3c. b, corresponding HMM for FcγR IIc. The Bayes factor for this model was 1 11 compared to the basic model shown in Fig. 3d. The graphical representation is as described in Fig. 3. Transitions from 1 to 2 from 6 to 2 from 2 to 3 from 3 to 3 from 4 to 3 from 5 to 3 from 2 to 4 from 3 to 4 from 4 to 4 from 5 to 4 from 2 to 5 from 3 to 5 from 4 to 5 from 5 to 5 from 3 to 6 from 4 to 6 from 5 to 6 from 1 to 7 from 6 to 7  a Probabilities are equivalent to corresponding transition probabilities of the basic CNAV-reproducing HMM. p, probability of allele A; q, probability of allele B; r, probability of copy number gain; s = 1 − r; z, probability of deletion; y = 1 − z; compare Suppl. Fig. 6 b Example probabilities A (p = 0.8), B (q = 0.2), z = 0.1, r = 0.2