Introduction

Deposited in bioinformatics databases is a wide range of available sequence data obtained from different high throughput sequencing tools. This wealth of data, when accompanied by advances in computational methods, has revolutionized the study of genome variation under the new emerging field of systems biology.

More than 99% of human genome is identical among individuals as well as different ethnic groups. In other words, less than 1% of genetic differences are responsible for all of the observed variations among people all over the world1.

Therefore, specifying these differences in genetic material and evaluating the distribution on the DNA sequences of different human populations may have important implications in solving various problems in biology and medicine. In line with this assumption, two leading projects, the international haplotype map (Hapmap)2 and 1,000 genomes3, have been pursued to characterize common patterns of human genetic variations.

Single nucleotide polymorphisms (SNPs) are the most common types of genetic variations in the human genome. SNP refers to the occurrence of different single nucleotides at specific positions in the human genome, which resulted from mutations followed by natural selection during the evolutionary time scale. The possible nucleotides define alleles for that position4. A SNP sequence along each chromosome is known as a haplotype. Both SNPs and haplotypes provide valuable information for assessing genetic variations in a systematic manner. Different research fields, such as disease susceptibility, drug design, and genome-wide association studies (GWASs)5, can greatly benefit from this data.

The distribution of SNPs across genome elements has been investigated by a multitude of studies. These have illustrated that SNPs tend to be clustered across the genome elements in a deterministic manner in which the position of the each mutation is usually affected by its neighbors and the sequences of SNPs are often highly correlated with each other6,7,8. Based on this finding, several studies have proposed models to describe how SNPs clustered along the genome sequence lead to the construction of haplotypes9,10,11. In order to identify genes involved in genetic diseases, massive amounts of SNP and haplotype data were utilized by GWASs to detect highly statistically significant correlations between SNPs on the genetic materials and various numbers of phenotypes12. (https://ghr.nlm.nih.gov/primer/genomicresearch/gwastudies). These are essential for the prediction, diagnosis, prevention or medical therapy of diseases by contextualizing reference big data and provide the basic elements of modern personalized medicine13,14,15,16. Accordingly, identifying haplotypes, particularly when input fragments contain large sections of gaps without enough coverage is critical. In order to handle reading errors, low coverage, and large number of input fragment gaps, several fragment assembly algorithms have been proposed to reconstruct the haplotypes from fragments of homologous human chromosomes from a single individual17,18,19,20,21,22,23,24. The identification of correlation between SNPs is the key challenge of recognizing haplotype sequences. Chaos theory provides a powerful tool for discriminating between random and deterministic processes if a suitable phase space embedding can be found. Indeed, several studies have shown that the underlying information structure can be revealed by chaos theory without the reliance on the respective equations of systems dynamics25. Based on this assumption, Chaos theory has been increasingly applied in Life sciences for understanding the complexity of biological systems26. For example, many attempts have been made to explain the chaotic behavior of biological sequences27,28,29,30,31,32. Moreover, chaotic view point has been applied to evaluate biological signals such as electroencephalogram (EEG) signals33,34,35.

Mapping protein sequences in 2D space with chaos game representation (CGR) has shown that the structural classes of proteins can be distinguished by comparing their chaotic behavior36. CGR is an iterative mapping algorithm which was initially developed by Jeffrey37 for visualizing genomic sequences as chaotic systems38. This method can transform an input one dimensional biological sequence into an intuitive two dimensional picture39.

This study utilized nonlinear chaotic analysis with a surrogate data test and multi-fractal analysis to determine whether haplotypes can be detected as non-random SNP sequences. Also NA12878 dataset was used in binary form containing haplotype sequences of all human chromosomes.

Since SNPs defining haplotypes are highly correlated with each other, several subsequences are extracted from each haplotype sequence and each haplotype is locally evaluated. After the CGR method transformed each subsequence to a line, the corresponding coordinate series was extracted and its chaotic behavior was evaluated by a surrogate test. For more detailed assessment, a multi-fractal spectrum of the sequences was also computed. The resulting data confirmed that haplotype sequences of representative chromosomes originate from a non-stochastic process involving the neighbor effect of its constituents.

In order to test the ability of the proposed model to accurately manipulate haplotype sequences, single individual haplotype (SIH) reconstruction as a complicated task in computational biology was taken into account and the knowledge of chaotic behavior was utilized to improve the rate of haplotype reconstruction. The main concern of SIH is the reconstruction of haplotypes from several input fragments originating from a given sequencing method. As mentioned earlier, sequencing errors and missing information (gaps) are the main challenges in dealing with this problem. Existing methods suffer from huge numbers of gaps as these lead to positions with low coverage and thus low confidence in attempts to identify alleles in such areas4. Here, the current work mapped each haplotype by CGR and extracted a coordinate series in the same way as previously described. The Local Projection (LP) method then locally estimated the trajectory in the neighborhood of each ambiguous point and, finally, each ambiguous point was determined by a projection to the resultant curve. The experimental results revealed that utilizing the knowledge of chaotic behavior can help improve the reconstruction rate and also play a complementary role in the existing methods.

Materials and Methods

In order to provide a comprehensive analysis of biological sequences, the current study applied a five-step rule, as proposed by Chou40, in the following order: (a) provide a valid dataset to evaluate the hypothesis; (b) express biological sequences with appropriate mathematical notations while preserving all of their hidden information; (c) explain the proposed method exactly; (d) evaluate the final results; and (e) provide the source code of implementations.

Materials

The current work’s dataset included the HapMap NA12878 Whole-Genome Sequence (WGS) sample for a European (i.e., CEU) female individual, also known as HG001, which is a well-known reference genome dataset containing haplotype sequences of all human chromosomes41,42. The reference haplotypes were the trio-phased variant calls from the GATK resource bundle43. They were produced by a fosmid-based technology from the HapMap sample NA12878 and filtered in 1,252,769 positions that were also covered by fragments of the NA12878 dataset.

Chaos game representation

CGR is a well-known algorithm which iteratively maps an input sequence into 2D space. This mapping leads to visualization of the input sequence in a picture. Furthermore, this procedure can reveal the hidden patterns of subsequences28.

For sequences with four alphabets, such as DNA, the final picture takes on a square format. Each vertex equals one nucleotide, i.e. A, T, C, and G. The sequence is mapped in the area of the square with a unit length such that each nucleotide base is plotted as a point. The first point is plotted at the center of the square. Next, the first base is placed halfway between the center of the square and the vertex which corresponds to the first base. As seen in the following formula, the coordinate of the ith base (bi) is placed halfway between the (i − 1)th point and the respective vertex (vi).

$${b}_{i}=0.5\times ({b}_{i-1}+{v}_{i})a$$
(1)

The plot is known as the CGR of the input sequence.

Analysis of nonlinear time series

Suppose Xt is a scalar time series where t = 1, 2,…, N. If this time series is observed from a deterministic phenomenon perspective, it can be projected into a low dimensional state space called phase space. If \({Y}_{t}=\{{x}_{t},{x}_{t+\tau },\ldots ,{x}_{t+(m-1)\tau }\}\) is Xt in the phase space, then the phase space can be reconstructed according to Takens’ embedding theorem44. For this purpose, parameters \(\tau \) and m, as the time delay and embedding dimension respectively, should be determined. Dimension m completely demonstrates the object and its topological features. There are several approaches, such as Average Mutual Information (AMI)45 and False Nearest Neighbor (FNN)46, which heuristically estimate phase space parameters based on the available data.

Lyapunov exponent

Sensitive dependence on initial conditions is one of the main properties of chaotic systems. For an m dimensional chaotic system, the Lyapunov exponent (λ) is a spectrum containing m real numbers which quantifies sensitivity to initial conditions. It should be noted that the sign of its largest measure is positive for chaotic systems and its quantity indicates the extent of the chaotic system’s predictability.

Suppose Y(0) and \({Y}_{{\epsilon }}(0)\) are two initial neighbor points in phase space, in which\(\Vert Y(0)-{Y}_{{\epsilon }}(0)\Vert ={\epsilon }\). By the evolution of time, the points are separated and the average of this separation, equaling λmax, is obtained according to the following equation:

$${\lambda }_{max}={\mathrm{lim}}_{t\to \infty }\mathop{\mathrm{lim}}\limits_{{\epsilon }\to 0}\frac{1}{t}\,\mathrm{ln}(\frac{\Vert Y(t)-{Y}_{{\epsilon }}(t)\Vert }{{\epsilon }})$$
(2)

Although calculating λ from experimental data is a difficult task, several methods have been proposed to determine the largest Lyapunov exponent47,48,49,50. In this study, the Eckmann’s method51 was chosen because it is one of the most practical approaches for determining the Lyapunov exponent from the experimental data52. The first step involves mapping Xt as a scalar time series to \({Y}_{t}=\{{x}_{t},{x}_{t+\tau },\ldots ,{x}_{t+(m-1)\tau }\}\) by reconstructing the phase space. Suppose Yj and \({Y}_{j+{\tau }_{2}}\) are two points in the phase space such that \(a{Y}_{j+{\tau }_{2}}\) is the evolution of Yj which has been provided by a rule or map as below:

$$F({Y}_{j})={Y}_{j+{\tau }_{2}}$$
(3)

In the above relation, τ2 is the iteration step which can be selected independently from τ. In the next step, for each point Yj, all of its neighbors is found. Suppose \({Y}_{j}^{r}\) is the rth nearest neighbor of Yj, calculating the Lyapunov exponent involves determining \(\underline{D}\,F({Y}_{j})\) which maps all neighborhoods of vectors \({Y}_{j}^{r}-{Y}_{j}\) to \({Y}_{j+{\tau }_{2}}^{r}-{Y}_{j+{\tau }_{2}}\). It should be noted that \(\underline{D}\,F({Y}_{j})\) is the m × m Jacobian matrix of F at Yj.

The Lyapunov exponent is obtained by calculating the eigenvalues of the matrix (\(\underline{D}\,{F}^{K})^{\prime} \underline{D}\,{F}^{K}\) where \(\underline{D}\,{F}^{K}\) is computed as below:

$$\underline{D}\,{F}^{K}=\underline{D}\,F(K).\,\underline{D}\,F(K-1)\ldots \underline{D}\,F(1)$$
(4)

where K is an arbitrary integer of evaluation points, and \(\underline{D}\,F(K)=\underline{D}\,F({Y}_{K})\).

Correlation dimension method

Suppose X is a chaotic time series whose attractor has been reconstructed in phase space. The correlation dimension method is one of the most fundamental approaches for studying chaotic time series, by which its measure describes the complexity of the attractor53. The correlation dimension can be expressed by Equation (5):

$$C(r)=\frac{2}{(N)(N-1)}{\sum }_{i,j=1}^{N}H(r-\Vert {Y}_{i}-{Y}_{j}\Vert )$$
(5)

where N is the number of m-dimensional points on the reconstructed space, Yi is the delay vector, r is a neighborhood, and H is the Heaviside step function. C(r) is computed for a range of neighborhood sizes r and a range of embedding dimensions m. The next step plots the slopes of C(r) against r on a log-linear plot. For each embedding dimension, there may be a specific curve. If these curves saturate on a common plateau, their y-value is a measure of the correlation dimension. The following describes the relationship between r and C(r):

$$C(r)\propto \alpha {r}^{{D}_{2}}$$
(6)

where α is a constant value and D2 is the correlation dimension given by Equation (7):

$${D}_{2}=\mathop{\mathrm{lim}}\limits_{r\to 0}\frac{\mathrm{log}\,C(r)}{\mathrm{log}\,r}$$
(7)

As seen in the above formula, D2 is estimated based on the linear region, which is found between the depopulated and saturated regions. It should be noted that the depopulated region refers to the area of the plot with no pairs of points. The saturated region includes a large value of r where C(r) reaches a constant value.

It should be emphasized that the correlation dimension is suitable for situations in which the chaotic behavior of a given system is known. In other words, the correlation dimension is unable to distinguish between the stochastic and deterministic processes.

Surrogate data test

Surrogate data test is a Monte Carlo-based algorithm which can detect the chaotic behavior of an existing time series. This test supposes that the given time series is random and is provided by a stochastic process. Then, an arbitrary amount of surrogate data is generated. These data are random but preserve the statistical properties of the original data. The test starts with the hypothesis that the original time series is random. Next, a method for nonlinear time series analysis is chosen, such as that of extracting a correlation dimension, and this measure is computed for the original and surrogate time series. If the results for the original time series are completely different with those for the surrogate time series, then it can be concluded that the hypothesis is not true. In other words, the original time series is related to a deterministic process. It is worth mentioning that the generated surrogate data cover most of the subset of the stochastic process class.

There are several ways to generate surrogate data, the most important of which consists of the following steps. First, Fourier transform converts the original time series to the frequency domain. Then, each element is changed by multiplication to a random phase with a unit magnitude. The resulting data are transformed back by inverse Fourier transform and, finally, randomized data with the same power, known as surrogate data, are generated.

Multi-fractal analysis

Multi-fractal refers to elements composed of several simple fractal objects. Fractal dimension cannot describe these objects’ dynamic behavior. Instead, a continuous spectrum, namely the generalized fractal dimension, was developed54. When the attractor of a given time series is plotted in a phase space, this time series reveals chaotic behavior when the attractor is fractal or multi-fractal. Accordingly, multi-fractal analysis, as well as the surrogate data test, can help reveal the chaotic features of a given object. Up until now, several approaches have been proposed for implementing multi-fractal analysis. Fixed size box-counting is one of the most popular methods employed for solving various problems. As expressed in the following relationship, the surface of a given object is covered by several identical size ε boxes. μ is an arbitrary function which calculates the density of points (B) for each of the boxes. The partition sum of all non-empty boxes can then be calculated according to Equation (8):

$$\begin{array}{ll}{Z}_{\varepsilon }(q)={\sum }_{\mu (B)\ne }{[\mu (B)]}^{q} & q\in R\end{array}$$
(8)

In the above relationship, q can assume any real value for discriminating the sparse from the dense regions. Equation (9) calculates the mass exponent:

$$\tau (q)={\mathrm{lim}}_{\varepsilon \to 0}\frac{\mathrm{ln}\,{Z}_{\varepsilon }(q)}{\mathrm{ln}\,\varepsilon }$$
(9)

Finally, the generalized fractal dimensions are defined by the following relationships:

$${D}_{q}=\frac{\tau (q)}{(q-1)},\,{\rm{for}}\,{q}\ne 1$$
(10)
$${D}_{q}={\mathrm{lim}}_{\varepsilon \to 0}\frac{\sum \mu (B)\mathrm{ln}\,\mu (B)}{\mathrm{ln}\,\varepsilon }\,,{\rm{for}}\,{\rm{q}}=1$$
(11)

f(α) spectrum is used to evaluate the multi-fractal behavior of the data. Equation (12) expresses this measure:

$$f(\alpha )=q\alpha (q)-\tau (q)$$
(12)

Here, α(q) is the Lipschitz-Holder exponent which determines the singularities of a measure. This measure is related to τ(q) and is given by the following relationship:

$$\alpha (q)=\frac{d}{dq}\tau (q)$$
(13)

It should be noted that f(α) can determine the strength of multi-fractality, such that a narrower spectrum demonstrates weak multi-fractal behavior and a broader spectrum indicates stronger multi-fractality behavior.

Results and Discussions

Extracting subsequences

The analysis was performed on the full length sequences of distinctive haplotypes as well as the subsequences of haplotypes, as described below. Since the overall results of the full length analysis indicated that these sequences did not exhibit chaotic behavior, a detailed analysis of their subsequences is provided here.

As shown in Fig. 1, a number of SNPs, whose distances were less than predefined threshold r, constructed subsequence Si. Since the extracted subsequences should have had the minimum data length required for chaos analysis, the present study assumed r equals 30,000 and selected subsequences whose lengths were greater than Thr (800) for further analysis.

Figure 1
figure 1

S1, S2, and S3 are three subsequences whose SNP distances are less than predefined r.

By applying these cut-offs, different numbers of subsequences were extracted for each chromosome. Table 1 presents the total number of subsequences and those with lengths greater than the threshold value.

Table 1 Subsequence information extracted from all chromosomes.

Chaos game representation of haplotype sequences

CGR is an iterative mapping algorithm which can provide a visualize form for a biological sequence. Detailed examination of the obtained picture can reveal the chaotic behavior of a system in terms of the local patterns of the sequence38. In order to quantitatively assess the output of CGR, a coordinate series is extracted containing all positions of the CGR picture. A typical CGR picture and its corresponding coordinate series for the first subsequence of Chromosome 1’s haplotype are shown in Fig. 2’sA,B, respectively. It is worth mentioning that individual values of input sequences correspond to unique points in the CGR picture and vice versa.

Figure 2
figure 2

Binary CGR of the first extracted subsequence (A) and its extracted coordinate series (B) from the first subsequence of Chromosome 1. Since the input sequences consist of two alphabets, the obtained pictures resemble dotted lines.

Since all information is preserved in the CGR plot, as well as in its respective coordinate series, the resulting data of the coordinate series assessment can be attributed to its related CGR picture.

Surrogate test

In order to examine the nonlinear properties of the original input data, the statistical surrogate data testing method was applied on individual coordinate series. The following procedure prepared the surrogate data. For each coordinate series, the embedding dimension and time delay were first determined and 10 surrogate coordinate series were then generated according to the method reviewed earlier. After this, the correlation dimension was computed for the original and its related surrogate coordinate series.

Table 2 presents the typical results of the surrogate test, including the correlation dimension values for the original coordinate series along with the minimum and maximum surrogates for all Chromosome 1 haplotype subsequences. The last column contains the values of the largest Lyapunov exponent (LLE) of each subsequence.

Table 2 Results of the surrogate test and LLE measures for the extracted subsequences of Chromosome 1’s haplotype.

According to Table 2’s data, the null hypothesis of the surrogate test was rejected for most of the subsequences and the sensitivity for the initial condition was confirmed for subsequences containing positive LLE. Thus, Table 2’s data demonstrates that 74% of Chromosome 1’s haplotype subsequences exhibited chaotic features. The surrogate test and LLE computation were also carried out for the other chromosome haplotypes. As shown in Fig. 3, some percentages of subsequences in other haplotypes involved chaotic features. The resulting data indicate that most of the extracted subsequences exhibited chaotic behavior. Since mapping from sequences to coordinate series preserves all the information, the obtained results indicate that these subsequences originated from deterministic behavior.

Figure 3
figure 3

Percent of chaotic subsequences found throughout all chromosome haplotypes.

Multi-fractal analysis

To confirm the above results, whose findings observe chaotic behavior in coordinate series extracted from subsequences, the present study applied multi-fractal analysis for CGRs. Multi-fractal analysis is another method for examining chaotic behavior. For investigating the scaling behavior of the CGR picture, CGRs were covered by several size ε boxes. In line with this assumption, if ε was equal to \(\frac{1}{8}\), for instance, the CGR image was covered by eight boxes. Here, the input sequences had two alphabets and so the resulting CGR was a dotted line. With this method of analysis and according to the density of points in each box, the multi-fractal parameters, including Dq, α, and f(α), were calculated. Figure 4 provides the results of this calculation for the first subsequence of Chromosome 1’s haplotype.

Figure 4
figure 4

Three curves related to the first subsequence of Chromosome 1’s haplotype. (A) represents Dq, (B) is related to spectra α, and (C) represents f(α).

As presented in Fig. 4, the values of Dq, as different fractal dimensions for various q measures, as well as those of f(α), confirm different fractal dimensions. This reveals multi-fractality of the CGR picture. Therefore, the statistical self-similarity of the CGR should be described by a spectrum of fractal dimensions. Figure 5 shows the existence of multi-fractal property demonstrating that the subsequences originated from a deterministic process. It should be noted that multi-fractal curves reveal structural information which is hidden in the original subsequences.

Figure 5
figure 5

Multi-fractal curves of all subsequences related to Chromosome 1’s haplotype (A) dimension spectra, (B) specturm α, and (C) f(α). All of the selected subsequences have positive LLE and passed the surrogate test.

Altogether, these findings indicate that the establishment of full length haplotype sequences can yield new features under the laws of stochastic processes. In addition, these results show that it is possible to extract some other features of subsequences for the purpose of evaluating their similarity and clustering state in a given haplotype by employing chaos theory assumptions. In particular, chaotic analysis reveals the deterministic nature of haplotype sequences. In some problems, such as haplotype assembly, in which the amount of noise and coverage rate of input fragments can affect the quality of reconstructed haplotypes, it is possible to rectify the achieved haplotypes via some features, such as the correlation between neighboring SNPs.

Exploiting CGR for haplotype reconstruction

To improve the reconstruction rate in the single individual haplotype (SIH) problem, one can explore how to use the chaotic feature in the extracted subsequences. In diploid organisms, such as humans, chromosomes are in pairs inherited from the father and mother respectively. As seen in Fig. 6, SIH involves several input fragments, known as reads (ri), which are attained from a defined region on a pair chromosomes, based on a sequencing read technology. A set of reads can be represented by matrix M × N, namely R, where each element rij belongs to {0, 1, ‘−’}. It is worth to noting that coverage refers to the number of reads that cover a certain position. For example, as can be seen in Fig. 6, coverage of the first column equals with 4, because it has been covered by r1, r5, r7, and r9. The haplotype assembly attempts to reconstruct haplotypes h1 and h2, such that these sequences are the most compatible with the input fragments. As mentioned earlier, the existing approaches show a low performance when matrix R involves columns with insufficient coverage. In these columns, there is not sufficient data to determine the measure of alleles with high confidence. Moreover, perhaps there are some positions which are not covered by any input fragments. In such cases, these positions will remain ambiguous and be represented by gaps.

Figure 6
figure 6

Example of an SIH problem. Matrix R contains input fragments. h1 and h2 are the reconstructed haplotypes. Positions 4 and 6 are ambiguous because they are not covered by any input fragments.

The results in previous sections show the route of chaos in several extracted haplotype subsequences. These findings reveal a dependency among SNPs, which can serve as promising knowledge for determining the measure of alleles in ambiguous positions. For this purpose, it is necessary to first provide test sequences with gap positions. These sequences were obtained by corrupting the evaluated subsequences in a substitution of 10% of the individual subsequences for gaps. Next, each corrupted subsequence was mapped by CGR and its corresponding coordinate series was extracted as mentioned before. When dealing with gap measures, the algorithm was restarted and the next point was added between the center of line and the obtained allele. In the next step, the LP method estimated the allele measures in the gap positions. In deterministic chaotic flows, LP is generally applied for noise reduction. Since chaotic attractors are limited in their phase space, each divergence can be interpreted as noise. Regarding to this fact, the phase space is reconstructed. Afterwards, LP enhances the trajectories of the attractor locally. In particular, for each point, a set of its neighbors is found and a local curve which approximates them is determined. Finally, the considered point is updated by projecting to the resultant curve. Readers interested in the details of the LP algorithm may refer to an excellent book by Kantz and Schreiber55.

In this problem, projection was only limited to the points which indicated gap positions. It should be noted that these points were initially set by the average of their neighbors’ measures. Next, a set of neighbors containing 2N + 1 points was constructed, which consisted of the considered point as well as N points before and N points after the considered point.

Figure 7 demonstrates a part of the coordinate series extracted from the first subsequence of Chromosome 1, which is fitted by the LP method. The star signs indicate positions with ambiguous measures projected to the fitted curve. To specify the allele measure of a gap position in the haplotype, the next step converts the value of the projected point to 0 or 1 by comparisons with the threshold.

$$h(i)=\{\begin{array}{ll}0 & if\,cs(i)\le 0.5,\\ 1 & Otherwise,\end{array}$$
(14)

where cs(i) refers to the ith ambiguous measure projected to the fitted curve and h(i) is the output for the ith position with a gap measure.

Figure 7
figure 7

A part of the extracted coordinate series from Chromosome 1’s subsequence (Data), the fitted curve which is the output of the proposed method (Green Curve), and the star points indicating positions with ambiguous values which are determined by a projection to the fitted curve.

The proposed method was employed for all obtained coordinate series from all subsequences of all chromosomes. The reconstructed subsequences were compared with the original subsequences. Figure 8 depicts the percentage of improvements. As seen in Fig. 8, the boxplot demonstrates the deviation of the reconstruction rate for all subsequences belonging to a specified chromosome. The results demonstrate that the rate of reconstruction for all subsequences was more than 97% overall.

Figure 8
figure 8

The deviation of reconstruction rates for all extracted subsequences from all chromosomes.

Conclusion

The current study investigated the chaotic behavior of haplotype sequences by considering the distribution of SNPs and mapping them with the CGR algorithm. The application of surrogate test and multi-fractal analysis procedure on a haplotype dataset demonstrated that the full length of chromosomes did not exhibit chaotic behavior. However, it was found that various numbers of subsequences throughout all haplotypes showed a deterministic nature. According to these findings, the laws of chaotic and stochastic processes can be employed for modeling haplotype sequences in a size-dependent manner. Moreover, the present study applied this knowledge to improve the reconstruction rate in the haplotype assembly problem. These promising results demonstrate that chaotic viewpoint can be effectively utilized to determine alleles in gap positions or low coverage positions. Finally, the source code used in the current work is available from the author upon request.