A biological sequence comparison algorithm using quantum computers

Kösoglu-Kind, Büsra; Loredo, Robert; Grossi, Michele; Bernecker, Christian; Burks, Jody M.; Buchkremer, Rüdiger

doi:10.1038/s41598-023-41086-5

Article
Open access
Published: 04 September 2023

Büsra Kösoglu-Kind¹,
Robert Loredo^2,3,
Michele Grossi⁴,
Christian Bernecker⁵,
Jody M. Burks² &
Rüdiger Buchkremer¹

Scientific Reports volume 13, Article number: 14552 (2023)

Abstract

Genetic information is encoded as linear sequences of nucleotides, represented by strings of letters ranging in length from thousands to billions. Differences between sequences are identified through comparative approaches like sequence analysis, where variations can occur at the individual nucleotide level or collectively due to various phenomena such as recombination or deletion. Detecting these sequence differences is vital for understanding biology and medicine, but the complexity and size of genomic data require substantial classical computing power. Inspired by human visual perception and pixel representation on quantum computers, we leverage these techniques to implement pairwise sequence analysis. Our method utilizes the Flexible Representation of Quantum Images (FRQI) framework, enabling comparisons at a fine granularity, down to single letters or amino acids within gene sequences. This novel approach enhances accuracy and resolution, surpassing traditional methods by capturing subtle genetic variations with precision. In summary, our approach offers algorithmic advantages, including reduced time complexity, improved space efficiency, and accurate sequence comparisons. The novelty lies in applying the FRQI algorithm to compare quantum images in genome sequencing, allowing for examination at the individual letter or amino acid level. This breakthrough holds promise for advancing biological data analysis and enables a more comprehensive understanding of genetic information.

Introduction

According to Lawrence et al., there are ongoing significant international endeavors to develop comprehensive gene catalogs that can identify the genes responsible for the onset and progression of diseases¹. Recent findings suggest that the list of cancer-associated genes deemed significant is expanding rapidly and in a manner that seems improbable. Wang et al. emphasize the urgent requirement for practical diagnostic tools to address the COVID-19 pandemic². They note that the current targets, particularly the nucleocapsid (N) gene primers and probes widely used for diagnosis, are experiencing mutations. Hasin et al. argue that the sequence of amino acids plays a crucial role not only in genomics but also in other “omics” such as proteomics, transcriptomics, and metabolomics³. Amino acid sequences and mutations provide the underlying information flow contributing to disease development. In bioinformatics, amino acid sequence information is represented as letter sequences, which can be analyzed to identify differences and commonalities in RNA and protein gene products. This analysis helps uncover functional and structural insights. Genome analysis is also vital for optimizing industrial drug or food production processes. Quantum computers, a revolutionary technology, promise to advance sequence research in medicine and biochemistry. However, the limitations of current near-term systems include, but are not limited to, decoherence, gate fidelity, connectivity, and lack of error correction⁴.

Quantum computing has made significant contributions to various biochemical inquiries. Fox et al. investigate the use of Quantum Annealers (QAs) to predict the secondary structure of RNA⁵. They demonstrate the speed and efficacy of QA in identifying low-energy solutions, presenting a competitive alternative to classical algorithms and offering potential advancements in RNA folding predictions. Wong and Chang propose a quantum algorithm utilizing Grover’s search algorithm for protein structure prediction. Their approach achieves a quadratic speedup compared to classical methods, showcasing the potential of quantum computing to enhance the efficiency and accuracy of protein structure prediction^6,7. Successful simulations on IBM Quantum’s qasm simulator support their findings, emphasizing the importance of this advancement in drug and vaccine development.

Robert et al. propose a resource-efficient quantum algorithm for protein folding⁸, while Chandarana et al. introduce a digitized-counter-diabatic quantum algorithm for the same purpose⁹. These algorithms demonstrate significant speed improvements over classical approaches, offering promising avenues for protein folding research. Nałęcz-Charkiewicz and Nowak propose an algorithm for DNA sequence assembly using quantum annealing, showcasing its superior speed compared to classical methods¹⁰. Boev et al. present a genome assembly algorithm that combines quantum and quantum-inspired annealing, achieving notable speed enhancements over classical algorithms¹¹. Sarkar et al. propose a quantum algorithm for de novo DNA sequence reconstruction based on the Variational Quantum Eigensolver (VQE) algorithm, showing a significant speed advantage over classical algorithms¹². These studies collectively indicate that quantum computing holds immense potential for revolutionizing protein folding and DNA sequence assembly.

However, no study has specifically addressed the search for single amino acid mutations at the letter level, highlighting an area that requires further exploration.

Quantum computing offers potential acceleration for solving complex problems, including mutation searches and pattern recognition in gene sequences¹³. Unlike classical computers, which use bits with two possible states (0 or 1), quantum computers exploit quantum mechanical phenomena such as superposition, entanglement, and interference to find solutions^14,15. These principles have demonstrated speed-ups in various problem domains. Quantum computers operate on qubits, the fundamental units analogous to classical bits. However, quantum systems are still in their early stages of development and have not reached the point of quantum advantage, where they can solve classically intractable real-world problems. Limitations of near-term quantum systems include decoherence, gate fidelity, connectivity, and lack of error correction⁴. Efforts are focused on optimizing quantum circuits to minimize errors through software and hardware transpilation techniques. These techniques help mitigate decoherence and maintain the quantum state during computations¹⁶. Hardware measures ensure the isolation of qubits from environmental interference, and minimizing these error sources remains a scientific challenge.

Quantum computers hold potential beyond life sciences, extending to finance and manufacturing. Our paper aims to identify sequence similarities using a quantum computer based on existing encoding patterns and techniques^17,18. Image processing, a well-researched area in computer science, is chosen as an approach for mutation searches due to its similarity in identifying sequential digital information. The implementation utilizes a Flexible Representation of Quantum Images (FRQI) to compare quantum images on an IBM quantum computer¹⁹. This method effectively encodes differences among data using quantum states.

In genome sequencing, one of the known challenges is the search for exact or approximate matches²⁰. String matching algorithms are commonly employed for the former, while the Levenshtein distance is used for the latter²¹. These algorithms fall under the category of alignment approaches, which can be further classified as global, local, or heuristic alignments. Notable alignment methods include the Needleman-Wunsch algorithm²², Smith-Waterman algorithm, BLAST²³, and FASTA²⁴. However, state-of-the-art techniques are still characterized by high costs and time requirements. The Needleman-Wunsch algorithm fills a matrix recursively, with one sequence represented along the x-axis and the other along the y-axis. The similarity between the sequences is evaluated using a scoring function that penalizes gaps and mismatches. This algorithm has a time complexity of O(mn), where m and n denote the sizes of the two sequences, and its memory consumption is likewise O(mn). For instance, comparing two sequences of size 100,000 using a 4-byte integer per matrix cell would result in a memory consumption of about 37 gigabytes (10¹⁰ cells × 4 bytes)²². To mitigate this, the Hirschberg algorithm reduces memory consumption to linear space, specifically O(m + n)^25,26.
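As a minimal sketch of the matrix recursion and its memory cost, the following Python code computes a Needleman-Wunsch score with the full matrix and estimates the matrix size. The scoring values (match +1, mismatch −1, gap −1) and the function names are illustrative assumptions, not taken from the paper; the 4-byte cell width is the value at which the 37-gigabyte figure is obtained.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score using the full (len(a)+1) x (len(b)+1) matrix.

    The scoring parameters are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and first row: aligning a prefix against gaps only.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Each cell depends on its upper, left, and upper-left neighbours.
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


def matrix_memory_gib(m, n, bytes_per_cell=4):
    """Approximate size of an m x n score matrix in GiB (assumed cell width)."""
    return m * n * bytes_per_cell / 2**30


print(needleman_wunsch_score("ACGT", "AGT"))
print(round(matrix_memory_gib(100_000, 100_000), 2))
```

Both time and memory grow with the product of the two lengths, which is the O(mn) behaviour described above.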

Quantum memory provides an alternative approach, where information can be stored in the superposition of qubits. The entanglement of qubits enables exponential memory growth. A one-dimensional sequence of length n requires only ⌈log₂ n⌉ qubits, whereas a classical system would need n bits [9] [10] (refer to Fig. 1). When comparing two sequences of lengths n and m, the memory consumption is ⌈log₂ n⌉ + ⌈log₂ m⌉ qubits.
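The logarithmic qubit count can be illustrated with a few lines of Python. This is only a sketch of the index-register arithmetic stated above; the function names are hypothetical, and the count ignores any additional qubits an encoding may need for values or control.

```python
def index_qubits(length):
    """Qubits needed to address `length` positions, i.e. ceil(log2(length))."""
    if length < 1:
        raise ValueError("sequence length must be positive")
    # (length - 1).bit_length() equals ceil(log2(length)) for length >= 1
    # and avoids floating-point rounding.
    return (length - 1).bit_length()


def comparison_qubits(n, m):
    """Index qubits for two sequences of lengths n and m."""
    return index_qubits(n) + index_qubits(m)


print(index_qubits(4))                        # 2
print(index_qubits(100_000))                  # 17
print(comparison_qubits(100_000, 100_000))    # 34
```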

Understanding the abbreviated language of biological information

Biological information, such as genetic sequences, is represented by letters denoting chemical residues in the order of their appearance in the biochemical polymer chain. Nucleic acid sequences (DNA, RNA) are written from the end bearing a phosphate group attached to the sugar’s 5′ carbon (the “5′ end”) to the end bearing an unmodified hydroxyl group on the sugar’s 3′ carbon (the “3′ end”), following a 5′ to 3′ directionality. In protein sequences, the representation starts with an amino acid having a free amino group and ends with an amino acid having a free carboxyl group (N to C direction). Each type of biological sequence has its specific letter code representation, such as A, G, C, and T for DNA nucleotides (Adenine, Guanine, Cytosine, and Thymine), and amino acid codes like W for Tryptophan, G for Glycine, P for Proline, and so on²⁹.

Biochemical identity or similarity

The most common method for comparing biological sequences involves comparative studies using sequence alignments. In alignments, sequences are arranged so that regions of importance can be compared (see Fig. 2) and evolutionary, functional, structural, or other features can be identified and analyzed³⁰. At least two sequences are required for the simplest of alignments (pairwise alignments comparing two sequences), and multiple sequence alignments compare three or more sequences. Common themes analyzed for various reasons (mutational analyses, phylogenetic relationships, comparative sequence analysis for structure–function relationships³¹) include identity and similarity. Identity between sequences means that, in an alignment, two or more sequences contain the exact same residue (e.g., adenine and adenine) at the same relative position in the residue sequence. Sequence similarity means that the residues are from the same biochemical family. In nucleotides, purines are similar (adenine and guanine), and pyrimidines are similar (cytosine and thymine). In amino acids, similarity can be informed by the makeup of the amino acids’ R groups. Identity and similarity scores for a given alignment are calculated using similarity matrices such as BLOSUM or PAM³². Because we mention two different contexts for identity and similarity (biochemical versus frequency), sequence identity and similarity with respect to biochemical information (Adenine, Guanine, etc.) will be referred to as “biochemical” identity or similarity.
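The distinction between biochemical identity and similarity for nucleotides can be sketched as follows. The function names are hypothetical, and counting identical residues as also similar is a convention assumed here for illustration; it is not a scoring scheme taken from the paper.

```python
PURINES = frozenset("AG")      # adenine, guanine
PYRIMIDINES = frozenset("CT")  # cytosine, thymine


def compare_residues(x, y):
    """Classify two aligned nucleotides by biochemical family."""
    if x == y:
        return "identical"
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return "similar"
    return "different"


def identity_and_similarity(seq1, seq2):
    """Fractions of identical and of identical-or-similar aligned positions."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be non-empty and of equal length")
    labels = [compare_residues(x, y) for x, y in zip(seq1, seq2)]
    identical = labels.count("identical")
    similar = labels.count("similar")
    return identical / len(labels), (identical + similar) / len(labels)


# A/A identical, G/A similar (both purines), C/C identical, T/G different.
print(identity_and_similarity("AGCT", "AACG"))  # (0.5, 0.75)
```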

Results

Mapping genetic sequences onto quantum computers using Toffoli and basis gates

In current quantum computing algorithms, information is generally mapped to a quantum state that represents the data in a form that can run on a quantum computer. In this case, the quantum state is encoded via gate-based operations provided by the quantum computer. To map a genetic sequence onto a quantum computer, we use a set of gates to represent two pieces of information: the position in the biological sequence and the value at that position. We use multi-control gates to entangle this information on a quantum computer. One of the most common multi-control gates is the Toffoli gate³⁶, shown in Fig. 3, which in our experiment entangles three qubits, where the first two represent the position and the third represents the value at that position. A Toffoli gate is a 3-qubit gate with two control qubits and one target qubit. In Fig. 3, q0 and q1 are the control qubits and influence the target qubit q2. Each control, identified by a solid sphere, triggers the action on the target qubit, which is identified by the larger sphere with a symbol indicating the type of operation; in this case, it is a NOT gate. When both q0 and q1 are enabled (set), the action on the target qubit is performed. If q0 and q1 are not both enabled, no action is performed on the target qubit, q2. Quantum gate-based systems resemble classical techniques in that they use what are referred to as basis gates to construct larger, more complex gates, such as the Toffoli gate. In classical systems, these basis gates are often referred to as universal gates, such as the AND, NOT, and NAND gates. Quantum systems also have basis gates; the U and CNOT gates are examples of single-qubit and multi-qubit gates, respectively.
Therefore, to construct a Toffoli gate using the basis gates available on a quantum computer, a combination of nine universal single-qubit gates, also called U-gates, and six multi-qubit gates (CNOT) is needed, which, when combined as illustrated in Fig. 3, yields a circuit depth of 11. A qubit can execute several such gates, which initially does not impose any restrictions. But the more gates are applied to a qubit, the deeper the quantum circuit becomes. Over time, the qubits begin to experience some of the effects of noise, such as decoherence, which can then introduce errors into the results of the experiment.
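The gate counts above can be checked with a small state-vector simulation in plain Python. The sketch below uses the textbook decomposition of the Toffoli gate into H, T, T† (each a single-qubit gate expressible as a U gate) and CNOT; that this is the same gate ordering as in Fig. 3 is an assumption, and all names in the code are illustrative.

```python
import cmath
import math

S = 1 / math.sqrt(2)
H = [[S, S], [S, -S]]
T = [[1, 0], [0, cmath.exp(1j * math.pi / 4)]]
TDG = [[1, 0], [0, cmath.exp(-1j * math.pi / 4)]]

# Textbook decomposition (assumed to match Fig. 3); q0, q1 control, q2 target.
CIRCUIT = [
    (H, (2,)), ("cx", (1, 2)), (TDG, (2,)), ("cx", (0, 2)), (T, (2,)),
    ("cx", (1, 2)), (TDG, (2,)), ("cx", (0, 2)), (T, (1,)), (T, (2,)),
    (H, (2,)), ("cx", (0, 1)), (T, (0,)), (TDG, (1,)), ("cx", (0, 1)),
]


def apply_single(state, gate, q):
    """Apply a 2x2 gate to qubit q (bit q of the basis-state index)."""
    new = [0j] * len(state)
    for i, amp in enumerate(state):
        bit = (i >> q) & 1
        for out in (0, 1):
            j = (i & ~(1 << q)) | (out << q)
            new[j] += gate[out][bit] * amp
    return new


def apply_cx(state, control, target):
    """Flip the target bit of every basis state whose control bit is 1."""
    new = [0j] * len(state)
    for i, amp in enumerate(state):
        j = i ^ (1 << target) if (i >> control) & 1 else i
        new[j] = amp
    return new


def run(basis_index):
    state = [0j] * 8
    state[basis_index] = 1 + 0j
    for gate, qubits in CIRCUIT:
        if gate == "cx":
            state = apply_cx(state, *qubits)
        else:
            state = apply_single(state, gate, qubits[0])
    return state


def acts_as_toffoli():
    """Check that q2 is flipped exactly when q0 and q1 are both 1."""
    for i in range(8):
        expected = i ^ 0b100 if (i & 0b011) == 0b011 else i
        if abs(run(i)[expected] - 1) > 1e-9:
            return False
    return True


def gate_counts():
    cx = sum(1 for gate, _ in CIRCUIT if gate == "cx")
    return len(CIRCUIT) - cx, cx


def circuit_depth():
    layer = [0, 0, 0]
    for _, qubits in CIRCUIT:
        depth = max(layer[q] for q in qubits) + 1
        for q in qubits:
            layer[q] = depth
    return max(layer)


print(acts_as_toffoli(), gate_counts(), circuit_depth())
```

Under these assumptions the circuit contains nine single-qubit gates and six CNOT gates, and a simple layer count gives a depth of 11.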

Representing DNA residues on a quantum circuit through parameterized rotations

Conveying the information onto a quantum computer involves a few steps. The first step is to determine how to represent the states of our sequence. Table 1 shows the four DNA residues, and thus a gene sequence, defined on a quantum circuit, where the angle theta represents the four variations. Because of the software convention, the angle of each quantum state must be parameterized as a different value in the code. Rotations about an axis, expressed in terms of π, are used to set the qubit state. This allows us to represent each state by an angle, given in degrees or radians. These angles determine the position to which the respective qubits are rotated, so that it can be recognized directly from the angle definition whether the respective qubit encodes A, C, T, or G (Table 1).

Table 1 Quantum state representations of DNA residues and parameterized U gate rotations used in this study.

The qubit represents adenine (A) by setting the parameter to π. The multi-control gate parameter for π is derived from mapping the state to the basis gates on the device. Basis gates are native to the quantum hardware and are commonly used to create more complex gates, like universal gates in classical computing. To better illustrate this, Fig. 4 represents adenine in a circuit. The five wires represent the qubits on which the multi-control gates and operators are applied. The first qubit, titled “strip,” indicates which sequences we wish to compare; this is explained in more detail in the next section. Figure 4 shows the encoding of the sequence at position (1,1) with the value ‘A’ (adenine), which is represented as a quantum state of π and parameterized as a π/4 rotation.

In contrast to the adenine representation in Fig. 4, thymine is represented in Fig. 5 with a rotation of π/6, which is defined on the circuit as π/24 because of the multi-qubit gate definition. After encoding all values, we add a Hadamard gate to the strip qubit, followed by a measurement operator that measures the value of the strip qubit, which we then use to calculate the similarity.

Sequence comparison with Quantum

This section introduces and describes a quantum algorithm implementation for sequence comparison. Sequence comparison is vital for identifying functional regions, mutations such as polymorphisms, determining different forms of genes such as alleles that result in specific traits or diseases, and many other techniques³⁷. In the population, different alleles exist that lead to different expressions in the individual’s phenotype and ultimately result in e.g., brown or blond hair. According to the Human Genome Project³⁸, it is easier to identify mutations that cause a particular disease, leading to improved diagnoses, prevention, or therapies. One of the latest and most promising techniques is the CRISPR method of the two Nobel Prize winners, Jennifer Doudna and Emmanuelle Charpentier, also called gene scissors, that promises new possibilities against cancer, AIDS, and several hereditary diseases³⁹.

In the quantum implementation, we compare two sequences in a pairwise alignment to detect patterns and mutations. As in classical algorithms such as Needleman-Wunsch and Smith-Waterman, the two sequences are compared position by position^22,40.

We compare two gene sequences using the “similarity” technique described by Fei Yan et al.⁴¹, which analyzes the sequence information using the strip qubit to identify which sequences are being compared. The method evaluates the similarities between encoded quantum representations of the same size, with the image values here replaced by the four nucleotides. A similarity value is estimated based on the probability distribution of the readouts from quantum measurements. The proposed method provides a significant speed-up compared to traditional computers as it requires less computational power⁴¹. This is due to the use of various quantum gates to transform all the information-encoded sequences into the strip. This is done by first preparing the sequences as quantum states where each value carries its index (position) in the sequence and is assigned a variable that takes the value A, C, T, or G. In this experiment, we applied a 2 × 2 matrix to represent a subset of the gene sequences, which are then compared to each other using a single qubit, which we refer to as the strip qubit⁴¹. The reference sequence against which we compare other sequences is specified by the qubit labeled strip0.

In this example, the Adenine is represented on all four entries on the circuit, as illustrated in Fig. 6. A possible encoding strategy adopted here to map letters, the nucleic bases, is to represent all four possible positions 00, 01, 10, and 11 on the quantum circuit. Our system has a matrix with positions labeled 00, 01, 10, and 11. The first reference indicates the encoding for the next strip or sequence. The index for the following reference, strip 0, can be 0 or 1. If it is 0, it represents the reference image, whereas if it is 1, it represents the compared sequence image. This distinction between the two is necessary to avoid confusion. The positions for the following reference are represented as 000, 001, 010, and 011. The initial 0 in these positions signifies the strip. As described in our algorithm, the strip encapsulates the combination of compared sequences. In the example provided in this paper, where a single sequence is compared to another single sequence, only one strip (qubit) is needed. This strip represents either sequence 0 or sequence 1, thus serving as the data encoding component.

For strip1, the sequence that is compared with the reference sequence strip0, the nucleic base T is used at all four possible positions, as illustrated in Fig. 7.
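The addressing scheme described above, a strip bit followed by the position bits, can be written out as a small lookup table. This is a sketch for illustration only; the function name is hypothetical, and the example sequences are the all-adenine reference and all-thymine comparison sequence from the text.

```python
def address_table(reference, compared):
    """Map 'strip bit + position bits' labels to residues.

    Strip 0 holds the reference sequence, strip 1 the compared sequence.
    """
    if len(reference) != len(compared) or not reference:
        raise ValueError("sequences must be non-empty and of equal length")
    width = max(1, (len(reference) - 1).bit_length())  # position bits
    table = {}
    for strip, sequence in enumerate((reference, compared)):
        for position, residue in enumerate(sequence):
            table[f"{strip}{position:0{width}b}"] = residue
    return table


for label, residue in address_table("AAAA", "TTTT").items():
    print(label, residue)
```

For two sequences of four residues this yields the labels 000 to 011 for the reference and 100 to 111 for the compared sequence.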

The process flow of the similarity search between the two gene sequences is implemented in this experiment with a similar approach as in the publication of Yan et al.⁴¹. Figure 8 shows the scheme for parallel comparison of quantum images, which reflects the process flow of the experiment in this work.

First, as shown in Fig. 8, the pre-processing step generates a quantum circuit representing each gene sequence using amplitude estimation techniques. Then, both gene sequences are compared using the pairwise comparison method, which determines the rotation difference between each sequence. This process is completed by measuring the strip qubit, which generates a snapshot, or shot, of the resulting comparison. These shots are taken 8000 times in the experiment. Finally, we view the result counts on a histogram. The result we are most concerned with is the probability of 1. This is the value we add as a parameter to determine the similarity score of the two sequences.
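The measurement step can be illustrated with a classical simulation in plain Python. The sketch assumes an FRQI-style state in which each position carries the amplitudes (cos θ, sin θ) on the value qubit, a strip qubit in equal superposition selecting the sequence, and a Hadamard gate on the strip qubit before measurement. The angle table is hypothetical and is not Table 1 of this study, so the numbers produced here are not expected to reproduce the reported results.

```python
import math
import random

# Hypothetical angles for illustration only (NOT the values of Table 1).
ANGLES = {"A": 0.0, "C": math.pi / 6, "G": math.pi / 3, "T": math.pi / 2}


def frqi_amplitudes(sequence):
    """Per-position amplitudes (value qubit = 0, value qubit = 1)."""
    norm = 1 / math.sqrt(len(sequence))
    return [
        (norm * math.cos(ANGLES[r]), norm * math.sin(ANGLES[r]))
        for r in sequence
    ]


def probability_of_one(reference, compared):
    """Probability of reading 1 on the strip qubit after the Hadamard gate."""
    if len(reference) != len(compared) or not reference:
        raise ValueError("sequences must be non-empty and of equal length")
    p1 = 0.0
    for ref_amp, cmp_amp in zip(frqi_amplitudes(reference),
                                frqi_amplitudes(compared)):
        for r, c in zip(ref_amp, cmp_amp):
            # Joint state: r/sqrt(2) on strip 0 and c/sqrt(2) on strip 1.
            # The Hadamard maps the strip-1 amplitude to (r - c) / 2.
            p1 += ((r - c) / 2) ** 2
    return p1


def sample_counts(p1, shots=8000, seed=0):
    """Draw `shots` simulated strip-qubit readouts with probability p1 of '1'."""
    rng = random.Random(seed)
    ones = sum(rng.random() < p1 for _ in range(shots))
    return {"0": shots - ones, "1": ones}


p1 = probability_of_one("AAAA", "TTTT")
print(p1, sample_counts(p1))
```

In this model, identical sequences give a probability of 0 for the outcome 1, and the probability grows as the angles at corresponding positions move apart.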

To determine the similarity between the sequences, we first extract the probability of measuring 1, P₁. We then use P₁ as a parameter to determine the similarity score between sequence 1 and sequence 2, as shown in the following similarity equation:

$$\text{sim}(\text{sequence}_1, \text{sequence}_2) = 1 - 2P_1$$

Table 2 shows two gene sequences, columns 1 and 2, respectively, where each entry contains the specific value and its represented phase rotation angle. The differences between the four entries are shown in sequential order. The third column indicates the probability of the state $|1\rangle$, P₁:

Table 2 Results of the two compared sequences; comparing the two sequences on a quantum computer illustrates the varying probability results P₁.

Identifying the similarity of the phase angles

This section provides evidence and the experimental results obtained with the proposed quantum algorithm, which identifies sequence similarity by phase angles. The experimental results for the expectation value P₁ come from the probability output obtained by measuring the strip qubit that connects the sequences to be compared. The expectation value P₁ in Table 2 represents the differences between the gene sequences for each position. An important role is played by the distance between the phases, defined by the phase angles in each position. The closer the two phase angles in each position are, the smaller the expected value P₁.

Here P₁ is the expected value, sequence₁ is the reference sequence, and sequence₂ is the comparison sequence. After the measurement, the first quantity determined is the probability P₁. Using P₁, the similarity score sim(sequence₁, sequence₂) between the reference and comparison sequences is determined. The result of the sim equation indicates whether a change is present. If the similarity value equals 1, the two sequences are the same, whereas if the similarity value is less than 1, there is a difference between the two sequences.

In this case, the probability of obtaining the $|1\rangle$ state is 0.378, as shown in Fig. 9. Inserting this result into the similarity equation gives the following similarity score:

$$\text{sim}(\text{sequence}_1, \text{sequence}_2) = 1 - (2 \times 0.378) = 0.244$$

This is interpreted as follows: a similarity score of approximately 0.244 (24.4%) is well below 1, indicating that the two sequences differ.
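The evaluation of the similarity equation from measurement counts can be sketched in a few lines of Python. The function name and the example counts are hypothetical; the counts are chosen only so that 3024 of 8000 shots give P₁ = 0.378.

```python
def similarity_from_counts(counts):
    """Evaluate sim = 1 - 2 * P1 from strip-qubit measurement counts."""
    shots = sum(counts.values())
    if shots == 0:
        raise ValueError("no measurement results")
    p1 = counts.get("1", 0) / shots
    return 1 - 2 * p1


# Hypothetical counts: 3024 of 8000 shots read '1', i.e. P1 = 0.378.
print(round(similarity_from_counts({"0": 4976, "1": 3024}), 3))
```

A result of 1 corresponds to no readouts of 1 at all, which is the case of identical sequences described above.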

Discussion

The goal of this work was two-fold. First, to represent a gene sequence as a quantum state based on FRQI. Second, to perform a comparison, utilizing the differences between phase angles of the two gene sequences and calculating the similarity score between them. It allowed us to illustrate that a quantum system makes sequential search possible.

However, due to noise and errors, the limitations of current quantum systems made it impossible to compute a similarity score for an entire gene sequence, which can contain millions upon millions of values. The human genome is known to have about 3 billion base pairs, which currently cannot be mapped completely as a quantum state. Nevertheless, this project has demonstrated that quantum computers have the potential, in principle, to solve complex problems such as similarity scoring faster and with less memory than classical computers. For comparison, aligning two sequences of size 100,000 with a 4-byte integer per matrix cell results in a memory consumption of about 37 gigabytes^22,42. The Hirschberg algorithm reduces the memory consumption to a linear space of O(m + n)^25,26. With quantum memory, information can be stored as a qubit superposition. The entanglement of qubits results in an exponential growth of memory. A one-dimensional sequence of length n requires only ⌈log₂ n⌉ qubits, compared with n bits in a classical system^19,27,43. The memory consumption for comparing two sequences of lengths n and m is ⌈log₂ n⌉ + ⌈log₂ m⌉ qubits.

The approach in this work could be used on larger quantum computers in the future, expanded with more base pairs, and even applied to multiple gene sequences in parallel. Moreover, this approach is just one of many that could be used for a mutation search. Other methods include string comparison using the Hamming distance⁴⁴, string comparison using Grover’s search algorithm (35), or the quantum pattern-matching algorithm described by Niroula and Nam, which matches a search string (pattern) of length M inside a longer text of length N⁴⁶.

Quantum computers are still in the early stages and are subject to several challenges. Quantum systems have potential use in various applications in life sciences⁴⁷ or healthcare, where breakthroughs are expected soon. Quantum mechanical calculations should make it possible to quantitatively predict molecules’ properties⁴⁸.

Another essential use case is the application in the area of genomics. Big Data analytics can analyze ever-larger data generated by wearables, inside content, and eHealth apps. In addition, genetic testing is also increasingly in demand. Quantum computing and faster DNA sequencing would enable more comprehensive analyses of this data and lead to a speedier diagnosis.

Because of the complexity, we see particular potential in using quantum technologies in systems medicine⁴⁹. Diseases are complex, as we have painfully discovered in the recent pandemic. Not only the genetic code but also the microbiome, the proteome, the metabolome, or the virome may play a crucial role. All systems interact and generate a higher degree of complexity that we can hardly manage with classical computers⁵⁰. Thus, we propose “quantum systems medicine” for further medical research.

Our approach’s novelty is applying a similarity method to compare quantum images in genome sequencing using the FRQI framework. It allows for comparisons at a fine granularity, down to single letters or amino acids within gene sequences. This breakthrough enables us to capture and compare subtle genetic variations, providing a more precise examination of gene information than traditional methods.

In this article, we present algorithmic advantages regarding time and space efficiency and accuracy.

Time

The comparison process is fast and costless since it involves a single quantum gate without any control condition to simultaneously transform the entire information encoding the two quantum images. It eliminates the need for iterative calculations and reduces the computational time significantly. Moreover, the quantum measurement, which determines the quantum system’s state, is performed only at the end, contributing to time savings.

Space

The encoding of the gene sequences is given, and the number of sequences remains constant throughout the comparison process. It eliminates concerns about the scale of the encoding, making it more efficient in terms of space utilization. The encoding with different bases of the gene ensures that the number of sequences being compared does not impact the space requirements, unlike traditional methods where memory consumption grows with the size of the sequences.

Accuracy

The article builds upon existing alignment approaches used in genome sequencing, such as string-matching algorithms for exact matches and the Levenshtein distance for approximate matches. These established algorithms provide accurate results and are well-established in the field. Using quantum memory and entanglement allows for precise and reliable comparisons between sequences.

In contrast to current state-of-the-art methods, which often involve high costs and time requirements, the proposed approach offers advantages in terms of time, space, and accuracy. The Needleman–Wunsch algorithm, for example, has a time complexity and a memory consumption of O(mn), which is substantial when comparing two sequences of size 100,000. In comparison, our approach significantly reduces both time and memory requirements. With its exponential memory growth through qubit entanglement, quantum memory offers a more efficient representation of sequences, requiring fewer qubits than a classical system requires bits. The memory consumption for comparing two sequences of lengths n and m is ⌈log₂ n⌉ + ⌈log₂ m⌉ qubits, ensuring efficient utilization of space resources.

In summary, our approach offers algorithmic advantages: reduced time complexity, improved space efficiency, and accurate sequence comparisons. The novelty lies in adapting the FRQI algorithm to perform similarity comparisons of quantum images in genome sequencing, enabling examination at the individual letter or amino acid level. This breakthrough enhances the accuracy and resolution of sequence comparisons, opening new possibilities in biological data analysis.

Materials, methods, and limitations

Limitations of quantum computers

The conducted investigation in this domain is subject to certain restrictions due to the limitations of current quantum computers. Data analysis with a real dataset could not be performed due to the circuit depth necessary to represent the data. Therefore, the similarity comparison in this work is limited to two short gene sequences. Each gene sequence has four nucleotides, primarily used as an example that can, over time, scale as the technology of quantum systems continues to evolve. There are three main challenges:

1. Scalability: at the time of this writing, there are 433-qubit machines with short decoherence times. We used a 5-qubit machine in this study because only a subset of a gene sequence containing four nucleotides was examined.
2. Error rates: if the circuit is too deep, that is, too many quantum operators are used and the coherence time of the qubits is exceeded, the results will contain noise that affects their precision. This is where a higher quantum volume may help in the future as we get to error correction and, eventually, logical qubits.
3. Data complexity: because a single gene already contains many nucleotides (sequences of letters), only part of a gene can be represented due to hardware limitations. This is because each base pair requires several quantum operators, which adds to the circuit complexity. With more powerful quantum computers, the whole gene could be mapped onto the quantum computer, and a similarity comparison could be carried out for multiple gene sequences.

Future work

Gene sequencing and analysis are critical yet complex steps in medical research, and genes are still not fully understood. Scientists are still identifying the functions of the individual genetic building blocks. Gene analysis captures only a small part of the overall complexity of a disease. Holistic systems medicine involves many other components, such as proteins, small molecules, chemical reactions, bacteria, viruses, and even the social networks in which people interact. Although this increases complexity many times over⁵⁰, combining natural language processing (NLP) and quantum computing could lead to new insights. NLP, a branch of artificial intelligence, can be applied to image and text analysis. It makes it possible to perform an otherwise very time-consuming investigation much faster and to examine larger amounts of textual data. An example application is the Artificial Intelligence Double Funnel by Buchkremer et al.⁵¹.

It is at this point that quantum computers could be of great help in analyzing these complex systems. As more powerful quantum systems become available, more complex tasks could be solved with their help, which could include decoding the human genome. Processor architectures and performance, including hardware quality and ongoing developments in error mitigation and error correction, are continuously improving.

IBM plans a 1121-qubit quantum computer for 2023, together with many error suppression and mitigation techniques, which should significantly advance scaling⁵². Even though we are still at the beginning of a very long road, quantum computers will open many possibilities in the future that will help solve currently intractable problems.

Materials and methods

In this section, we provide more detail on the implementation of the algorithm described in this work. The comparison process consists of the following steps:

Step 1 Assign each element a rotation angle; in this case, we used those defined in Table 1.
Step 2 Create a quantum circuit whose index register holds log₂N qubits, where N is the number of characters in each sequence; these qubits represent the position of each character in the sequence string. In this example N = 4, so log₂4 = 2 index qubits are needed. Then add the strip qubit; a single qubit suffices here because we are comparing two sequences. Next, add a single qubit that encodes the value of the sequence element. Finally, add a classical register to read the result of the strip qubit. Table 3 illustrates the assignment labels.
Table 3 Circuit labels and definitions. Each qubit(s) is identified to illustrate whether the qubit applies to the position or the value.
Step 3 Set the strip and index qubits into a superposition state using Hadamard gates, as illustrated in Fig. 4.
Step 4 Encode each strip, index, and sequence value using a multi-control, single-target gate, as illustrated in Figs. 4 and 5. In this case, the gate will consist of three controls and one target, where the controls align with the strip and index qubits, and the target gate is a rotation gate aligned with the value of the sequence (either A, C, T, G), labeled DNA in the circuit.
Step 5 Set the control to capture each index position of each strip, denoted in Figs. 4 and 5.
Step 6 Add another Hadamard gate to the strip qubit to complete the circuit.
Step 7 Measure the strip qubit; this will be used to determine the difference between sequences based on the result of P₁. The basic circuit is defined as follows and represented in Figs. 4 and 5.
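The steps above can be sketched as a small noiseless simulation in plain Python. This is an illustration under stated assumptions, not the circuit run in this study: the rotation on the value qubit is assumed to produce the amplitudes cos θ and sin θ, the angle table `EXAMPLE_ANGLES` is a placeholder and does not reproduce Table 1, and the function name `strip_p1` is hypothetical.

```python
import math

# Placeholder rotation angles for illustration only (Step 1).
# These are NOT the values defined in Table 1.
EXAMPLE_ANGLES = {"A": 0.0, "C": math.pi / 6, "G": math.pi / 3, "T": math.pi / 2}


def strip_p1(seq_a, seq_b, angles=EXAMPLE_ANGLES):
    """Ideal probability of reading 1 on the strip qubit for two sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    n = len(seq_a)
    if n == 0 or n & (n - 1):
        raise ValueError("sequence length must be a power of two")
    # Steps 2-3: Hadamards on the strip qubit and the log2(n) index qubits
    # give every (strip, index) pair the same amplitude 1/sqrt(2n).
    norm = 1.0 / math.sqrt(2 * n)
    p1 = 0.0
    for char_a, char_b in zip(seq_a, seq_b):
        # Steps 4-5: the controlled rotation writes cos/sin of the character's
        # angle onto the value qubit for this strip and index.
        theta_a, theta_b = angles[char_a], angles[char_b]
        amp_a = (norm * math.cos(theta_a), norm * math.sin(theta_a))
        amp_b = (norm * math.cos(theta_b), norm * math.sin(theta_b))
        # Step 6: the final Hadamard on the strip qubit sends the difference
        # of the two strips' amplitudes, divided by sqrt(2), to strip = 1.
        for a, b in zip(amp_a, amp_b):
            p1 += ((a - b) / math.sqrt(2)) ** 2
    # Step 7: this is the probability that measuring the strip qubit yields 1.
    return p1


if __name__ == "__main__":
    for pair in (("ACGT", "ACGT"), ("AAAA", "AAAT"), ("AAAA", "TTTT")):
        print(pair, round(strip_p1(*pair), 4))
```

Under these assumptions the loop reduces to P₁ = ½(1 − mean cos(θ₁ᵢ − θ₂ᵢ)), so identical sequences give P₁ = 0 and P₁ grows as more positions differ. Results from real hardware will deviate from this ideal value because of noise and gate decomposition.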

We now present an application of a quantum algorithm based on the Flexible Representation of Quantum Images (FRQI) applied to biological sequences. In its seminal definition, this algorithm provides a quantum representation of images that allows efficient encoding of classical data into a quantum state, i.e., color information and pixel position. Within this algorithm, encoding classical data into a quantum state requires a polynomial number of simple gates.

The idea here is to leverage these quantum encoding techniques to represent the four nucleotides (or the 20 amino acids) using the multi-controlled RY gate (MCRY).

Here we define the quantum state |Sequence(θ)⟩ as a normalized state that encodes the genetic sequences to compare, as a function of θ:

$$|Sequence(\theta )\rangle =\frac{1}{{2}^{n}}\sum_{i=0}^{{2}^{2n}-1}(\mathrm{cos}{\theta }_{i}|0\rangle + \mathrm{sin}{\theta }_{i}\left|1\rangle \right) \otimes |i\rangle$$
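As a consistency check, added here for illustration, the prefactor normalizes the state: the index states |i⟩ are orthonormal, and each value-qubit state has unit norm because cos²θᵢ + sin²θᵢ = 1. Hence

$$\langle Sequence(\theta )|Sequence(\theta )\rangle =\frac{1}{{2}^{2n}}\sum_{i=0}^{{2}^{2n}-1}\left({\mathrm{cos}}^{2}{\theta }_{i}+{\mathrm{sin}}^{2}{\theta }_{i}\right)=\frac{{2}^{2n}}{{2}^{2n}}=1$$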

A simple example of a 4-character nucleotide sequence, where the quantum registers represent the strip, index (reference in the gene sequence), and the nucleotide basis value, is given below, with corresponding θ angles (nitrogen bases) and associated Kets (position encoding):

$$\begin{aligned} |{\text{Sequence}}1\rangle = & \frac{1}{2\sqrt{2}}\left[ {\left( {\cos \theta _{0} |0\rangle + \sin \theta _{0} |1\rangle } \right) \otimes |000\rangle + \left( {\cos \theta _{1} |0\rangle + \sin \theta _{1} |1\rangle } \right) \otimes |100\rangle + \left( {\cos \theta _{2} |0\rangle + \sin \theta _{2} |1\rangle } \right) \otimes |101\rangle + \left( {\cos \theta _{3} |0\rangle + \sin \theta _{3} |1\rangle } \right) \otimes |111\rangle } \right. \\ & \left. { + \left( {\cos \theta _{4} |0\rangle + \sin \theta _{4} |1\rangle } \right) \otimes |010\rangle + \left( {\cos \theta _{5} |0\rangle + \sin \theta _{5} |1\rangle } \right) \otimes |011\rangle + \left( {\cos \theta _{6} |0\rangle + \sin \theta _{6} |1\rangle } \right) \otimes |001\rangle + \left( {\cos \theta _{7} |0\rangle + \sin \theta _{7} |1\rangle } \right) \otimes |110\rangle } \right] \\ \end{aligned}$$

In this example, we set every element of the first sequence (θ₁) to π/4 (A) and every element of the second sequence (θ₂) to π/24 (T) and obtained a probability P₁ of 0.378, which resulted in a similarity score of 0.246, or 24.6%.
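For readers reproducing such a measurement, P₁ is the fraction of shots in which the strip qubit is read as 1. The sketch below estimates it, with a binomial standard error, from a dictionary of measurement counts. The function name and the counts are hypothetical: the counts are made-up numbers chosen only so that their ratio matches the reported P₁, and they are not the raw data of this study.

```python
import math


def estimate_p1(counts):
    """Estimate P1 and its binomial standard error from strip-qubit counts."""
    shots = sum(counts.values())
    if shots == 0:
        raise ValueError("no shots recorded")
    p1 = counts.get("1", 0) / shots
    std_err = math.sqrt(p1 * (1.0 - p1) / shots)
    return p1, std_err


if __name__ == "__main__":
    # Hypothetical counts for illustration; not measured data.
    example_counts = {"0": 622, "1": 378}
    p1, err = estimate_p1(example_counts)
    print(f"P1 = {p1:.3f} +/- {err:.3f}")
```

The standard error shrinks with the square root of the number of shots, so small differences between similar sequences need correspondingly more repetitions to resolve.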

Data availability

We used the 5-qubit IBM cloud machine in this study because only a subset of a gene sequence containing four nucleotides was examined. This work was performed based on the Qiskit textbook chapter on quantum image processing (available at https://learn.qiskit.org/course/ch-applications/flexible-representation-of-quantum-images-frqi).

References

Lawrence, M. S. et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 499, 214–218 (2013).
Wang, R., Hozumi, Y., Yin, C. & Wei, G. W. Mutations on COVID-19 diagnostic targets. Genomics 112, 5204–5213 (2020).
Hasin, Y., Seldin, M. & Lusis, A. Multi-omics approaches to disease. Genome Biol. 18, 1–15 (2017).
Preskill, J. Quantum computing in the NISQ era and beyond. Quantum 2, 79 (2018).
Fox, D. M., MacDermaid, C. M., Schreij, A. M. A., Zwierzyna, M. & Walker, R. C. RNA folding using quantum computers. PLoS Comput. Biol. 18, e1010032 (2022).
Wong, R. & Chang, W. L. Fast quantum algorithm for protein structure prediction in hydrophobic-hydrophilic model. J. Parallel Distrib. Comput. 164, 178–190 (2022).
Wong, R. & Chang, W. L. Quantum speedup for protein structure prediction. IEEE Trans. Nanobiosci. 20, 323–330 (2021).
Robert, A., Barkoutsos, P. K., Woerner, S. & Tavernelli, I. Resource-efficient quantum algorithm for protein folding. npj Quantum Inf. 7, 38 (2021).
Chandarana, P., Hegade, N. N., Montalban, I., Solano, E. & Chen, X. Digitized counterdiabatic quantum algorithm for protein folding. Phys. Rev. Appl., 1–14 (2022).
Nałęcz-Charkiewicz, K. & Nowak, R. M. Algorithm for DNA sequence assembly by quantum annealing. BMC Bioinformat. 23, 1–17 (2022).
Boev, A. S. et al. Genome assembly using quantum and quantum-inspired annealing. Sci. Rep. 11, 13183 (2021).
Sarkar, A., Al-Ars, Z. & Bertels, K. QuASeR: Quantum accelerated de novo DNA sequence reconstruction. PLoS ONE 16(4), e0249850. https://doi.org/10.1371/journal.pone.0249850 (2021).
Outeiral, C. et al. The prospects of quantum computing in computational molecular biology. Wiley Interdiscip. Rev. Comput. Mol. Sci. 11, 1–23 (2021).
Grover, L. K. A fast quantum mechanical algorithm for database search. in Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing - STOC ’96 Part F1294 212–219 (ACM Press, 1996).
dos Santos, R. R. Introduction to quantum Monte Carlo simulations for fermionic systems. Brazil. J. Phys. 33, 36–54 (2003).
Liu, J. et al. Hybrid quantum-classical convolutional neural networks. Sci. China Phys. Mech. Astron. 64, 290311 (2021).
Le, P. Q., Dong, F. & Hirota, K. A flexible representation of quantum images for polynomial preparation, image compression, and processing operations. Quantum Inf. Process. 10, 63–84 (2011).
Liu, X., Zhou, R.-G., El-Rafei, A., Li, F.-X. & Xu, R.-Q. Similarity assessment of quantum images. Quantum Inf. Process. 18, 244 (2019).
Su, J., Guo, X., Liu, C., Lu, S. & Li, L. An improved novel quantum image representation and its experimental test on IBM quantum experience. Sci. Rep. 11, 13879 (2021).
Wagner, R. A. & Fischer, M. J. The string-to-string correction problem. J. ACM 21, 168–173 (1974).
Levenshtein, V. I. Binary codes capable of correcting deletions, insertions, and reversals. Sov. Phys. Dokl. 10, 707–710 (1966).
Needleman, S. B. & Wunsch, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453 (1970).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Lipman, D. J. & Pearson, W. R. Rapid and sensitive protein similarity searches. Science 227, 1435–1441 (1985).
Hirschberg, D. S. A linear space algorithm for computing maximal common subsequences. Commun. ACM 18, 341–343 (1975).
Chao, K.-M., Hardison, R. C. & Miller, W. Recent developments in linear-space alignment methods: A survey. J. Comput. Biol. 1(4), 271–291. https://doi.org/10.1089/cmb.1994.1.271 (1994).
Yao, X.-W. et al. Quantum image processing and its application to edge detection: Theory and experiment. Phys. Rev. X https://doi.org/10.1103/PhysRevX.7.031041 (2017).
Brassard, G., Høyer, P., Mosca, M. & Tapp, A. Quantum amplitude amplification and estimation. In Quantum Computation and Information (eds Lomonaco, S. J. & Brandt, H. E.) 53–74 (American Mathematical Society, 2002). https://doi.org/10.1090/conm/305/05215.
Zhao, Y. et al. Nucleic acids analysis. Sci. China Chem. 64, 171–203 (2021).
Taylor, G. K. & Stoddard, B. L. Structural, functional and evolutionary relationships between homing endonucleases and proteins from their host organisms. Nucleic Acids Res. 40, 5189–5200 (2012).
Katoh, K., Rozewicki, J. & Yamada, K. D. MAFFT online service: Multiple sequence alignment, interactive sequence choice and visualization. Brief. Bioinform. 20, 1160–1166 (2019).
Pearson, W. R. Selecting the right similarity-scoring matrix. Curr. Protoc. Bioinform. 43, 351–359 (2013).
Thompson, J. D., Higgins, D. G. & Gibson, T. J. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673–4680 (1994).
Larsen, N. & Zwieb, C. SRP-RNA sequence alignment and secondary structure. Nucleic Acids Res. 19, 209–215 (1991).
Burks, J., Zwieb, C., Müller, F., Wower, I. & Wower, J. Comparative 3-D modeling of tmRNA. BMC Mol. Biol. https://doi.org/10.1186/1471-2199-6-14 (2005).
He, Y., Luo, M.-X., Zhang, E., Wang, H.-K. & Wang, X.-F. Decompositions of n-qubit toffoli gates with linear circuit complexity. Int. J. Theor. Phys. 56, 2350–2361 (2017).
Buckland, P. R. The importance and identification of regulatory polymorphisms and their mechanisms of action. Biochim. Biophys. Acta Mol. Basis Dis. 1762, 17–28 (2006).
Collins, F. S. Implications of the human genome project for medical science. JAMA 285, 540 (2001).
Jiang, F. & Doudna, J. A. CRISPR–Cas9 structures and mechanisms. Annu. Rev. Biophys. 46, 505–529 (2017).
Smith, T. F. & Waterman, M. S. Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197 (1981).
Yan, F. et al. Assessing the similarity of quantum images based on probability measurements. in 2012 IEEE Congress on Evolutionary Computation 1–6 (IEEE, 2012). https://doi.org/10.1109/CEC.2012.6256418
Clote, P. & Backofen, R. Computational Molecular Biology an Introduction (Wiley, 2000).
Chakraborty, S., Mandal, S. B. & Shaikh, S. H. Quantum image processing: Challenges and future research issues. Int. J. Inform. Technol. 14(1), 475–489. https://doi.org/10.1007/s41870-018-0227-8 (2022).
Khan, M. & Miranskyy, A. String comparison on a quantum computer using hamming distance. 1–19 (2021).
Menon, V. & Chattopadhyay, A. Quantum string comparison method. 1, 1–3 (2020)
Niroula, P. & Nam, Y. A quantum algorithm for string matching. npj Quantum Inform. https://doi.org/10.1038/s41534-021-00369-3 (2021).
Yuan, X. A quantum-computing advantage for chemistry. Science 369, 1054–1055 (2020).
Hempel, C. et al. Quantum chemistry calculations on a trapped-ion quantum simulator. Phys. Rev. X 8, 1–22 (2018).
Apweiler, R. et al. Whither systems medicine?. Exp. Mol. Med. 50, e453–e453 (2018).
Boccaletti, S., Latora, V., Moreno, Y., Chavez, M. & Hwang, D. Complex networks: Structure and dynamics. Phys. Rep. 424, 175–308 (2006).
Buchkremer, R. et al. The application of artificial intelligence technologies as a substitute for reading and to support and enhance the authoring of scientific review articles. IEEE Access 7, 65263–65276 (2019).
Riel, H. Quantum computing technology. in 2021 IEEE International Electron Devices Meeting (IEDM) 2021-Dec 1.3.1–1.3.7 (IEEE, 2021).


Acknowledgements

M.G. is supported by CERN through CERN Quantum Technology Initiative.

Funding

Open Access funding enabled and organized by Projekt DEAL.

Author information

Authors and Affiliations

Institute of IT Management and Digitization Research (IFID), FOM University of Applied Sciences in Economics and Management, 40476, Dusseldorf, Germany
Büsra Kösoglu-Kind & Rüdiger Buchkremer
International Business Machines Corporation (IBM), Armonk, NY, 10504, USA
Robert Loredo & Jody M. Burks
IBM Quantum, IBM Thomas J. Watson Research Center, 1101 Kitchawan Rd, NY, 10598, Yorktown Heights, USA
Robert Loredo
European Organization for Nuclear Research (CERN), 1211, Geneva, Switzerland
Michele Grossi
International Business Machines Corporation (IBM), 80339, Munich, Germany
Christian Bernecker

Authors

Büsra Kösoglu-Kind
Robert Loredo
Michele Grossi
Christian Bernecker
Jody M. Burks
Rüdiger Buchkremer

Contributions

R. B. conceived the idea of searching genetic sequences as letters on quantum computers. R. B. and B.K.K. conceived the study. R. L. identified the initial concept to implement this work based on the Qiskit textbook chapter on quantum image processing for a bioinformatic application. M. G. contributed to the coding of the sequences using the methods described above. J. B. contributed her research in the field of bioinformatics, providing the context of gene sequencing and biological sequence analysis. C. B. contributed his research in DNA sequencing with a particular focus on the performance and memory consumption of traditional alignment methods such as BLAST and FASTA. B.K.K. authored her graduate thesis on this topic, which served as the motivation to write this paper. R. B. served as B.K.K.'s graduate advisor on her graduate thesis at the FOM University of Applied Sciences. All authors reviewed and approved the final manuscript.

Corresponding author

Correspondence to Rüdiger Buchkremer.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Kösoglu-Kind, B., Loredo, R., Grossi, M. et al. A biological sequence comparison algorithm using quantum computers. Sci Rep 13, 14552 (2023). https://doi.org/10.1038/s41598-023-41086-5


Received: 27 March 2023
Accepted: 22 August 2023
Published: 04 September 2023
DOI: https://doi.org/10.1038/s41598-023-41086-5
