## Introduction

Shor’s algorithm1 is a quantum algorithm that provides a way of finding the nontrivial factors of an L-bit odd composite integer $$N=pq$$ in polynomial time with high probability. The crux of Shor’s algorithm rests upon Quantum Phase Estimation (QPE)2, which is a quantum routine that estimates the phase $$\varphi _u$$ of an eigenvalue $$e^{2\pi i \varphi _u}$$ corresponding to an eigenvector $$|u\rangle$$ for some unitary matrix $${\hat{U}}$$. QPE efficiently solves a problem related to factoring, known as the order-finding problem, in polynomial time in the number of bits needed to specify the problem, which in this case is $$L=\lceil \log _{2}{N}\rceil$$. By solving the order-finding problem using QPE and carrying out a few extra steps, one can factor the integer N. There is no known classical algorithm that can solve the same problem in polynomial time2,3.

A large corpus of work has been done with regards to the experimental realization of Shor’s algorithm over the years. The pioneering work was performed with liquid-state nuclear magnetic resonance, factoring 15 on a 7-qubit quantum computer4. The considerable resource demands of Shor’s original algorithm were circumvented by using various approaches, including adiabatic quantum computing5 and in the standard network model using techniques of compilation6 that reduced the demands to within the reach of single-photon architectures7,8,9 and a super-conducting phase qubit system10. In 2012, a proof-of-concept demonstration of the order-finding algorithm for the integer 21 was carried out with photonic qubits using, in addition to the aforementioned compilation technique, an iterative scheme11, where the control register is reduced to one qubit and this qubit is reset and reused12,13. However, factoring was not possible in this demonstration due to the low number of iterations. Later, the iterative scheme was demonstrated for factoring 15, 21 and 35 on an IBM quantum processor by splitting up the iterations and combining the outcomes14. Recently, building on previous schemes of hybrid factorization15,16, a quantum-classical hybrid scheme has been implemented on IBM’s quantum processors for the prime factorization of 35. This hybrid scheme of factorization alleviates the resource requirements of the algorithm at the expense of performing part of the factoring classically17.

In this paper, we build on the order-finding routine of Ref.11 and implement a version of Shor’s algorithm for factoring 21 using only 5 qubits—the work register contains 2 qubits and the control register contains 3 qubits, each providing 1-bit of accuracy in the resolution of the peaks in the output probability distribution used to find the order. This approach is in contrast to the iterative version18 used in Refs.11 and14, which employs a single qubit that is recycled through measurement and feed-forward, giving 1-bit of accuracy each time it is recycled. The advantage of the iterative approach lies in this very reason; through mid-circuit measurement and real-time conditional feed-forward operations, the total number of qubits required by the algorithm is significantly reduced. At the time of writing, IBM’s quantum processors do not yet support real-time conditionals necessary for the implementation of the iterative approach, so we use 3 qubits for the control register, one for each effective iteration. Thus, our compact approach is completely equivalent to the iterative approach. In future, once the capability of performing real-time conditionals is added, a further reduction in resources will be possible for our implementation, potentially improving the quality of the results even more and opening up the possibility of factoring larger integers.

As it stands, the controlled-NOT ($${CX}$$) gate count of the standard algorithm19 exceeds 40 and in preliminary tests we have found that the output probability distribution is indistinguishable from a uniform probability distribution (noise) on the IBM quantum processors. Our improved version reduces the $${CX}$$ gate count through the use of relative phase Toffoli gates, reducing the $${CX}$$ gate count by half while leaving the overall operation of the circuit unchanged and we suspect this technique may extend beyond the case considered here. We have gone further than the work in Ref.11, where full factorization of 21 was not achieved as with only two bits of accuracy for the peaks of the output probability distribution continued fractions would fail to extract the correct order. On the other hand, in the work in Ref.14, where 21 was factored on an IBM processor, a larger number of 6 qubits was required and the iterations were split into three separate circuits, with the need to re-initialise the work register into specific quantum states for each iteration. Our approach is thus more efficient and compact, enabling algorithm outcomes with reduced noise. To support our claims, we successfully carry out continued fractions and evaluate the performance of the algorithm by (i) quantitatively comparing the measured probability distribution with the ideal distribution and noise via the Kolmogorov distance, (ii) performing state tomography experiments on the control register, and (iii) verifying the presence of entanglement across both registers.

The paper is organized as follows. In “Background”, we give a brief review of the order-finding problem and its relation to Shor’s algorithm. In “Compiled Shor’s algorithm” we expound on the compiled version of Shor’s algorithm, where we consider the specific case of the factorization of $$N=21$$. We construct the quantum circuits that realize the required modular exponentiation unitaries and proceed to optimize their $${CX}$$ gate count through the introduction of relative phase Toffoli gates. We report our results from executing our compact construction of the algorithm on IBM’s quantum computers in “Experiments”. Finally, we provide concluding remarks of our study in “Concluding remarks”. Supplementary information is included.

## Background

### Order finding

The order-finding problem is typically stated as follows. Given positive integers N and $$a \in \{0, 1, \ldots , N - 1\}$$ that share no common factors, we seek to find the least positive integer $$r \in \{0, 1, \ldots N\}$$ such that $$a^{r}~{\rm mod}~N = 1$$. The integer r is said to be the order of a and N, and the order-finding problem is that of finding r for a particular a and N. There exists no classical algorithm that can solve the order-finding problem efficiently, that is, with operations (elementary gates) that scale polynomially in the number of bits needed to specify N, i.e. $$L \equiv \lceil \log _{2}{N}\rceil$$2,3.

### Shor’s algorithm

The order-finding problem can be efficiently solved on a quantum computer with $${\mathcal {O}}(L^3)$$ operations; the cost being mostly due to the modular exponentiation operation which requires $${\mathcal {O}}(L^3)$$ quantum gates2. The problem of prime factorization is the subject of Shor’s algorithm, which is equivalent to the order-finding problem: for an L-bit positive odd integer $$N=pq$$ and randomly chosen positive integer $$a \le N$$ co-prime to N, the order r of a and N can be used to find the non-trivial factors of N. The algorithm is probabilistically guaranteed, with probability greater than a half that the greatest common divisor $$\gcd (a^{r/2} \pm 1, N)$$ gives the prime factors of N2. Shor’s algorithm uses two quantum registers; a control register and a work register. The control register contains n qubits, each for one bit of precision in the algorithmic output. The work register contains $$m = \lceil \log _{2}{N}\rceil$$ qubits where m is the number of qubits to encode N. The measurement of the control register outputs a probability distribution peaked at approximately the values of $$2^n s/r$$, where s is associated with the outcome of the measurement and thus randomly assigned. The details of how the peaked probability distribution comes about are given in the order-finding routine outlined below. One can determine the order r from the peak values of the distribution using continued fractions, with a number of operations that scales polynomially in $$\lceil \log _{2}{N}\rceil$$. The procedure, or routine, for order finding is summarized below.

#### Order-finding routine

1. 1.

Initialization

Prepare $$|0\rangle ^{\otimes n}|0\rangle ^{\otimes m}$$ and apply $$H^{\otimes n}$$ on the control register and X on the $$m^\text {th}$$ qubit in the work register to create a superposition of $$2^n$$ states in the control register and $$|1\rangle$$ in the work register:

\begin{aligned} |0\rangle ^{\otimes n}|0\rangle ^{\otimes m}\rightarrow \frac{1}{2^{n/2}}\displaystyle \sum _{x=0}^{2^n-1} |x\rangle |1\rangle . \end{aligned}
2. 2.

Modular exponentiation function (MEF)

Conditionally apply the unitary operation $${\hat{U}}$$ that implements the modular exponentiation function $$a^x\>\text {mod}\>N$$ on the work register whenever the control register is in state $$|x\rangle$$:

\begin{aligned} \frac{1}{2^{n/2}} \displaystyle \sum _{x=0}^{2^n -1} |x\rangle |1\rangle&\rightarrow \frac{1}{2^{n/2}}\displaystyle \sum _{x=0}^{2^n-1} |x\rangle |a^x\>\text {mod}\>N\rangle \\&= \frac{1}{\sqrt{r2^n}}\displaystyle \sum _{s=0}^{r-1}\displaystyle \sum _{x=0}^{2^n-1} e^{2\pi i s x / r}|x\rangle |u_s\rangle . \\ \end{aligned}

In the second line, $$|u_s\rangle$$ is the eigenstate of $${\hat{U}}: {\hat{U}}|u_s\rangle = e^{2\pi i s/r}|u_s\rangle$$ and $$\frac{1}{\sqrt{r}}\displaystyle \sum _{s=0}^{r-1}|u_s\rangle = |1\rangle$$ has been used for the work register. The MEF operation is equivalent to applying $${\hat{U}}^x$$ to the work register when the state $$|x\rangle$$ is in the control register, as shown in Fig. 1, with $${\hat{U}}|y\rangle =|ay~{\rm mod}~N\rangle$$ for a given state $$|y\rangle$$ (the subscript a in Û is suppressed for notational convenience). This provides an alternative way to write the output state and allows a connection between the MEF operation and the QPE algorithm for the unitary operation $${\hat{U}}$$.

3. 3.

Inverse Quantum Fourier Transform (QFT)

Apply the inverse quantum Fourier transform on the control register:

\begin{aligned} \frac{1}{\sqrt{r2^n}}\displaystyle \sum _{s=0}^{r-1}\displaystyle \sum _{x=0}^{2^n-1} e^{2\pi i s x / r}|x\rangle |u_s\rangle \rightarrow \frac{1}{\sqrt{r}}\displaystyle \sum _{s=0}^{r - 1}|\varphi _s\rangle |u_s\rangle . \end{aligned}
4. 4.

Measurements

Measure the control register in the computational basis, yielding peaks in the probability for states where $$\varphi _s \simeq 2^n s / r$$ due to the inverse QFT. Thus, the outcome of the algorithm is probabilistic, however, there is a high probability of obtaining the location of the $$\varphi _s$$ peaks after only a few runs. The accuracy of $$\varphi _s$$ to $$2^{n} s / r$$ is determined by the number of qubits in the control register.

5. 5.

Continued fractions

Apply continued fractions to $$\varphi =\varphi _s/2^n$$ (the approximation of s/r) to extract out r from the convergents (see Supplementary information VII for details).

## Compiled Shor’s algorithm

A full-scale implementation of Shor’s algorithm to factor an L-bit number would require a quantum circuit with $$72L^3$$ quantum gates acting on $$5L + 1$$ qubits for the order-finding routine20, i.e. factoring $$N=21$$ would require 9000 elementary quantum gates acting on 26 qubits. The overhead in quantum gates comes from the modular exponentiation function part of the algorithm, while the overhead in qubits comes from the level of accuracy needed to successfully carry out the continued fractions part of the algorithm. Such an overhead obviously puts a full-scale implementation beyond the reach of current devices. However, compilation techniques such as the one described in Ref.20, bridge this gap and allow for small-scale proof-of-concept demonstrations, where the quantum circuit is tailored around properties of the number to be factored. This significantly simplifies the controlled-operations that realize the MEF operation (see previous section), which is the most resource-intensive part of the order-finding routine. The resource demands of the compiled quantum circuit are significantly reduced, making it suitable for quantum devices with low connectivity.

From Ref.11, we extend the compiled quantum order-finding routine for the particular case of factoring $$N = 21$$ with $$a=4$$ to accommodate another iteration for better precision in the resolution of the peaks for the value of $$2^ns /r$$. For the case of $$N=21$$, other choices of a give 2, 4 or 6 for r. The cases for $$r=2$$ or $$r=4$$ have been demonstrated for $$N=15$$4,7,8,9,10 and would bear a similar circuit structure in the present case. With only three iterations, $$r=6$$ would be out of reach as continued fractions would fail. For $$a=4$$ we have $$r=3$$, which is a choice that does not suffer from the aforementioned reasons. Despite r being an odd integer, the algorithm is successful in finding it from $$a=4$$. This is the case for certain choices of perfect square a and odd r, and $$a=4$$ and $$r=3$$ is such a case11.

In contrast to Ref.11, our implementation is not iterative and uses three qubits for the control register rather than one qubit recycled on every iteration. The iterative version is based on the recursive phase estimation, made possible by the use of the semi-classical QFT12. However, we have used the traditional QFT because mid-circuit measurements with real-time conditionals are not possible yet on IBM’s quantum processors. The traditional QFT for 3 qubits (see2—Box 5.1) that we implemented is equivalent to Fig. 1A and Fig. 1B in Ref.13. The latter is the semi-classical QFT that makes possible the implementation of the iterative version of Shor. If mid-circuit measurements with real-time conditionals were possible, the 3-qubit semi-classical QFT would be possible and may improve the quality of the results we present here through the use of only 1 qubit for the control register, as in Ref.11. IBM has suggested that the behaviour of real-time conditionals can be reproduced through post selection of the mid-circuit measurements. However, in the present case the speed up gained would be lost using this post selection method (see Supplementary information I).

In Ref.11, a step that is unique among the compilation steps of previous demonstrations, and central to their demonstration is mapping the three levels $$|1\rangle$$, $$|4\rangle$$ and $$|16\rangle$$ accessed by the possible $$2^L=2^5$$ levels of the work-register to only a single qutrit system. In our demonstration we also use this step, however IBM processors consist of qubits and so we represent the work register by three basis states from a two-qubit system and discard the fourth basis state as a null state. The states encoding the three possible levels of the work register; $$|1\rangle$$, $$|4\rangle$$ and $$|16\rangle$$ are mapped to $$|q_0q_1\rangle$$ according to

\begin{aligned} |1\rangle&\mapsto |\log _{4}1\rangle =|00\rangle , \nonumber \\ |4\rangle&\mapsto |\log _{4}4\rangle =|01\rangle , \nonumber \\ |16\rangle&\mapsto |\log _{4}16\rangle =|10\rangle . \end{aligned}
(1)

Therefore instead of evaluating $$4^x\>\text {mod}\>21$$ in the work register as described in step 2 of “Shor’s algorithm”, the compiled version of Shor’s algorithm effectively evaluates $$\log _{4}[4^x\>\text {mod}\>21]$$ in its place for $$x = 0,1\dots 2^3-1$$20, which reduces the size of the work register to 2 qubits in comparison to the 5 qubits required in the standard construction. Note the ordering of quantum bits in the work register is $$|q\rangle = |q_0\rangle |q_1\rangle$$, where the rightmost qubit is associated with the least significant bit. Similarly, with the control register we have $$|c\rangle = |c_0\rangle |c_1\rangle |c_2\rangle$$. In total the algorithm requires 5 qubits: 3 for the control register and 2 for the work register. Implementing the controlled unitaries $${\hat{U}}^x$$ that perform the modular exponentiation $$|x\rangle |y\rangle \rightarrow |x\rangle {\hat{U}}^x|y\rangle =|x\rangle |a^x y\>\text {mod}\>N\rangle$$ reduces to effectively swapping around the states $$|1\rangle$$, $$|4\rangle$$ and $$|16\rangle$$ in the work register controlled by the corresponding bit of the integer x in the control register, which is given by $$x=c_22^0 + c_12^1 + c_02^2$$. In other words, $${\hat{U}}^x={\hat{U}}^{c_02^2}{\hat{U}}^{c_12^1}{\hat{U}}^{c_22^0}$$. Thus, depending on the control qubit $$c_i$$, one of the following maps is applied:

\begin{aligned}&{\hat{U}}^{1}: \{ |1\rangle \mapsto |4\rangle , |4\rangle \mapsto |16\rangle , |16\rangle \mapsto |1\rangle \}, \nonumber \\&{\hat{U}}^{2}: \{ |1\rangle \mapsto |16\rangle , |4\rangle \mapsto |1\rangle , |16\rangle \mapsto |4\rangle \}, \nonumber \\&{\hat{U}}^{4}: \{ |1\rangle \mapsto |4\rangle , |4\rangle \mapsto |16\rangle , |16\rangle \mapsto |1\rangle \}. \end{aligned}
(2)

The next simplification step comes from the fact that these operations on the work register need not be controlled $$SWAP$$ (Fredkin) gates, they can be as simple as $${CX}$$ gates, as we show next.

### Modular exponentiation

Implementing $${\hat{U}}^{1}$$ on the two-qubit work register is simplified considerably by noting that the states $$|4\rangle$$ and $$|16\rangle$$ initially have zero amplitude, and thus the operation $$|1\rangle \mapsto |4\rangle$$ alone is sufficient. This operation can realized with a $${CX}$$ gate controlled by $$|c_2\rangle$$ targeting the second work qubit $$|q_1\rangle$$.

(3)

Similarly, the implementation of $${\hat{U}}^{2}$$ can be simplified by noting that the states $$|1\rangle$$ and $$|4\rangle$$ are the only non-zero amplitude states in the work register after $${\hat{U}}^1$$ may have been applied, thus prompting us to only consider $$|1\rangle \mapsto |4\rangle$$ and $$|4\rangle \mapsto |16\rangle$$. A $${CX}$$ gate controlled by $$|c_1\rangle$$ targeting $$|q_1\rangle$$ followed by a Fredkin gate, swapping $$|q_0\rangle$$ and $$|q_1\rangle$$ realizes this simplified $${\hat{U}}^{2}$$.

(4)

In the above, the Fredkin gate has been decomposed into a Toffoli gate ($${CCX}$$) and two $${CX}$$ gates. The subsequent implementation of $${\hat{U}}^{4}$$ admits no simplifications as all the possible states in the work register may have non-zero amplitude at this point. This operation is implemented with a Toffoli and a Fredkin gate with single-qubit $${X}$$ gates.

(5)

The full circuit diagram is shown in Fig. 2—note that before simplification the order of application of the controlled unitaries is interchangeable, $${\hat{U}}^{2^{(n-1)}}$$ or $${\hat{U}}^{2^{1}}$$ could be applied first. Interchanging the order only has the effect of interchanging the order of the outcome bits at the end of the computation. This is the reason the order of application of the controlled unitaries here is in reverse order to that in Ref.11.

### Modular exponentiation with relative phase Toffolis

In total, the modular exponentiation routine requires three Toffoli gates; traditionally a single Toffoli gate can be decomposed into six $${CX}$$ gates and several single-qubit gates2 as follows

(6)

Taking into account a given processor’s topology and the constraints it poses, as well as other parts of the circuit (the inverse QFT), further increases the tally of $${CX}$$ gates. This becomes undesirable as it is understood that there is an upper limit on the number of $${CX}$$ gates that can be in a circuit with the guarantee of a successful computation. The number of $${CX}$$ gates from the decomposition of the Toffoli gate can be cut in half if we permit the operation to be correct up to relative phase shifts. Margolus constructed a gate that implements the Toffoli gate up to a relative phase shift of $$|101\rangle \mapsto -|101\rangle$$ that only uses three $${CX}$$ gates and four single qubit gates21. This construction has been shown to be optimal22.

(7)

Maslov showed the advantages of using a relative phase Toffoli gate when the gate is applied last or when relative phases do not matter for certain configurations of Toffolis, resulting in no overall change to the functionality in any significant way23. The configuration in the circuit shown in Fig. 2 is one such configuration that permits a replacement of Toffoli gates with Margolus gates without changing the overall functionality. All the Margolus gates in the circuit in Fig. 3 (which is the circuit in Fig. 2 with the Toffoli gates replaced by Margolus gates) never encounter the basis state $$|101\rangle$$, thus leaving the operation of the circuit unchanged. See Supplementary information II for details. This further compacting reduces the number of $${CX}$$ gates considerably and puts the algorithm within reach of current IBM processors with a limited number of noisy qubits.

## Experiments

### Physical qubit mapping

The proposed compiled circuit in Fig. 3 was mapped onto 5 physical qubits (3 control qubits and 2 work qubits) and executed on a sub-processor of IBM’s 7-qubit quantum processor ibmq_casablanca and 27-qubit quantum processor ibmq_toronto, which we will refer to as 7Q and 27Q, and whose topologies are shown in Fig. 4. When mapping the compiled circuit a few considerations can be taken into account. First, as can be seen from Eq. (7), the Margolus gate can be implemented on a collinear set of qubits, as the first control qubit need not be connected to the second control qubit. On the other hand, mapping the three-qubit inverse QFT onto physical qubits without incurring additional $$SWAP$$ gates is not possible, as the three controlled-phase gates require all three qubits to be interconnected in a triangle and the aforementioned quantum processors do not have such a topology. Additionally, more $$SWAP$$ gates are introduced to the transpiled circuit, as the processor topologies do not permit the topology required by the compiled circuit, as shown in Fig. 5.

The only possible five-qubit mappings on the quantum processors are all isomorphic to either a collinear set of qubits or a T-shaped set of qubits, as shown in Fig. 6a,b. Choosing the mapping in Fig. 6b over the one in Fig. 6a is motivated by the fact that the former is slightly more connected than latter and thus in effect would reduce the number of $$SWAP$$ gates in the mapped and transpiled circuit.

### Performance

To evaluate the performance of the algorithm, we first transpiled the circuit in Fig. 3 down to the chosen quantum processor with the mapping below

\begin{aligned}&0 \mapsto c_0, \nonumber \\&1 \mapsto c_2, \nonumber \\&4 \mapsto c_1, \nonumber \\&2 \mapsto q_1, \nonumber \\&3 \mapsto q_0. \end{aligned}
(8)

Through the transpiler’s optimization, with the mapping above it is possible to have a circuit that has 25 $${CX}$$ gates and a circuit depth of 35. Figure 7 shows the results of measurements on the control register qubits from the two processors, where measurement error mitigation has been applied to results and mitigates the effect of measurement errors on the raw results (see Supplementary information III). The outcomes $$|101\rangle$$ and $$|110\rangle$$ occur with probability $$\sim 14\%$$ and $$\sim 16\%$$ respectively. The theoretical ideal probability is $$\sim 25\%$$, as can be seen from the simulator results in Fig. 7. However, the amplification of the peaks $$|000\rangle$$, $$|101\rangle$$ and $$|110\rangle$$ is clearly visible from the processor outcomes.

We quantify the successful performance of the algorithm by comparing the experimental and ideal probability distributions via the trace distance or Kolmogorov distance2, which measures the closeness of two discrete probability distributions P and Q and is defined by the equation $$D(P,Q) \equiv \sum _{x \in {\mathcal {X}}}|P(x) - Q(x)|/2$$, where $${{{\mathcal {X}}}}$$ represents all possible outcomes. This measure shows an agreement between measured and ideal results—the trace distance between the measured distribution and the ideal distribution is 0.1694 and 0.1784 for ibmq_toronto and ibmq_casablanca, respectively. On the other hand, the trace distance between the ideal distribution and a candidate random uniform distribution is 0.4347. Furthermore, we evaluate the performance of the algorithm by characterizing the measured output state in the control register, this is achieved via state tomography yielding the density matrix of the measured state. The measured state and ideal state on the output register are quantitatively compared using the fidelity for two quantum states $$\rho$$ and $$\sigma$$, and is defined to be $$F(\rho , \sigma ) \equiv \text {tr}\sqrt{\rho ^{1/2}\sigma \rho ^{1/2}}$$2. We measured a fidelity of $$F(\rho _\text {id}, \rho _{27Q})=0.6948 \pm 00650$$ and $$F(\rho _\text {id}, \rho _{7Q}) = 0.70 \pm 0.0275$$ on the 27 qubit and 7 qubit quantum processors respectively, as shown in Fig. 8. In Fig. 9 we show the estimated density matrices in the computational basis for each respective device.

### Factoring $$N=21$$

The measured probability distributions in Fig. 7 are peaked in probability for the outcomes $$000~(\varphi _s=0)$$, $$101~(\varphi _s=5)$$ and $$110~(\varphi _s=6)$$, with ideal probabilities of 0.35, 0.25 and 0.25, respectively. Here we are using the integer representation of the binary outcomes. The outcome 000 corresponds to a failure of the algorithm11. For the outcome 101, computing the continued fraction expansion of $$\varphi =\varphi _s/2^n=5/2^3=5/8$$ gives the convergents $$\{0, 1, 1/2, 2/3, 5/8\}$$ (see Supplementary information VII for details), so that the third convergent 2/3 in the expansion can be identified as s/r and correctly gives $$r=3$$ as the order when tested with the relation $$a^r~{\rm mod}~N=1$$, while the other convergents do not give an r that passes the test. On the other hand, the continued fraction expansion of $$\varphi =6/8$$ gives $$\{0, 1, 3/4\}$$ and incorrectly gives $$r=4$$ as the order (see Supplementary information VII for details). This failure can be avoided in principle by adding further qubits to the control register so that the peak in the probability distribution becomes narrower and more well defined11. Another option is to simply apply continued fractions to all peaked outcomes and test if the value of r found satisfies the order relation for a and N. It is interesting to note that from the results of Ref.11, successfully finding the order $$r=3$$ was not possible to achieve, as with only two bits of accuracy in the experiment the continued fractions would always fail due to the peaked outcomes of 10 (2) and 11 (3) giving the convergents of $$\{0,1/2\}$$ and $$\{0,1,3/4\}$$, respectively. In our case, we successfully find $$r=3$$, from which we obtain $$\mathrm{gcd}(a^{r/2}\pm 1,N)=\mathrm{gcd}(8\pm 1,21)=3$$ and 7. Thus, with our demonstration, extending the number of outcome bits to three has allowed us to fully perform the quantum factoring of $$N=21$$.

### Verification of entanglement

The presence of entanglement between the control and work registers is known to be a requirement for the algorithm to gain any advantageous speedup over its classical counterparts in general6,24,25. For detecting genuine multipartite entanglement around the vicinity of an ideal state $$|\psi \rangle$$, one can construct a projector-based witness such as the one below:

\begin{aligned} \hat{{\mathcal {W}}}_{\psi } = \alpha {\mathbb {I}} - |\psi \rangle \langle \psi |, \end{aligned}
(9)

where $$\alpha$$ is the square of the maximum overlap between $$|\psi \rangle$$ and all biseparable states. In other words, $$\text {tr}(\hat{{\mathcal {W}}}_{\psi }\rho ) \ge 0$$ for biseparable states and $$\text {tr}(\hat{{\mathcal {W}}}_{\psi }\rho ) < 0$$ for states with genuine multipartite entanglement in the vicinity of $$|\psi \rangle$$26. For the ideal state after modular exponentiation (but before the inverse QFT) in both the control and work registers, $$\alpha =0.75$$ was found using the method described in the appendix of Ref.26. This was implemented using the software package QUBIT4MATLAB27. Therefore ideally the state in both registers after modular exponentiation has genuine multipartite entanglement.

In order to check whether the output state from the IBM processors is close to the ideal state and has genuine multipartite entanglement, full state tomography would normally be needed to characterize the state $$\rho _\text {exp}$$ in both the control and work registers. This would require $$3^5$$ measurements, making it impractical to gather a sufficiently large data set within a meaningful time frame. However, we need not measure the full density matrix, the quantity $$\text {tr}(|\Psi \rangle \langle \Psi |\rho _\text {exp})$$ suffices. To measure this, we can decompose $$\rho =|\Psi \rangle \langle \Psi |$$ into 293 Pauli expectations as

\begin{aligned} |\Psi \rangle \langle \Psi | = \displaystyle \sum _{ijklm} p_{ijklm} \sigma _i^{(1)}\sigma _j^{(2)}\sigma _k^{(3)}\sigma _l^{(4)}\sigma _m^{(5)}, \end{aligned}
(10)

where $$\sigma _{i} = \{I, X, Y, Z\}$$ are the usual Pauli matrices plus the identity. However, the number of measurements needed to obtain all 293 expectation values can be reduced28. This is because the measured probabilities from a measurement of a single Pauli expectation value, i.e. $$\langle ZZZZZ\rangle$$, can be summed in various combinations to derive other Pauli expectations values, i.e. $$\langle ZIZZZ\rangle , \langle IZZZZ\rangle$$, etc. The values derived are nothing but the marginalization of the measured probabilities over the outcome space of some set of qubits (see Supplementary information V for details). We can do the same for each term in the set of terms from the Pauli decomposition of $$\rho$$, calling it $${\mathcal {S}}_d$$, forming a set of other Pauli terms that can be derived from the same counts. Taking the union of these sets to be $${\mathcal {S}}_u$$, the complement $${\mathcal {S}}_d\backslash {\mathcal {S}}_u$$ gives the 79 terms we only need to measure (see Supplementary information V). We measure the 79 Pauli expectation values of the terms above with respect to the state in both registers after modular exponentiation and from this we compute/derive the 293 terms in $${\mathcal {S}}_d$$ and therefore $$\text {tr}(|\Psi \rangle \langle \Psi |\rho _\text {exp})$$. The measured probabilities for each term, some of them shown in Fig. 10, result in an expectation value of $$\text {tr}(|\Psi \rangle \langle \Psi |\rho _{7Q}) = 0.677 \pm 0.00365$$ and $$\text {tr}(|\Psi \rangle \langle \Psi |\rho _{27Q}) = 0.626 \pm 0.00304$$, which leads to

\begin{aligned} \text {tr}(\hat{{\mathcal {W}}}_{\Psi }\rho _{7Q})&= 0.0729 \pm 0.00365,\nonumber \\ \text {tr}(\hat{{\mathcal {W}}}_{\Psi }\rho _{27Q})&= 0.124 \pm 0.00304. \end{aligned}
(11)

The results obviously fail to detect genuine multipartite entanglement, however, this does not mean entanglement is entirely absent. Consider the square of the maximum overlap between the ideal state $$|\Psi \rangle$$ and all pure states $$|\theta \rangle$$ that are unentangled product states with respect to some bipartite partition (bipartition) $${\mathcal {B}}$$ of the qubits,

\begin{aligned} \underset{\theta \in {\mathcal {B}}}{\max }{|\langle {\theta }|{\Psi }\rangle |^2} = \beta _{\Psi }. \end{aligned}
(12)

Thus, any other state $$|\xi \rangle$$ for which

\begin{aligned} |\langle {\xi }|{\Psi }\rangle |^2 > \beta _{\Psi } \end{aligned}
(13)

cannot be a product state with respect to the bipartition $${\mathcal {B}}$$, implying that there is non-separability, or entanglement, across this bipartition. The above result extends to mixed states $$\rho _\xi$$ due to the convex sum nature of mixed quantum states27. We compute Eq. (12) for all possible bipartitions of our ideal state $$|\Psi \rangle$$ (see Supplementary information VI for more details).

For the experimental state $$\rho _{7Q}$$ we find, with the exception of the bipartition $${{{\mathcal {B}}}}=(c_0c_1c_2q_1)(q_0)$$, that it is non-separable with respect to all other bipartitions, i.e. the square of the overlap between $$\rho _{7Q}$$ and $$|\Psi \rangle$$ ($$\sim 0.677$$) is greater than the maximal square overlap between $$|\Psi \rangle$$ and all product states in each of these bipartitions. Similarly for $$\rho _{27Q}$$, with the exception of bipartitions $${{{\mathcal {B}}}}=(c_0c_1c_2q_1)(q_0)$$ and $${{{\mathcal {B}}}}=(c_0c_1c_2q_0)(q_1)$$, the state is non-separable with respect to all other bipartitions. Most notably, both $$\rho _{7Q}$$ and $$\rho _{27Q}$$ are non-separable with respect to the bipartition $${{{\mathcal {B}}}}=(c_0c_1c_2)(q_0q_1)$$, which is a bipartition between the control and work registers. This implies that non-separability or entanglement is present between the registers, as required for the algorithm’s speedup in general6,24,25. Furthermore, the maximum (not necessarily global but a good proxy of it) expectation value of the operator $$|\Psi \rangle \langle \Psi |$$ for product states, is found via a greedy search algorithm27 to be around 0.30, further asserting that indeed the qubits are entangled with each other in some way.

## Concluding remarks

In summary, we have implemented a compiled version of Shor’s algorithm on IBM’s quantum processors for the prime factorization of 21. By using relative phase shift Toffoli gates, we were able to reduce the resource demands that would have been required in the standard compiled and non-iterative construction of Shor’s algorithm (with regular Toffoli gates), and still preserve its functional correctness. The use of relative phase shift Toffoli gates has also allowed us to extend the implementation in Ref.11 to an increased resolution. Moreover, while the latter implementation used only 1 recycled qubit for the control register, in contrast to our three qubits, it falls one iteration short of achieving full factoring for the reasons already mentioned. It is not clear what additional resource overheads (single and two-qubit gates) would be needed in implementing another iteration in their scheme and it is likely that these overheads are what prevented the full factoring of 21 in the photonic setup used. Furthermore, we note that in principle there is no real advantage in using three qubits for the control register as we have done here instead of one qubit recycled, as in Ref.11. However, in practice it is not possible at present to recycle qubits on the IBM processors and so we used three qubits instead. In future, once this capability is added, a further reduction in resources will be possible for our implementation, potentially improving the quality of the results even more.

We have verified, via state tomography, the output state in the control register for the algorithm, achieving a fidelity of around 0.70. For the verification of entanglement generated during the algorithm’s operation, the resource demands of state tomography were circumvented by measuring a much reduced number of Pauli measurements to uniquely identify a quantum state28. However, this method is quite specialized and cannot be easily generalized to larger systems. In scaling up Shor’s algorithm to higher integers beyond 21 using larger quantum systems, other methods of quantum tomography can be used to characterize the performance. These include compressed sensing29 and classical shadows30, which give theoretical guarantees, and improved scaling in the number of Pauli measurements and classical post-processing than standard methods. In the case when the state belongs to a class of states with certain symmetries, such as stabilizer states, only a few measurements are required for measuring the fidelity and detecting multipartite entanglement31. However, not all entangled states are neatly housed within these well-studied classes. Ref.32 introduces a device-independent method for multipartite entanglement detection which scales polynomially with the system size by relaxing some constraints. Another scheme constructs witnesses that require a constant number of measurements of the system size at the cost of robustness against white noise. This provides a fast and simple procedure for entanglement detection33. Many fundamental questions on the subjects of quantum tomography and multipartite entanglement still remain to be answered34 and advances will help in efficiently quantifying the performance of algorithms in larger quantum processors.

Our demonstration involves a two-fold reduction of the resource count from the full circuit in Fig. 2 via the replacement of regular Toffoli gates with relative phase variants, which is an approach that is in the spirit of the NISQ era; tailoring quantum circuits to circumvent the shortcomings of noisy quantum processors. In addition, we suspect that we can further reduce the resource count through the use of the approximate QFT35, while still maintaining a clear resolution of the peaks in the output probability distribution. A possible avenue of future research derived from what we have reported here is the investigation and identification of scenarios where one can replace Toffoli gates with relative phase Toffoli gates while preserving the functional correctness, in a wide range of algorithms including Shor’s algorithm, as seen here. In the present case, whether such an approach is special to the case of $$N=21$$ or extendable to other N is not known. Ref.23 has performed some work in this regard, however a proper analysis and systematic composition of relative phase Toffoli gates for such purposes is still an open problem. In future, a similar approach may make possible the factorization of larger numbers with adequate accuracy in resolution of the algorithm’s outcomes and their characterization.