Author Correction: Demonstration of Shor’s factoring algorithm for N = 21 on IBM quantum processors

We report a proof-of-concept demonstration of a quantum order-finding algorithm for factoring the integer 21. Our demonstration involves the use of a compiled version of the quantum phase estimation routine, and builds upon a previous demonstration. We go beyond this work by using a configuration of approximate Toffoli gates with residual phase shifts, which preserves the functional correctness and allows us to achieve a complete factoring of [Formula: see text]. We implemented the algorithm on IBM quantum processors using only five qubits and successfully verified the presence of entanglement between the control and work register qubits, which is a necessary condition for the algorithm's speedup in general. The techniques we employ may be useful in carrying out Shor's algorithm for larger integers, or other algorithms in systems with a limited number of noisy qubits.


I. INTRODUCTION
Shor's algorithm [1] is a quantum algorithm that provides a way of finding the nontrivial factors of an L-bit odd composite integer N = pq in polynomial time with high probability.The crux of Shor's algorithm rests upon Quantum Phase Estimation (QPE) [2], which is a quantum routine that estimates the phase ϕ u of an eigenvalue e 2πiϕu corresponding to an eigenvector |u for some unitary matrix Û .QPE efficiently solves a problem related to factoring, known as the order-finding problem, in polynomial time in the number of bits needed to specify the problem, which in this case is L = log 2 N .By solving the order-finding problem using QPE and carrying out a few extra steps, one can factor the integer N .There is no known classical algorithm that can solve the same problem in polynomial time [2,3].
A large corpus of work has been done with regards to the experimental realization of Shor's algorithm over the years.The pioneering work was performed with liquid-state nuclear magnetic resonance, factoring 15 on a 7-qubit quantum computer [4].The considerable resource demands of Shor's original algorithm were circumvented by using various approaches, including adiabatic quantum computing [5] and in the standard network model using techniques of compilation [6] that reduced the demands to within the reach of single-photon architectures [7][8][9] and a super-conducting phase qubit system [10].In 2012, a proof-of-concept demonstration of the order-finding algorithm for the integer 21 was carried out with photonic qubits using, in addition to the aforementioned compilation technique, an iterative scheme [11], where the control register is reduced to one qubit and this qubit is reset and reused [12,13].However, factoring was not possible in this demonstration due to the low number of iterations.Later, the iterative scheme was demonstrated for factoring 15, 21 and 35 on an IBM quantum processor by splitting up the itera-tions and combining the outcomes [14].Recently, building on previous schemes of hybrid factorization [15,16], a quantum-classical hybrid scheme has been implemented on IBM's quantum processors for the prime factorization of 35.This hybrid scheme of factorization alleviates the resource requirements of the algorithm at the expense of performing part of the factoring classically [17].
In this paper, we build on the order-finding routine of Ref. [11] and implement a version of Shor's algorithm for factoring 21 using only 5 qubits -the work register contains 2 qubits and the control register contains 3 qubits, each providing 1-bit of accuracy in the resolution of the peaks in the output probability distribution used to find the order.This approach is in contrast to the iterative version [18] used in Refs.[11] and [14], which employs a single qubit that is recycled through measurement and feed-forward, giving 1-bit of accuracy each time it is recycled.The advantage of the iterative approach lies in this very reason; through mid-circuit measurement and real-time conditional feed-forward operations, the total number of qubits required by the algorithm is significantly reduced.At the time of writing, IBM's quantum processors do not yet support real-time conditionals necessary for the implementation of the iterative approach, so we use 3 qubits for the control register, one for each effective iteration.Thus, our compact approach is completely equivalent to the iterative approach.In future, once the capability of performing real-time conditionals is added, a further reduction in resources will be possible for our implementation, potentially improving the quality of the results even more and opening up the possibility of factoring larger integers.
As it stands, the controlled-NOT (cX) gate count of the standard algorithm [19] exceeds 40 and in preliminary tests we have found that the output probability distribution is indistinguishable from a uniform probability distribution (noise) on the IBM quantum processors.Our improved version reduces the cX gate count through the use of relative phase Toffoli gates, reducing the cX gate count by half while leaving the overall operation of the circuit unchanged and we suspect this technique may extend beyond the case considered here.We have gone further than the work in Ref. [11], where full factorization of 21 was not achieved as with only two bits of accuracy for the peaks of the output probability distribution continued fractions would fail to extract the correct order.On the other hand, in the work in Ref. [14], where 21 was factored on an IBM processor, a larger number of 6 qubits was required and the iterations were split into three separate circuits, with the need to re-initialise the work register into specific quantum states for each iteration.Our approach is thus more efficient and compact, enabling algorithm outcomes with reduced noise.To support our claims, we successfully carry out continued fractions and evaluate the performance of the algorithm by (i) quantitatively comparing the measured probability distribution with the ideal distribution and noise via the Kolmogorov distance, (ii) performing state tomography experiments on the control register, and (iii) verifying the presence of entanglement across both registers.
The paper is organized as follows.In Sec.II, we give a brief review of the order-finding problem and its relation to Shor's algorithm.In Sec.III we expound on the compiled version of Shor's algorithm, where we consider the specific case of the factorization of N = 21.We construct the quantum circuits that realize the required modular exponentiation unitaries and proceed to optimize their cX gate count through the introduction of relative phase Toffoli gates.We report our results from executing our compact construction of the algorithm on IBM's quantum computers in Sec.IV.Finally, we provide concluding remarks of our study in Sec.V.An appendix is also included.

A. Order finding
The order-finding problem is typically stated as follows.Given positive integers N and a ∈ {0, 1, . . ., N − 1} that share no common factors, we seek to find the least positive integer r ∈ {0, 1, . . .N } such that a r mod N = 1.The integer r is said to be the order of a and N , and the order-finding problem is that of finding r for a particular a and N .There exists no classical algorithm that can solve the order-finding problem efficiently, that is, with operations (elementary gates) that scale polynomially in the number of bits needed to specify N , i.e.L ≡ log 2 N [2, 3].

B. Shor's algorithm
The order-finding problem can be efficiently solved on a quantum computer with O(L 3 ) operations; the cost being mostly due to the modular exponentiation operation which requires O(L 3 ) quantum gates [2].The problem of prime factorization is the subject of Shor's algorithm, which is equivalent to the order-finding problem: for an FIG. 1. Schematic of the routine used for the period finding part of Shor's algorithm.The first (control) register has n qubits.The number of qubits in the control register determines the bit-accuracy of the value of 2 n s/r.The bottom (work) register has the m qubits required to encode N .First, the control and work registers are initialized, then conditional modular exponentiation is performed, indicated by the controlled unitary and an inverse quantum Fourier transform is applied to the control register followed by a standard computational basis measurement.The circuit is essentially the QPE algorithm applied to the unitary matrix Ûa -see text for details.
L-bit positive odd integer N = pq and randomly chosen positive integer a ≤ N co-prime to N , the order r of a and N can be used to find the non-trivial factors of N .The algorithm is probabilistically guaranteed, with probability greater than a half that the greatest common divisor gcd(a r/2 ± 1, N ) gives the prime factors of N [2].Shor's algorithm uses two quantum registers; a control register and a work register.The control register contains n qubits, each for one bit of precision in the algorithmic output.The work register contains m = log 2 N qubits where m is the number of qubits to encode N .The measurement of the control register outputs a probability distribution peaked at approximately the values of 2 n s/r, where s is associated with the outcome of the measurement and thus randomly assigned.The details of how the peaked probability distribution comes about are given in the order-finding routine outlined below.One can determine the order r from the peak values of the distribution using continued fractions, with a number of operations that scales polynomially in log 2 N .The procedure, or routine, for order finding is summarized below.
Order-finding routine

Initialization
Prepare |0 ⊗n |0 ⊗m and apply H ⊗n on the control register and X on the m th qubit in the work register to create a superposition of 2 n states in the control register and |1 in the work register:

Modular exponentiation function (MEF)
Conditionally apply the unitary operation Û that implements the modular exponentiation function a x modN on the work register whenever the control register is in state |x :

Measurements
Measure the control register in the computational basis, yielding peaks in the probability for states where ϕ s 2 n s/r due to the inverse QFT.Thus, the outcome of the algorithm is probabilistic, however, there is a high probability of obtaining the location of the ϕ s peaks after only a few runs.The accuracy of ϕ s to 2 n s/r is determined by the number of qubits in the control register.

Continued fractions
Apply continued fractions to ϕ = ϕ s /2 n (the approximation of s/r) to extract out r from the convergents (see Appendix G for details).

III. COMPILED SHOR'S ALGORITHM
A full-scale implementation of Shor's algorithm to factor an L-bit number would require a quantum circuit with 72L 3 quantum gates acting on 5L + 1 qubits for the order-finding routine [20], i.e. factoring N = 21 would require 9000 elementary quantum gates acting on 26 qubits.The overhead in quantum gates comes from the modular exponentiation function part of the algorithm, while the overhead in qubits comes from the level of accuracy needed to successfully carry out the continued fractions part of the algorithm.Such an overhead obviously puts a full-scale implementation beyond the reach of current devices.However, compilation techniques such as the one described in Ref. [20], bridge this gap and allow for smallscale proof-of-concept demonstrations, where the quantum circuit is tailored around properties of the number to be factored.This significantly simplifies the controlledoperations that realize the MEF operation (see previous section), which is the most resource-intensive part of the order-finding routine.The resource demands of the compiled quantum circuit are significantly reduced, making it suitable for quantum devices with low connectivity.
From Ref. [11], we extend the compiled quantum orderfinding routine for the particular case of factoring N = 21 with a = 4 to accommodate another iteration for better precision in the resolution of the peaks for the value of 2 n s/r.For the case of N = 21, other choices of a give 2, 4 or 6 for r.The cases for r = 2 or r = 4 have been demonstrated for N = 15 [4,[7][8][9][10] and would bear a similar circuit structure in the present case.With only three iterations, r = 6 would be out of reach as continued fractions would fail.For a = 4 we have r = 3, which is a choice that does not suffer from the aforementioned reasons.Despite r being an odd integer, the algorithm is successful in finding it from a = 4.This is the case for certain choices of perfect square a and odd r, and a = 4 and r = 3 is such a case [11].
In contrast to Ref. [11], our implementation is not iterative and uses three qubits for the control register rather than one qubit recycled on every iteration.The iterative version is based on the recursive phase estimation, made possible by the use of the semi-classical QFT [12].However, we have used the traditional QFT because midcircuit measurements with real-time conditionals are not possible yet on IBM's quantum processors.The traditional QFT for 3 qubits (see [2] -Box 5.1) that we implemented is equivalent to Fig. 1A and Fig. 1B in Ref. [13].The latter is the semi-classical QFT that makes possible the implementation of the iterative version of Shor.If mid-circuit measurements with real-time conditionals were possible, the 3-qubit semi-classical QFT would be possible and may improve the quality of the results we present here through the use of only 1 qubit for the control register, as in Ref. [11].IBM has suggested that the behaviour of real-time conditionals can be reproduced through post selection of the mid-circuit measurements.However, in the present case the speed up gained would be lost using this post selection method (see Appendix A).
In Ref. [11], a step that is unique among the compilation steps of previous demonstrations, and central to their demonstration is mapping the three levels |1 , |4 and |16 accessed by the possible 2 L = 2 5 levels of the work-register to only a single qutrit system.In our demonstration we also use this step, however IBM processors consist of qubits and so we represent the work register by 3 basis states from a two-qubit system and discard the fourth basis state as a null state.The states encoding the three possible levels of the work register; |1 , |4 and |16 are mapped to |q 0 q 1 according to Therefore instead of evaluating 4 x mod 21 in the work register as described in step 2 of Sec.II B, the compiled version of Shor's algorithm effectively evaluates log 4 [4 x mod 21] in its place for x = 0, 1 . . . 2 3 − 1 [20], which reduces the size of the work register to 2 qubits in comparison to the 5 qubits required in the standard construction.Note the ordering of quantum bits in the work register is |q = |q 0 |q 1 , where the rightmost qubit is associated with the least significant bit.Similarly, with the control register we have |c = |c 0 |c 1 |c 2 .In total the algorithm requires 5 qubits: 3 for the control register and 2 for the work register.Implementing the controlled unitaries Û x that perform the modular exponentiation |x |y → |x Û x |y = |x |a x y mod N reduces to effectively swapping around the states |1 , |4 and |16 in the work register controlled by the corresponding bit of the integer x in the control register, which is given by x = c 2 2 0 + c 1 2 1 + c 0 2 2 .In other words, Û x = Û c02 2 Û c12 1 Û c22 0 .Thus, depending on the control qubit c i , one of the following maps is applied: The next simplification step comes from the fact that these operations on the work register need not be controlled SWAP (Fredkin) gates, they can be as simple as cX gates, as we show next.

A. Modular exponentiation
Implementing Û 1 on the two-qubit work register is simplified considerably by noting that the states |4 and |16 initially have zero amplitude, and thus the operation |1 → |4 alone is sufficient.This operation can realized with a cX gate controlled by |c 2 targeting the second work qubit |q 1 .
Similarly, the implementation of Û In the above, the Fredkin gate has been decomposed into a Toffoli gate (ccX) and two cX gates.The subsequent implementation of Û 4 admits no simplifications as all the possible states in the work register may have non-zero amplitude at this point.This operation is implemented with a Toffoli and a Fredkin gate with single-qubit X gates.
The full circuit diagram is shown in Fig. 2 -note that before simplification the order of application of the controlled unitaries is interchangeable, Û 2 (n−1) or Û 2 1 could be applied first.Interchanging the order only has the effect of interchanging the order of the outcome bits at the end of the computation.This is the reason the order of application of the controlled unitaries here is in reverse order to that in Ref. [11].

B. Modular exponentiation with relative phase Toffolis
In total, the modular exponentiation routine requires three Toffoli gates; traditionally a single Toffoli gate can be decomposed into six cX gates and several single-qubit gates [2] as follows Taking into account a given processor's topology and the constraints it poses, as well as other parts of the circuit (the inverse QFT), further increases the tally of cX gates.This becomes undesirable as it is understood that there is an upper limit on the number of cX gates that can be in a circuit with the guarantee of a successful computation.The number of cX gates from the decomposition of the Toffoli gate can be cut in half if we permit the operation to be correct up to relative phase shifts.Margolus constructed a gate that implements the Toffoli gate up to a relative phase shift of |101 → − |101 that only uses three cX gates and four single qubit gates [21].This construction has been shown to be optimal [22].
The advantages of relative phase Toffoli gates extend beyond the commonly conceived scenarios, i.e. when the gate is applied last or when the relative phase shifts do not matter for certain configurations of multiplycontrolled Toffoli gates.Maslov reported circuit identities that permit the replacement of Toffoli gates with their relative phase variants in certain configurations, resulting in no overall change to the functionality in any significant way [23].The configuration in the circuit shown in Fig. 2 is one such configuration that permits a replacement of Toffoli gates with Margolus gates without changing the overall functionality.All the Margolus gates in the circuit in Fig. 3 (which is the circuit in Fig. 2 with the Toffoli gates replaced by Margolus gates) never encounter the basis state |101 , thus leaving the operation of the circuit unchanged.See Appendix B for de-tails.This further compacting reduces the number of cX gates considerably and puts the algorithm within reach of current IBM processors with a limited number of noisy qubits.

IV. EXPERIMENTS A. Physical qubit mapping
The proposed compiled circuit in Fig. 3 was mapped onto 5 physical qubits (3 control qubits and 2 work qubits) and executed on a sub-processor of IBM's 7qubit quantum processor ibmq casablanca and 27qubit quantum processor ibmq toronto, which we will refer to as 7Q and 27Q, and whose topologies are shown in Fig. 4. When mapping the compiled circuit a few considerations can be taken into account.First, as can be seen from Eq. ( 7), the Margolus gate can be implemented on a collinear set of qubits, as the first control qubit need not be connected to the second control qubit.On the other hand, mapping the three-qubit inverse QFT onto physical qubits without incurring additional SWAP gates is not possible, as the three controlled-phase gates require all three qubits to be interconnected in a triangle and the aforementioned quantum processors do not have such a topology.Additionally, more SWAP gates are introduced to the transpiled circuit, as the processor topologies do not permit the topology required by the compiled circuit, as shown in Fig. 5.
The only possible five-qubit mappings on the quantum processors are all isomorphic to either a collinear set of qubits or a T-shaped set of qubits, as shown in Fig. 6 a  and b.Choosing the mapping in Fig. 6 b over the one in Fig. 6 a is motivated by the fact that the former is slightly more connected than latter and thus in effect would reduce the number of SWAP gates in the mapped and transpiled circuit.

B. Performance
To evaluate the performance of the algorithm, we first transpiled the circuit in Fig. 3 down to the chosen quantum processor with the mapping below Through the transpiler's optimization, with the mapping above it is possible to have a circuit that has 25 cX gates and a circuit depth of 35.Fig. 7 shows the results of measurements on the control register qubits from the two processors, where measurement error mitigation has been applied to results and mitigates the effect of measurement errors on the raw results (see Appendix C).The FIG. 5. Qubit connections required by the compiled circuit in Fig. 3.
outcomes |011 and |101 occur with probability ∼ 16% and ∼ 19% on ibmq toronto and ∼ 18% and ∼ 17% on ibmq casablanca, respectively.The theoretical ideal probability is ∼ 25%, as can be seen from the simulator results in Fig. 7.However, the amplification of the peaks |000 , |011 and |101 is clearly visible from the processor outcomes.
We quantify the successful performance of the algorithm by comparing the experimental and ideal probability distributions via the trace distance or Kolmogorov distance [2], which measures the closeness of two discrete probability distributions P and Q and is defined by the equation where X represents all possible outcomes.This measure shows an agreement between measured and ideal results -the trace distance between the measured distribution and the ideal distribution is 0.1694 and 0.1784 for ibmq toronto and ibmq casablanca, respectively.On the other hand, the trace distance between the ideal distribution and a candidate random uniform distribution is 0.4347.Furthermore, we evaluate the performance of the algorithm by characterizing the measured output state in the control register, this is achieved via state tomography yielding the density matrix of the measured state.The measured state and ideal state on the output register are quantitatively compared using the fidelity for two quantum states ρ and σ, and is defined to be F (ρ, σ) ≡ tr ρ 1/2 σρ 1/2 [2].We measured a fidelity of F (ρ id , ρ 27Q ) = 0.6948 ± 00650 and F (ρ id , ρ 7Q ) = 0.70 ± 0.0275 on the 27 qubit and 7 qubit quantum processors respectively, as shown in Fig. 8.In Fig. 9 we show the estimated density matrices in the computational basis for each respective device.

C. Factoring N = 21
The measured probability distributions in Fig. 7 are peaked in probability for the outcomes 000 (ϕ s = 0), 011 (ϕ s = 3) and 101 (ϕ s = 5), with ideal probabilities of 0.35, 0.25 and 0.25, respectively.Here we are using the integer representation of the binary outcomes.The outcome 000 corresponds to a failure of the algorithm [11].For the outcome 011, computing the continued fraction expansion of ϕ = ϕ s /2 n = 3/2 3 = 3/8 gives the conver-FIG.6.The two possible 5-qubit processor mappings on the architectures shown in Fig. 4.
gents {0, 1/2, 1/3, 3/8} (see Appendix G for details), so that the third convergent 1/3 in the expansion can be identified as s/r and correctly gives r = 3 as the order when tested with the relation a r mod N = 1, while the other convergents do not give an r that passes the test.Also, for 101, computing the continued fraction expansion of ϕ = ϕ s /2 n = 5/2 3 = 5/8 gives the convergents {0, 1, 1/2, 2/3, 5/8} (see Appendix G for details), so that the third convergent 2/3 in the expansion can be identified as s/r and correctly gives r = 3 as the order, while the other convergents do not give an r that passes the test.
On the other hand, adjacent outcomes that have an appreciable but lower probability do not give the correct order, for example for the outcome 110 the continued fraction expansion of ϕ = 6/8 gives {0, 1, 3/4} and incorrectly gives r = 4 as the order (see Appendix G for details).If the peaks for the outcomes are not well distinguished after a fixed number of shots, this type of failure in identifying the order can be avoided in principle by adding further qubits to the control register so that the peak in the probability distribution becomes narrower and more well defined [11].It is interesting to note that from the results of Ref. [11], successfully finding the order r = 3 was not possible to achieve, as with only two bits of accuracy in the experiment the continued fractions would always fail due to the peaked outcomes of 10 (2) and 11 (3) giving the convergents of {0, 1/2} and {0, 1, 3/4}, respectively.In our case, we successfully find r = 3, from which we obtain gcd(a r/2 ±1, N ) = gcd(8±1, 21) = 3 and 7. Thus, with our demonstration, extending the number of outcome bits to three has allowed us to fully perform the quantum factoring of N = 21.

D. Verification of entanglement
The presence of entanglement between the control and work registers is known to be a requirement for the algorithm to gain any advantageous speedup over its classical counterparts in general [6,24,25].For detecting genuine multipartite entanglement around the vicinity of an ideal state |ψ , one can construct a projector-based  d) and (g), and the corresponding matrix plot of their real parts in panels (e) and (h), and imaginary parts in panels (f) and (i), respectively.We observe there is a resemblance between the ideal state and the measured states, but noise in both real and imaginary parts is notable.
witness such as the one below: where α is the square of the maximum overlap between |ψ and all biseparable states.In other words, tr Ŵψ ρ ≥ 0 for biseparable states and tr Ŵψ ρ < 0 for states with genuine multipartite entanglement in the vicinity of |ψ [26].For the ideal state after modular exponentiation (but before the inverse QFT) in both the control and work registers, α = 0.75 was found using the method described in the appendix of Ref. [26].This was implemented using the software package QUBIT4MATLAB [27].Therefore ideally the state in both registers after modular exponentiation has genuine multipartite entanglement.
In order to check whether the output state from the IBM processors is close to the ideal state and has genuine multipartite entanglement, full state tomography would normally be needed to characterize the state ρ exp in both the control and work registers.This would require 3 5 measurements, making it impractical to gather a sufficiently large data set within a meaningful time frame.However, we need not measure the full density matrix, the quantity tr(|Ψ Ψ| ρ exp ) suffices.To measure this, we can decompose ρ = |Ψ Ψ| into 293 Pauli expectations as l σ (5)  m , (10) where σ i = {I, X, Y, Z} are the usual Pauli matrices plus the identity.However, the number of measurements needed to obtain all 293 expectation values can be reduced [28].This is because the measured probabilities from a measurement of a single Pauli expectation value, i.e.ZZZZZ , can be summed in various combinations to derive other Pauli expectations values, i.e.ZIZZZ , IZZZZ , etc.The values derived are nothing but the marginalization of the measured probabilities over the outcome space of some set of qubits (see Appendix E for details).We can do the same for each term in the set of terms from the Pauli decomposition of ρ, calling it S d , forming a set of other Pauli terms that can be derived from the same counts.
Taking the union of these sets to be S u , the complement S d \S u gives the 79 terms we only need to measure (see Appendix E).We measure the 79 Pauli expectation values of the terms above with respect to the state in both registers after modular exponentiation and from this we compute/derive the 293 terms in S d and therefore tr(|Ψ Ψ| ρ exp ).The measured probabilities for each term, some of them shown in The results obviously fail to detect genuine multipartite entanglement, however, this does not mean entanglement is entirely absent.Consider the square of the maximum overlap between the ideal state |Ψ and all pure states |θ that are unentangled product states with respect to some bipartite partition (bipartition) B of the qubits, Thus, any other state |ξ for which cannot be a product state with respect to the bipartition B, implying that there is non-separability, or entanglement, across this bipartition.The above result extends to mixed states ρ ξ due to the convex sum nature of mixed quantum states [27].We compute Eq. ( 12) for all possible bipartitions of our ideal state |Ψ (see Appendix F for more details).For the experimental state ρ 7Q we find, with the exception of the bipartition B = (c 0 c 1 c 2 q 1 )(q 0 ), that it is non-separable with respect to all other bipartitions, i.e. the square of the overlap between ρ 7Q and |Ψ (∼ 0.677) is greater than the maximal square overlap between |Ψ and all product states in each of these bipartitions.Similarly for ρ 27Q , with the exception of bipartitions B = (c 0 c 1 c 2 q 1 )(q 0 ) and B = (c 0 c 1 c 2 q 0 )(q 1 ), the state is nonseparable with respect to all other bipartitions.Most notably, both ρ 7Q and ρ 27Q are non-separable with respect to the bipartition B = (c 0 c 1 c 2 )(q 0 q 1 ), which is a bipartition between the control and work registers.This implies that non-separability or entanglement is present between the registers, as required for the algorithm's speedup in general [6,24,25].Furthermore, the maximum (not necessarily global but a good proxy of it) expectation value of the operator |Ψ Ψ| for product states, is found via a greedy search algorithm [27] to be around 0.30, further asserting that indeed the qubits are entangled with each other in some way.

V. CONCLUDING REMARKS
In summary, we have implemented a compiled version of Shor's algorithm on IBM's quantum processors for the prime factorization of 21.By using relative phase shift Toffoli gates, we were able to reduce the resource demands that would have been required in the standard compiled and non-iterative construction of Shor's algorithm (with regular Toffoli gates), and still preserve its functional correctness.The use of relative phase shift Toffoli gates has also allowed us to extend the implementation in Ref. [11] to an increased resolution.Moreover, while the latter implementation used only 1 recycled qubit for the control register, in contrast to our 3 qubits, it falls one iteration short of achieving full factoring for the reasons already mentioned.It is not clear what additional resource overheads (single and two-qubit gates) would be needed in implementing another iteration in their scheme and it is likely that these overheads are what prevented the full factoring of 21 in the photonic setup used.Furthermore, we note that in principle there is no real advantage in using 3 qubits for the control register as we have done here instead of 1 qubit recycled, as in Ref. [11].However, in practice it is not possible at present to recycle qubits on the IBM processors and so we used 3 qubits instead.In future, once this capability is added, a further reduction in resources will be possible for our implementation, potentially improving the quality of the results even more.
We have verified, via state tomography, the output state in the control register for the algorithm, achieving a fidelity of around 0.70.For the verification of entanglement generated during the algorithm's operation, the resource demands of state tomography were circumvented by measuring a much reduced number of Pauli measurements to uniquely identify a quantum state [28].However, this method is quite specialized and cannot be easily generalized to larger systems.In scaling up Shor's algorithm to higher integers beyond 21 using larger quantum systems, other methods of quantum tomography can be used to characterize the performance.These include compressed sensing [29] and classical shadows [30], which give theoretical guarantees, and improved scaling in the number of Pauli measurements and classical postprocessing than standard methods.In the case when the state belongs to a class of states with certain symmetries, such as stabilizer states, only a few measurements are required for measuring the fidelity and detecting multipartite entanglement [31].However, not all entangled states are neatly housed within these wellstudied classes.Ref. [32] introduces a device-independent method for multipartite entanglement detection which scales polynomially with the system size by relaxing some constraints.Another scheme constructs witnesses that require a constant number of measurements of the system size at the cost of robustness against white noise.This provides a fast and simple procedure for entanglement detection [33].Many fundamental questions on the subjects of quantum tomography and multipartite entanglement still remain to be answered [34] and advances will help in efficiently quantifying the performance of algorithms in larger quantum processors.
Our demonstration involves a two-fold reduction of the resource count from the full circuit in Fig. 2 via the replacement of regular Toffoli gates with relative phase variants, which is an approach that is in the spirit of the NISQ era; tailoring quantum circuits to circumvent the shortcomings of noisy quantum processors.In addition, we suspect that we can further reduce the resource count through the use of the approximate QFT [35], while still maintaining a clear resolution of the peaks in the output probability distribution.A possible avenue of future research derived from what we have reported here is the investigation and identification of scenarios where one can replace Toffoli gates with relative phase Toffoli gates while preserving the functional correctness, in a wide range of algorithms including Shor's algorithm, as seen here.In the present case, whether such an approach is special to the case of N = 21 or extendable to other N is not known.Ref. [23] has performed some work in this regard, however a proper analysis and systematic composition of relative phase Toffoli gates for such purposes is still an open problem.In future, a similar approach may make possible the factorization of larger numbers with adequate accuracy in resolution of the algorithm's outcomes and their characterization.
The Margolus gates in this particular quantum circuit never encounter the basis state |101 , thus the operation of the circuit remains unchanged by the replacement of full Toffoli gates with their respective relative phase counterparts.

Appendix C: IBM Quantum Experience
The experiments in this paper were conducted on the IBM Quantum Experience ibmq toronto and ibmq casablanca processors through the software development kit Qiskit [36].Each experiment reported here was conducted on the date shown in the table below.For characterization purposes, the compiled quantum order-finding experiments were submitted in batches of 900 circuits with each circuit having 8192 measurement shots.In total, 900 × 8192 measurements were made.In choosing the qubit device mappings shown in the main paper, preference was given to the qubit pairs with relatively small cX error rates.Tables II and III show reported single qubit-error rates for ibmq toronto and ibmq casablanca respectively, where U   Qiskit's state tomography fitter uses a least-squares fitting to find the closest density matrix described by Pauli measurement results [37].On an n-qubit system, the fitter requires measurement results from executing 3 n circuits.This makes state tomography on large circuits in impractical.Thus only 30 state tomography experiments were performed for the three control register qubits and in total 3 3 × 30 × 8192 measurement were made.
In reducing the effect of noise due to final measurement errors, Qiskit recommends a measurement error mitigation approach.The approach starts off by creating circuits that each perform a measurement of the 2 n basis states.The measurement counts of the 2 n basis state measurements are put into a column vector C noisy , arranged in ascending order by the value of their measurement bitstring, i.e. 00 . . .00 is the first element, the next is 00 . . .01 and so on.The approach assumes that there is a matrix M called the calibration matrix, such that  where C ideal is a column vector of measurement counts in the absence of noise.If M is invertible then, then C noisy can transformed into C ideal by finding M −1 Qiskit [38] uses a least-squares fit to calculate an approximate M −1 by some other matrix M −1 , as in general M is not invertible, giving The entries of the column vector C mitigated correspond to the mitigated measurement counts in same order as before.The entirety of the results reported in our work make of use of this approach.
to the control register.For a given separation of the qubits into two partitions (a bipartition), e.g.(c 1 )(c 0 c 2 q 0 q 1 ), there is a pure product state |φ with respect to these partitions, i.e. no entanglement between the partitions, that maximizes the overlap squared with the ideal state.The value of the overlap squared between this product state and the ideal state, e.g.0.5, is therefore the highest value that can be obtained for an unentangled state between the partitions.Thus, if a given state has an overlap squared larger than 0.5 it must be an entangled state with respect to the partitions.The value of the maximum overlap squared changes for the different partitions chosen as it depends on the structure of the ideal state.The above results extend to mixed states across the bipartitions due to the convex sum nature of quantum states [27].
then s/r will appear as a convergent in the continued fraction expansion of ϕ.If ϕ is an approximation of s/r accurate to 2L + 1 bits, then we have |s/r − ϕ| ≤ 1/2 2L+1 .For r ≤ N ≤ 2 L , we have that 1/2 2L+1 ≤ 1/2r 2 .Therefore, since the inequality holds for the approximation ϕ, there is a classical algorithm that can compute the convergents of ϕ, and produce integers s , r such that gcd(s , r ) = 1 in O(L 3 ) operations [2].We can then check if r is the order of a and N by testing whether a r mod N = 1.Note that in our approach, ϕ = ϕ s /2 n s/r is not an approximation that is accurate to 2L + 1 bits as above, but is a further approximation of s/r depending on the resolution, i.e. the number of iterations, or alternatively qubits in the control register.Consider the following example of the final measurement outcomes from Fig. 7 in the main text, where the outcome |110 = |6 is not a peak but |101 = |5 is a peak in the outcome distribution and we have used the integer representation of the binary outcome.The former outcome gives ϕ = 6  2 3 and latter gives ϕ = 5 2 3 .Computing the continued fractions of the former gives Computing the convergents gives 0, 1, 1/2, 2/3, 5/8.Looking at the former and latter computed convergents, we note that the third convergent of the latter correctly gives r = 3 while the convergents of the former do not give the correct order when tested using a r mod N = 1.The same process can be applied to the outcome |011 = |3 , which is a peak and correctly gives r = 3.

FIG. 2 .FIG. 3 .
FIG.2.Compiled quantum order-finding routine for N = 21 and a = 4.This circuit uses five qubits in total; 3 for the control register and 2 for the work register.The above circuit determines 2 n s/r to three bits of accuracy, from which the order can be extracted.Here, up to a global phase, S = Rz( π 2 ) and T = Rz( π 4 ) are phase and π/8 gates, respectively.

FIG. 7 .
FIG. 7. Results of the complete quantum order-finding routine for N = 21 and a = 4. On each processor, the circuit was executed 8192 × 100 times with measurement error mitigation.The error bars represent 95% confidence intervals around the mean value of each histogram bin (see Appendix D).The simulator probabilities show the ideal case.

FIG. 9 .
FIG. 9. Ideal and measured density matrices after the inverse QFT, estimated via a maximum-likelihood reconstruction from measurement results in the Pauli-basis.(a) The ideal state |Ψ Ψ| (only the real parts are shown, imaginary parts are less than 0.04).(b) A matrix plot of the real part of |Ψ Ψ|.(c) A matrix plot of the imaginary part of |Ψ Ψ|.These plots are compared with the measured states ρ27Q and ρ7Q in panels (d) and (g), and the corresponding matrix plot of their real parts in panels (e) and (h), and imaginary parts in panels (f) and (i), respectively.We observe there is a resemblance between the ideal state and the measured states, but noise in both real and imaginary parts is notable.
the convergents according to Eq. (G2) gives 0, 1, 3/4.On the other hand, computing the continued fractions of the latter ϕ r |x |u s .
2 can be simplified by noting that the states |1 and |4 are the only nonzero amplitude states in the work register after Û 1 may have been applied, thus prompting us to only consider |1 → |16 and |4 → |1 .A cX gate controlled by |c 1 targeting |q 1 followed by a Fredkin gate, swapping |q 0 and |q 1 realizes this simplified Û 2 TableIVshows the cX error rates for the two processors.The dates of the experiments are given in the captions.

TABLE II .
Reported single-qubit gate errors on 16 December 2020.

TABLE III .
Reported single-qubit gate errors on 06 December 2020.