# **Quantum circuits for Toom-Cook multiplication**

Srijit Dutta, Debjyoti Bhattacharjee, and Anupam Chattopadhyay Department of Computer Science and Engineering, IIT Bombay, India School of Computer Science and Engineering, Nanyang Technological University, Singapore

(Received 29 January 2018; published 12 July 2018)

In this paper, we report efficient quantum circuits for integer multiplication using the Toom-Cook algorithm. By analyzing the recursive tree structure of the algorithm, we obtained a bound on the count of Toffoli gates and qubits. These bounds are further improved by employing reversible pebble games through uncomputing the intermediate results. The asymptotic bounds for different performance metrics of the proposed quantum circuit are superior to the prior implementations of multiplier circuits using schoolbook and Karatsuba algorithms.

DOI: 10.1103/PhysRevA.98.012311

### I. INTRODUCTION

Quantum computing has gathered significant attention by solving certain problems much faster than any known classical algorithm. In contrast to Boolean logic, quantum bits (qubits) not only represent the classical 0 and 1 states but also any complex combination or superposition of both, leading to a significant speed-up in computing. The Deutsch-Jozsa algorithm [1] and Shor's factorization algorithm [2] are well-known examples demonstrating the power of quantum computing. This capability gives rise to the bounded-error quantum polynomial time (BQP) complexity class with an open quest among computer scientists and mathematicians to establish the exact relation between BQP and other complexity classes. In order to accelerate scientific computing using the capabilities of a quantum computer, efficient quantum circuits for basic mathematical functions are needed. The efficiency of a quantum circuit is measured by lower computational space (number of qubits) and lower computational time (logical depth). For fault-tolerant, error-protected quantum circuits to implement the quantum algorithms, it is projected that a large number of physical qubits are required for every logical qubit [3]. Naturally, potential solutions to reduce the number of logical qubits contribute to the overall efficiency of the quantum circuit.

Multiplication is one of the elementary mathematical operations of arithmetic. Fast long integer arithmetic is at the very core of many computer algebra systems. In quantum computing, apart from being used as a block in itself, integer multiplication is used as a subroutine in many applications, such as Shor's integer factorization algorithm and in Newton iterations for calculating many functions like the inverse [4].

In this paper, we present a quantum implementation of the Toom-Cook multiplication algorithm [5,6], which can attain better asymptotic complexity than simple schoolbook multiplication and the Karatsuba-based integer multiplication [7]. We further improve these bounds by analyzing pebble games on complete trees.

#### II. PRIOR WORKS

The problem of multiplication in the quantum domain has been explored previously. For small numbers, naïve schoolbook multiplication works best, with a runtime complexity  $O(n^2)$  that also translates to the logical depth in a quantum circuit realization. Karatsuba multiplication, implemented in quantum circuits [7], is usually faster when the multiplicands are longer than 320-640 bits, which also provides asymptotic improvement in terms of Toffoli cost and Toffoli depth over schoolbook multiplication. However, the number of qubits required for Karatsuba-based quantum implementation is higher than schoolbook multiplication. In the realm of quantum circuits, so far, the Schönhage-Strassen method (using fast Fourier transform) and Toom-Cook multiplication algorithm are not reported, even though it is known from classical implementations that these algorithms result in better run times when the operand size is much larger. We primarily focus on the Toom-Cook multiplication in this work. As reported in our results, this leads to significant savings of all the performance metrics for an efficient quantum circuit, clearly outperforming prior works.

The family of Toom-Cook methods is an infinite set of polynomial algorithms (Toom-2.5, Toom-3, Toom-4, etc.) [8]. Instead of using the more common Toom-3 implementation, we present the work with Toom-2.5 to avoid a division by 3 required by the former. This leads to reduction in overall circuit costs, as quantum division is costlier in terms of Toffoli count and Toffoli depth than simple addition or shift operations. Most of the higher Toom implementations require a similar division by constants that are not multiples of 2. The implementation of such divisions incurs higher quantum costs and therefore we avoid them.

When moving to quantum domains, gate sets need to go beyond classical to create the *superposition* effect of the inputs. The standard universal quantum gate library that efficiently implements fault-tolerant quantum error correction codes is the Clifford+T library [9,10]. In this library, the cost of implementing a T gate is sufficiently high to customarily neglect the cost of other Clifford group gates while determining the total cost of the quantum circuit. Therefore, the number of T gates is a metric to judge the cost of a quantum circuit. Also, the number of qubits used in a quantum circuit is another important

<sup>\*</sup>debjyoti001@e.ntu.edu.sg

standard, since the current quantum technologies still struggle to achieve error-free computation for large counts of qubits. A study of the space-time trade-off can be performed [11] using these two metrics. Another metric of importance is the *T depth*. T depth is defined as the number of T stages in a quantum circuit where each such stage consists of one or more T or  $T^{\dagger}$ gates performed concurrently on separate qubits. It is important to note that an input circuit with continuous parameter gates [e.g., z rotation gate  $R_z(\theta)$ ] is decomposed using a set of discrete basis gates, typically from the Clifford+T library. The exact number of Clifford+T gates needed for such a continuous parameter gate depends on the desired accuracy, and the discrete gate set provides only an approximation. In the context of the current work, we consider T count and T depth of the circuit to be proportional to the Toffoli count and Toffoli depth, respectively, by following the Toffoli decomposition proposed in [12].

### III. TOOM-COOK MULTIPLIER

Given two large integers  $n_1$  and  $n_2$ , the Toom-Cook algorithm splits them into k smaller parts of length l. The multiplication sub-operations are then computed recursively using Toom-Cook multiplication again until we are able to apply another algorithm on it for the last stage of recursion, or until the desired multiplier is reached. The input numbers are divided into limbs of a given size, each in the form of a polynomial, and the limb size is used as a radix. Instead of multiplying the obtained polynomials directly, they are evaluated at a set of points and the values multiplied together at those points. Based on the products obtained at those points, the product polynomial is computed by interpolation. The final result is then obtained by substituting the radix.

In general, Toom-k runs in  $\Theta(c(k)n^e)$ , where n denotes input size, k is the number of parts that the input operand is decomposed into, and  $e = \log_k{(2k-1)}$ . c(k) is the time spent on auxiliary additions and multiplications by small constants. The Karatsuba algorithm [13] is a special case of Toom-Cook multiplication (Toom-2), where the input operand is split into two smaller ones. It reduces four multiplications to three and so operates at  $\Theta(n^{\log_2{3}})$ . In general, Toom-k reduces  $k^2$  multiplications to 2k-1 ordinary long multiplications [equivalent to Toom-1 with complexity  $\Theta(n^2)$ ].

## A. Implementation details

Let x and y be two n-bit numbers. To proceed with the Toom-2.5 algorithm, we first decompose x and y into two and three parts, respectively. Express  $x = x_1 2^i + x_0$  and  $y = y_2 2^{2i} + y_1 2^i + y_0$  with  $i \ge 1$ . Typically, i is chosen as  $\max\{\lfloor \frac{\lceil \log_2 x \rceil}{k} \rfloor, \lfloor \frac{\lceil \log_2 y \rceil}{k} \rfloor\}$ , where k = 2.5 in our case. We define the following four product terms:

$$P = x_0 y_0, \tag{1}$$

$$Q = (x_0 + x_1)(y_0 + y_1 + y_2), \tag{2}$$

$$R = (x_0 - x_1)(y_0 - y_1 + y_2), \tag{3}$$

$$S = x_1 y_2. \tag{4}$$

Then the product xy is evaluated as

$$xy = A2^{3i} + B2^{2i} + C2^{i} + D, (5)$$

$$A = S, (6)$$

$$B = -P + \frac{1}{2}Q + \frac{1}{2}R,\tag{7}$$

$$C = \frac{1}{2}Q - \frac{1}{2}R - S,\tag{8}$$

$$D = P. (9)$$

Note that only four multiplications are required for computation of the product. Also, each of these multiplications consists of numbers of size smaller than the original problem size (bit-width). Each smaller multiplication is between one number of bit-width n/2 and another of bit-width n/3. Since this method is applied in recurrence the second time for our analysis, we consider that the smaller limbs formed from the number which was split into three parts originally is now split into two parts and vice versa. So after two steps, we get 16 smaller problems of size n/6 each. Thus, we obtain the basic recurrence for the number of steps T(n):

$$T(n) = 16T\left(\frac{n}{6}\right). \tag{10}$$

All additions (the intermediate ones as well as the final ones) are performed by separate adders which have bounded cost.

### B. Gate count analysis

For gate count analysis, we consider only the Toffoli count required by the quantum circuit or subcircuit. This is because the other used gates [NOTS, controlled-NOTS (CNOTS)] do not contribute to the T count of the circuit, considering the Clifford+T library. The designed circuit maps  $(x,y,0,0) \mapsto (x,y,g,xy)$ , where g denotes some garbage output resulting as computation of A,B,C, and D. The product is copied after the computation and the circuit is then run backwards (uncomputed) to set the garbage outputs back to 0.

In our circuit implementation, the Cuccaro adder is used [14]. For addition of two n-bit numbers, the Cuccaro adder requires 2n-1 Toffoli gates. It is also established that the cost  $A_n$ , for an in-place adder adding two n-bit numbers, is bounded by 2n Toffoli gates.

Let  $T_{n,n}$  denote the multiplication call to a Toom-2.5 circuit for calculating product to two n-bit numbers and  $TC_n$  denote the number of Toffoli gates required for implementing  $T_{n,n}$ . First, we need to calculate P,Q,R, and S. This requires four recursive calls to  $T_{\frac{n}{2},\frac{n}{3}}$ . For calculating the intermediate sums required as input for  $T_{\frac{n}{2},\frac{n}{3}}$ , we need four n/2 bit adders and six n/3 bit adders. This also includes uncomputation of the intermediate garbage results, i.e., the qubits used for storage of intermediate results are returned to their initial states. The output of each  $T_{\frac{n}{2},\frac{n}{3}}$  is a 5n/6 bit number. Finally, for computing A,B,C, and D, four 5n/6 adders are required. As already stated earlier, in evaluation of  $T_{\frac{n}{2},\frac{n}{3}}$  we assume that the n/2 bit number is split into three parts and vice versa. By performing



FIG. 1. Recursion tree structure of the Toom-2.5 implementation.

similar analysis, we get evaluation of  $T_{\frac{n}{6},\frac{n}{6}}$ , in terms of which the recursive relation is provided:

$$TC_{n} = 16TC_{n/6} + 40A_{n/6} + 22A_{n/3} + 4A_{n/2} + 4A_{5n/6}$$

$$= 16^{\log_{6} n}TC_{1} + 40\left(A_{\frac{n}{6}} + 16A_{\frac{n}{36}} + \cdots\right) + 22\left(A_{\frac{n}{2}} + 16A_{\frac{n}{12}} + \cdots\right) + 4\left(A_{\frac{n}{3}} + 16A_{\frac{n}{23}} + \cdots\right)$$

The base case is the multiplication of two 1 bit numbers which can be done by a Toffoli gate. Therefore,  $TC_1 = 1$ . Each of the summations of the adder gate counts have  $\log_6 n$  terms. On evaluating the summations using geometric progression and doubling the cost to account for the aforementioned uncomputation, we get

 $+4(A_{\frac{5n}{n}}+16A_{\frac{5n}{n}}+\cdots).$ 

$$TC_n = 2\left(16^{\log_6 n} + 23.2n\left[\left(\frac{16}{6}\right)^{\log_6 n} - 1\right]\right)$$
 (13)

$$=2n^{\log_6 16} + 46.4n(n^{\log_6(16/6)} - 1) \le 49n^{\log_6 16}.$$
 (14)

Note that all operations used in the circuit design are implemented using only adders and shifts, without any separate multiplication and division blocks.

### C. Space-time trade-offs

The recursive nature of the problem gives rise to an inherent tree structure as shown in Fig. 1. The size of a node is representative of the problem size at that level. For example, the root level denotes the complete problem (*n*-bit multiplication). According to the recursion presented in Eq. (10), each node will have 16 children nodes denoting a smaller problem [(n/6)-bit multiplication]. For the Toom-2.5 circuit with an input of size n at any level x in the tree, there are  $16^x$  nodes of size  $n6^{-x}$ , each for a total cost of  $n(\frac{16}{6})^x$  at level x (level numbering starting at 0 from root). So, the space cost  $Q_{\text{orig}}$  of the complete tree is

$$Q_{\text{orig}} = n \sum_{0}^{N-1} \left(\frac{16}{6}\right)^{x}$$
 (15)

$$= n \frac{(16/6)^{\log_6 n} - 1}{(16/6) - 1} \tag{16}$$

$$= O[n(8/3)^{\log_6 n}] = O(n^{1 + \log_6 (8/3)})$$
 (17)

$$\approx O(n^{1.547}),\tag{18}$$

where the tree height  $N = \log_6 n$ .

The reversible pebble game [15] is a combinatorial game played on rooted directed acyclic graphs (DAGs). Each pebble represents some amount of space. The rules are similar to those used in the pebble game to model irreversible computation, except that we simply cannot remove pebbles by a reversibility constraint. There is a reverse computation for each corresponding computation performed, implying that during the game, the pebbles may still be removed but it is subject to the same conditions as applied during placing the pebbles. We use this reversible game to obtain better asymptotic bounds in the number of qubits (space) to implement the Toom-2.5 algorithm.

We want to find a level in the recursion tree such that the size of each node's subtree is approximately equal to the sum of the size of all nodes at that level chosen and above. Once all the nodes in the chosen level have been computed, we uncompute all the subtrees below it. This is performed to minimize space the size of these subtrees is chosen to be approximately equal to the remaining size of the tree above them. Let the required height be k from the leaves of the tree. The cost of all height k subtrees is

$$n\sum_{N=k}^{N-1} \left(\frac{16}{6}\right)^x.$$

Therefore, the cost of a single height k subtree is

$$\frac{n}{16^{N-k}} \sum_{N-k}^{N-1} \left(\frac{16}{6}\right)^x = \frac{n}{6^{N-k}} \sum_{0}^{k-1} \left(\frac{16}{6}\right)^x.$$

We want this to equal the cost of all nodes above the *k*th level:

$$n\sum_{0}^{N-k-1} \left(\frac{16}{6}\right)^{x} = \frac{n}{6^{N-k}} \sum_{0}^{k-1} \left(\frac{16}{6}\right)^{x}.$$
 (19)

Simplifying, we obtain a bound that  $k \leq \frac{N}{2 - \log_{16} 6}$ . This is since  $k \leqslant N$  and  $\left(\frac{16}{6}\right)^{N-k} \geqslant \frac{16^k}{6^N}$ . Using the above technique, the qubit count is now optimized and bounded by  $Q_{\text{opti}}$ :

$$Q_{\text{opti}} = O\left[n\left(\frac{8}{3}\right)^{\frac{1}{2-\log_{16}6}\log_{6}n}\right] \approx O(n^{1.404}).$$
 (20)

The time complexity of a quantum circuit is effectively equal to the depth of the circuit in terms of Toffoli gates. Each node in the computation tree shown in Fig. 1 at level k must be computed sequentially. At the kth level, the number of subtrees  $ST_k$  and corresponding depth  $D_k$  is defined as follows:

$$ST_k = 16^{(1 - \frac{\log 16}{2\log 16 - \log 6})\log_6 n},\tag{21}$$

$$D_k = \frac{n}{6^{(1 - \frac{\log 16}{2\log 16 - \log 6})\log_6 n}},$$
(22)

$$D_k = \frac{n}{6^{(1 - \frac{\log 16}{2 \log 16 - \log 6}) \log_6 n}},$$

$$ST_k * D_k = n \left(\frac{8}{3}\right)^{(1 - \frac{\log 16}{2 \log 16 - \log 6}) \log_6 n} \approx n^{1.143}.$$
(23)

The product  $ST_k * D_k$  gives an overall depth for computing the entire kth level of the recursion tree.



FIG. 2. The quantum circuit for computing integer multiplication results using the Toom-2.5 algorithm. The compute blocks are then run backwards (uncomputed) to set the garbage outputs (*g*) back to 0 (not shown in the figure).

The method proposed above is most efficient if both the numbers to be multiplied are approximately of the same bitwidth. In case one of them is much bigger than the other, it is better if the bigger number is repeatedly divided into three parts in each turn, until the smaller parts of both the numbers are roughly the same size. Following this method, the asymptotic computational complexity can be shown to be more efficient than that of the *alternating* Toom-2.5 method adopted.

The circuit of the described implementation is shown in Fig. 2. It describes the circuit for  $T_{n,n}$  that multiplies x (decomposed into  $x_0, x_1$ ) and y (decomposed into  $y_0, y_1, y_2$ ). All symbols and variables mentioned hold the same meanings as described in the analysis above. The adder, subtractor, and shifting blocks are represented as "Adder," "Sub," and "Shift," respectively. The  $T_{\frac{n}{2},\frac{n}{3}}$  blocks denote Toom-2.5 subcircuits of smaller bit-width.

### IV. RESULTS AND DISCUSSIONS

Table I presents the asymptotic results of implementation of various multiplication methods, while Table II provides the exact constants involved. The naïve multiplication method suggested in [7] allows implementation with the lowest number

TABLE I. Asymptotic performance analysis of the quantum implementation of various multiplication methods.

| Method              | QC             | TC                | TD             |
|---------------------|----------------|-------------------|----------------|
| Naïve [7]           | O(n)           | $O(n^2)$          | $O(n^2)$       |
| Naïve improved [16] | O(n)           | $O(n^2)$          | $O(n \log n)$  |
| Karatsuba [7]       | $O(n^{1.427})$ | $O(n^{\log_2 3})$ | $O(n^{1.158})$ |
| Toom-2.5            | $O(n^{1.404})$ | $O(n^{log_616})$  | $O(n^{1.143})$ |
| Const. Mult. [17]   | O(n)           | $O(n^2)$          | O(n)           |

QC: Qubit count, TC: Toffoli count, TD: Toffoli depth.

of qubits asymptotically but fares badly in terms of Toffoli count and depth.

In [16], the implementation of logarithmic depth adders has been provided. The naïve (shift-add) multiplier can be improved in depth by using the logarithmic depth adder as a submodule. The *n*-bit adder has a depth of order  $O(\log n)$ , and thus the multiplier shall have a depth of  $O(n \log n)$ . However, for both the "in-place" and "out-of-place" adders described in [16], extra ancilla are required for intermediate computation. Also, the Toffoli count is greater compared to the Cuccaro adder. Thus, the multiplier developed by extension, though optimized in depth, will have greater asymptotic Toffoli and qubit count, equal to  $O(n^2)$ . In Table I, we provide the asymptotic complexity of such a multiplier. However, in the absence of an explicit design, we are unable to provide the exact constants involved in the cost metrics and hence the naïve improved method mentioned in Table I is excluded from Table II. Toom-2.5 requires less number of qubits than Karatsuba [7]. Toom-2.5 outperforms both the naïve and Karatsuba methods in terms of Toffoli count as well as Toffoli depth, highlighting the efficiency of the proposed method.

Pavlidis *et al.* [17] presented a depth-optimized multiplier for multiplication by a constant only. Therefore, it is unfair to be directly compared with our implementation and the Karatsuba

TABLE II. Cost of quantum implementation of multiplication.

| Method            | QC                                                | TC                | TD                                                             |
|-------------------|---------------------------------------------------|-------------------|----------------------------------------------------------------|
| Naïve [7]         | 4n + 1                                            | $4n^2-3n$         | $4n^2 - 4n + 1$                                                |
| Karatsuba [7]     | $n(\frac{3}{2})^{\frac{\log_2 n}{2-\log_3 2}}$    | $42n^{\log_2 3}$  | $n(\frac{3}{2})^{\left(1-\frac{1}{2-\log_3 2}\right)\log_2 n}$ |
| Toom-2.5          | $n(\frac{8}{3})^{\frac{\log_6 n}{2-\log_{16} 6}}$ | $49n^{\log_6 16}$ | $n(\frac{8}{3})^{(1-\frac{1}{2-\log_{16}6})\log_6 n}$          |
| Const. Mult. [17] | 3n + 1                                            | 4n(n+1)           | 8 <i>n</i>                                                     |



FIG. 3. Comparison of the quantum multiplier implementations based on (a) Toffoli count, (b) Toffoli depth, and (c) number of qubits.

multiplication implementations presented in [7]. It has a Toffoli depth of 8n, a cost of 4n(n + 1), and qubit count of 3n + 1.

The Clifford+T quantum gate library has garnered much interest in the implementation of fault-tolerant quantum circuits [18]. As mentioned in [19,20], the cost of a Toffoli gate is higher compared to the NOT and CNOT gates. The Toffoli gate may be decomposed using Clifford+T gates, which makes cost metrics associated with Toffoli gates important. Therefore, Toffoli count and Toffoli depth are used as the performance metrics to begin with. The cost of mapping a Toffoli gate to the Clifford+T fault-tolerant library is upper bounded by  $7 \times T$  Toffoli count and  $3 \times T$  Toffoli depth [12]. Therefore, fault-tolerant implementation of the proposed multiplication method would have at most  $7 \times T$  Toffoli count and  $3 \times T$  Toffoli depth of the values mentioned in Table I. It is further possible to improve these values by the optimization techniques proposed in [9,12].

Figure 3(a) presents a comparison of the Toffoli count required by the various methods for variation in the bit-width of the inputs. The naïve multiplication method performs better in terms of total Toffoli cost at smaller input sizes (<300 bits) but is outperformed by the Karatsuba and Toom algorithms at higher bit-widths.

Figure 3(c) shows the variation in the qubit requirements by the different implementations across a range of input sizes. In this case the shift-and-add method (naïve) outperforms both the recursive algorithms as it increases linearly. However, this low space requirement leads to a higher depth, as demonstrated in Fig. 3(b) in a logarithmic scale. Both Toom-2.5 and the Karatsuba implementations perform much better in this respect.

We also present a bound on the CNOT counts of the considered implementations. In the proposed Toom-2.5 circuit shown in Fig. 2, CNOT gates are present in the Cuccaro adders and copy blocks. It can be seen from [14] that the number of CNOT gates in an n-bit adder can be bounded by 5n. Proceeding similarly as the Toffoli count analysis, we get an exactly similar recurrence relation as presented in the Gate Count Analysis in Sec. III. Let  $CC_n$  denote the number of CNOT gates in  $T_{n,n}$ . Also, let  $Ac_n$  denote the number of CNOT for an in-place n-bit adder:

$$CC_n = 16CC_{n/6} + 40Ac_{n/6} + 22Ac_{n/3} + 4Ac_{n/2} + 4Ac_{5n/6}$$
(24)

$$= 16^{\log_6 n} CC_1 + 40 \left( Ac_{\frac{n}{6}} + 16Ac_{\frac{n}{36}} + \cdots \right) + 22 \left( Ac_{\frac{n}{3}} + 16Ac_{\frac{n}{18}} + \cdots \right) + 4 \left( Ac_{\frac{n}{2}} + 16Ac_{\frac{n}{12}} + \cdots \right) + 4 \left( Ac_{\frac{5n}{6}} + 16Ac_{\frac{5n}{6}} + \cdots \right) + COPY_{cnot}$$
(25)

where  $COPY_{cnot}$  denotes the number of CNOTs used in the two copy blocks. However, the number of CNOT gates arising out of the copy blocks is of the order O(n) and is dominated by the terms of order  $n^{\log_6 16}$ .  $CC_1 = 0$  because the 1-bit multiplier consists of just one Toffoli gate:

$$CC_n \approx 2\left(58n\left[\left(\frac{16}{6}\right)^{\log_6 n} - 1\right]\right)$$
 (26)

$$= 116n(n^{\log_6(16/6)} - 1) \leqslant 116n^{\log_6 16}.$$
 (27)

By similar analysis, the CNOT count of the Karatsuba multiplier can be bounded by  $100n^{\log_2 3}$ . For the naïve method, controlled adders are considered as described in [7]. Each such adder has 2n CNOTs and the multiplier uses n-1 such adders. Thus, the total CNOT count is  $2n^2-2n$ . For the constant multiplier in [17], 2n CNOT gates are employed. These observations are summarized in Fig. 4.

In [20], it has been established that the T gate is at least 6 times costlier compared to the CNOT gate, which emphasizes the importance of T count and T depth. However, with increasing circuit size, the cost of CNOT may take a dominant role if we follow the analysis in terms of upper and lower bounds [19]. From that perspective, the study of overall cost is important. As we found for the case of multipliers, the CNOT count of the Toom-2.5 multiplier grows at slightly lower rate compared to that of the Karatsuba multiplier with increasing input size. Considering the fact that the Toffoli count of the Toom-2.5 multiplier already outperforms Karatsuba



FIG. 4. Variation in CNOT counts across different implementations with increasing input size.

at large input sizes, the proposed design is clearly more efficient.

### V. CONCLUSION

Designing an efficient quantum circuit with low resource requirements and faster run time is an important challenge with significant repercussions across several domains, such as scientific computing and security. In this work, we reported an efficient quantum circuit for integer multiplication based on the Toom-Cook algorithm. We provide design results and techniques for lowering the resource requirements. In terms of asymptotic complexity, the presented implementation outperforms the state-of-the-art results for multiple performance metrics.

- D. Deutsch and R. Jozsa, Proc. R. Soc. London, Ser. A 439, 553 (1992).
- [2] P. W. Shor, in Proc. of the IEEE 35th Annual Symposium on Foundations of Computer Science (IEEE, New York, 1994), pp. 124–134.
- [3] E. T. Campbell, B. M. Terhal, and C. Vuillot, Nature (London) **549**, 172 (2017).
- [4] M. Soeken, M. Roetteler, N. Wiebe, and G. De Micheli, in 2017 Design, Automation & Test in Europe Conference & Exhibition (DATE) (IEEE, New York, 2017), pp. 470– 475.
- [5] A. L. Toom, Soviet Mathematics Doklady, 3, 714 (1963).
- [6] S. A. Cook and S. O. Aanderaa, Trans. Am. Math. Soc. 142, 291 (1969).
- [7] A. Parent, M. Roetteler, and M. Mosca, 12th Conference on the Theory of Quantum Computation, Communication and Cryptography, {TQC} 2017, June 14-16, 2017, Paris, France, edited by M. M. Wilde (Schloss Dagstuhl Leibniz-Zentrum fuer Informatik, 2018), Vol. 73.
- [8] D. E. Knuth, *The Art of Computer Programming* (Pearson Education, New York, 1997), Vol. 2.

- [9] M. Amy, D. Maslov, and M. Mosca, IEEE Trans Comput.-Aided Design Integr. Circuits Syst. **33**, 1476 (2014).
- [10] A. G. Fowler, A. M. Stephens, and P. Groszkowski, Phys. Rev. A 80, 052312 (2009).
- [11] R. Wille, M. Soeken, D. M. Miller, and R. Drechsler, INTE-GRATION 47, 284 (2014).
- [12] N. Abdessaied, M. Amy, M. Soeken, and R. Drechsler, in *Proc.* of the IEEE 46th International Symposium on Multiple-Valued Logic (ISMVL) 2016 (IEEE, New York, 2016), pp. 150–155.
- [13] A. Karatsuba and Y. Ofman, Sov. Phys. Dokl. 7, 595 (1963).
- [14] S. A. Cuccaro, T. G. Draper, S. A. Kutin, and D. P. Moulton, arXiv:quant-ph/0410184.
- [15] C. H. Bennett, SIAM J. Comput. 18, 766 (1989).
- [16] T. G. Draper, S. A. Kutin, E. M. Rains, and K. M. Svore, Quantum Inf. Comput. **6**, 351 (2006).
- [17] A. Pavlidis and D. Gizopoulos, Quantum Inf. Comput. 14, 649 (2014).
- [18] Y. S. Weinstein, Phys. Rev. A 87, 032320 (2013).
- [19] D. Maslov, Quantum Inf. Comput. 16, 1096 (2016).
- [20] V. V. Shende and I. L. Markov, Quantum Inf. Comput. 9, 461 (2009).