Introduction

Information and communication technology (ICT) infrastructure, such as data centers, accounted for over 70 billion kilowatt-hours (kWh) of electricity consumption in 2016, or over 2% of the total energy consumption in the U.S.1. This share is projected to climb to as much as 20% of total electricity use, and to account for up to 5.5% of the world’s carbon emissions, by 2020. The power dissipation of individual supercomputers is reaching as high as 17,000 kW while the performance is approaching 10 exaFLOPS (EFLOPS)2. This substantial energy consumption has become a critical problem for modern society and underscores the urgent need for energy-efficient computing technologies.

Widely known for low energy dissipation and ultra-fast switching speed, Josephson junction-based superconductor logic families have been proposed and implemented to process analog and digital signals for decades. Rapid-Single-Flux-Quantum (RSFQ) logic, proposed by K. Likharev, O. Mukhanov and V. Semenov in 1985 (ref. 3), is one leading technology among many alternative superconducting electronic devices. RSFQ-based logic circuits can operate at high clock frequencies of hundreds of gigahertz with a very low switching energy on the order of 10⁻¹⁹ J. However, on-chip resistors are needed to supply a constant DC bias current to the main RSFQ circuit. This leads to static power that grows as the circuit scale expands, and makes power dissipation a disadvantage of RSFQ. Various low-power technologies, such as energy-efficient single-flux quantum (eSFQ, ERSFQ)4,5, reciprocal quantum logic (RQL)6, LR-biased RSFQ logic7 and low-voltage RSFQ (LV-RSFQ)8, have been proposed by research groups around the world to (partially) resolve the static power dissipation problem of RSFQ.

To mitigate the power consumption overhead of DC bias, the Adiabatic Quantum-Flux-Parametron (AQFP) technology has been proposed, which uses AC bias/excitation currents as both the (multi-phase) clock signal and the power supply9. AQFP circuits operate at a frequency of a few gigahertz, in between conventional CMOS technology and RSFQ logic. The major advantage of AQFP is its remarkable energy efficiency potential. A recent work10 analyzed the energy dissipation of an 8-bit AQFP adder and reported an energy dissipation of 24 k_B T per junction based on physical testing. This number indicates that AQFP technology is a promising candidate for building low-energy systems approaching Landauer’s limit11,12,13,14. Demonstrations of several AQFP implementations have been reported, including an 8-bit carry-look-ahead adder15, a 16-word by 1-bit register file16, a prototype deep learning accelerator17 and a large-scale benchmark chip consisting of 10,000 AQFP logic gates18. These results demonstrate the robustness of AQFP technology against circuit parameter variations and its potential for building very large-scale integrated circuits. Details of the operation principles of AQFP logic devices can be found in ref. 9.

The AQFP technology is promising and rapidly advancing; however, a systematic, automatic synthesis framework and detailed synthesis results on a large number of benchmark circuits are still lacking. Such a framework and results would benefit the further development of AQFP by (i) enabling automatic logic and circuit generation and (ii) illustrating the advantages and limitations of AQFP and the circuit structures that are especially suitable for it. In this paper we aim to close this gap by presenting an automatic synthesis flow and performing synthesis on 18 benchmark circuits, including 11 circuits from the ISCAS-85 benchmark suite, 6 deep-learning accelerator components, and a 32-bit RISC-V ALU. Synthesis is performed using our established standard cell library of AQFP technology (with 4-phase clock signals) and our proposed energy consumption estimation methodology, which is specifically designed for AQFP circuits and based on cell-level analog simulation. Comparison results are presented for our AQFP 10 kA/cm² standard cell library and the TSMC 12 nm FinFET, 28 nm and 40 nm CMOS cell libraries19,20,21. The results demonstrate the consistent energy benefit of AQFP technology. More specifically, AQFP technology is forecast to achieve maximum improvements (reductions) in energy consumption per clock cycle of 9,313×, 25,242× and 48,466× compared with the results using the 12 nm, 28 nm and 40 nm TSMC technologies, respectively.

Results and Discussion

In the experiments, we synthesize 18 circuits, including 11 combinational benchmark circuits from the ISCAS’85 benchmark circuit suite, 6 deep-learning accelerator components, and a 32-bit RISC-V ALU, using our AQFP standard cell library for the 10 kA/cm² niobium fabrication technology as well as three semiconductor technologies: TSMC 12 nm, 28 nm and 40 nm. The synthesis methodology for AQFP technology is novel and is discussed in detail next.

Design flow and energy estimation for AQFP

In this work, we utilize a top-down design flow for AQFP very-large-scale-integration (VLSI) circuits. This design flow proceeds from high-level synthesis through standard cell library mapping and automatic routing to back-end verification. We adopt the synthesis flow to generate the 18 combinational AQFP benchmark circuits mentioned above. In previous work22, we presented a straightforward energy estimation approach that multiplies the total Josephson-junction count of an AQFP circuit by 5 zJ. This empirical value comes from experimental results showing that the energy dissipated by each AQFP buffer gate, which uses two shunted Josephson junctions, is about 10 zJ at 5 GHz23 when fabricated using the 2.5 kA/cm² AIST standard process 2 (STP2)24. In this study we perform energy estimation in a more accurate way. We extracted the energy dissipation of each cell using the analog simulation tool Jsim25 and developed a set of energy models to estimate the energy dissipation of the benchmark circuits using the presented 10 kA/cm² AQFP cell library. These energy models describe the input-dependent energy dissipation of all gates in the cell library. Figure 1 shows the schematic of a test circuit designed for extracting the energy dissipation of an AQFP cell under test (DUT). The energy extraction flow is summarized in the following steps:

  1. The energy dissipation of a single AQFP cell is extremely small. To accurately calculate the energy dissipation of a specific cell, we first insert 4-stage buffers before and after the target cell, and calculate the total energy consumption by integrating the product of current and voltage of each excitation/clock input over one clock cycle using the following formula:

    $$E=\int_{t}^{t+T}\left(I_{x1}V_{x1}+I_{x2}V_{x2}+I_{x3}V_{x3}+I_{x4}V_{x4}\right)\,dt \tag{1}$$

    where V_xn and I_xn (n = 1 to 4) are the voltage and current of the n-th excitation line, generated by Jsim25. The summation is over the 4-phase clock signals.

  2. Because they are directly connected to the input current source and to ground, the buffers in the first and last stages (marked in black) are relatively larger than the other buffers (marked in white). Hence, we calculate the energy-delay product (EDP) of the buffers in the first stage, last stage and middle stages separately.

  3. The energy dissipation of the target cell is obtained by subtracting the energy consumption of the inserted buffers (step 2) from the total energy dissipation (step 1), as shown in the following formula. Note that we use 2 head buffers, 9 middle buffers, and 1 tail buffer, as can be observed from Fig. 1.

    $$E_{\text{DUT}}=E_{\text{entire}}-\frac{9\times \mathrm{EDP}_{\text{middle\_buffer}}+2\times \mathrm{EDP}_{\text{head\_buffer}}+\mathrm{EDP}_{\text{tail\_buffer}}}{T} \tag{2}$$
Figure 1. Schematic of a test circuit used for extracting the energy dissipation of an AQFP cell under test (DUT).
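
The extraction steps above can be sketched numerically as follows. This is a minimal illustration rather than the tooling used in this work: it assumes the simulated excitation waveforms have already been exported as sampled lists, it uses trapezoidal integration as one possible way to evaluate Eq. (1), and the function names `cycle_energy` and `dut_energy` as well as the toy waveform values are hypothetical.

```python
def cycle_energy(times, currents, voltages):
    """Approximate Eq. (1): integrate sum_n I_xn * V_xn over the sampled window.

    times    -- sample instants covering one clock cycle [t, t + T]
    currents -- one list of samples per excitation line (four for 4-phase clocking)
    voltages -- same shape as currents
    """
    # Instantaneous power summed over all excitation lines at each sample.
    power = [
        sum(i_line[k] * v_line[k] for i_line, v_line in zip(currents, voltages))
        for k in range(len(times))
    ]
    # Trapezoidal rule (an assumed integration scheme, not taken from the paper).
    return sum(
        0.5 * (power[k] + power[k + 1]) * (times[k + 1] - times[k])
        for k in range(len(times) - 1)
    )


def dut_energy(e_entire, edp_middle, edp_head, edp_tail, period,
               n_middle=9, n_head=2, n_tail=1):
    """Eq. (2): subtract the inserted buffers' contribution from the total."""
    buffers = n_middle * edp_middle + n_head * edp_head + n_tail * edp_tail
    return e_entire - buffers / period


# Toy waveforms in arbitrary units (not measured data): constant I = 2, V = 3.
times = [0.0, 0.5, 1.0]
currents = [[2.0] * 3 for _ in range(4)]
voltages = [[3.0] * 3 for _ in range(4)]

e_entire = cycle_energy(times, currents, voltages)
e_dut = dut_energy(e_entire, edp_middle=1.0, edp_head=2.0, edp_tail=3.0, period=1.0)
```

With real data, the sample lists would come from the simulator output for one full clock period, and the three EDP values from the separate buffer-only runs described in step 2.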

Energy dissipation data for all cell types in the AQFP standard cell library have been extracted using the presented method. Table 1 presents the extracted energy dissipation with respect to different data input patterns and different clock rates (frequencies).

Table 1 Energy dissipation of buffer, splitter, AND, OR and Majority cells with different input data and clock frequencies.

With the developed energy models, we are able to estimate the energy consumption performance of various AQFP benchmark circuits using a top-down synthesis flow. Figure 2 illustrates the details of the synthesis flow.

Figure 2. A top-down synthesis flow used for energy estimation of AQFP circuits.

Logic synthesis in the VLSI design flow converts a high-level description of a design into an optimized gate-level representation. The open-source synthesis tool ‘Yosys’26, originally developed for CMOS, is used to synthesize circuits described at the behavioral level and to perform technology mapping to our AQFP cell library, which is written in the Liberty library format (.lib). Synthesized circuits are described as structural and-or-inverter (AOI)-based netlists. Since the AC current serves as the clock signal in AQFP circuits, extra AQFP buffers are required to ensure that the dataflow is synchronized at all logic levels of the circuit. To make a fair energy estimation, an AQFP interpreter is developed to post-process the CMOS logic-based netlist after synthesis. This interpreter creates an AQFP-specific netlist, which is later mapped to our energy models to generate statistics files for energy analysis.
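
To make the buffer-insertion requirement concrete, the following is a minimal sketch of path balancing on a toy netlist. It is not the AQFP interpreter described above: the dictionary-based netlist representation and the function names `logic_levels` and `balancing_buffers` are hypothetical, splitter insertion is ignored, and alignment of primary outputs is not handled.

```python
def logic_levels(netlist, primary_inputs):
    """Return the logic level of every node; primary inputs sit at level 0.

    netlist maps each gate name to the list of names driving its inputs.
    """
    levels = {name: 0 for name in primary_inputs}

    def level_of(node):
        # A gate sits one level after its deepest fan-in.
        if node not in levels:
            levels[node] = 1 + max(level_of(src) for src in netlist[node])
        return levels[node]

    for gate in netlist:
        level_of(gate)
    return levels


def balancing_buffers(netlist, primary_inputs):
    """Count the buffers needed on each connection that skips logic levels."""
    levels = logic_levels(netlist, primary_inputs)
    needed = {}
    for gate, fanins in netlist.items():
        for src in fanins:
            gap = levels[gate] - levels[src] - 1
            if gap > 0:
                needed[(src, gate)] = gap
    return needed


# Toy example: g1 = f(a, b), g2 = f(g1, c). Input c must wait one level for g1.
netlist = {"g1": ["a", "b"], "g2": ["g1", "c"]}
buffers = balancing_buffers(netlist, ["a", "b", "c"])
```

In this toy case a single buffer is required on the connection from `c` to `g2`, so that both inputs of `g2` arrive in the same clock phase.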

Synthesis and energy estimation for semiconductor technologies

For the three representative semiconductor technologies, we adopt the synthesis and power/energy estimation framework described in ref. 27. We first synthesize a benchmark circuit using the corresponding standard cell library and obtain the synthesized netlist in Verilog format (named testbench.v) together with a standard delay format (.sdf) file for a gate-level simulator, e.g., ModelSim or NC Verilog. For the technology mapping step, we set the target delay of each benchmark circuit to be 30% more than the minimum delay at the given power supply level. A forward switching activity interchange format (.saif) file, which contains state- and path-dependent information of all standard cells, is generated. Meanwhile, based on an input benchmark.v file, which specifies the average switching activities at the primary inputs of a synthesized circuit, another forward .saif file is generated for the circuit; it sets the primary input activities and tells the gate-level simulator which nets in the netlist should be monitored for switching activity. The gate-level simulator determines the switching activities at all nets in the netlist and logs them in a backward .saif file. The power/energy analysis tool, e.g., Power Compiler, uses this backward .saif file and the power parameters in the standard cell libraries to report accurate power/energy consumption results. The overall synthesis and power/energy estimation flow is depicted in Fig. 3. The TSMC 12 nm and 40 nm libraries are synthesized at supply voltage Vdd = 0.81 V, whereas the 28 nm library is synthesized at Vdd = 0.72 V. Unlike the previous work27, wire capacitances are accounted for in all cases for more accurate results.

Figure 3. Synthesis and power/energy estimation framework for TSMC 12 nm, 28 nm and 40 nm semiconductor technologies.

Comparison results and discussions

Table 2 summarizes the power, delay and area of the 18 synthesized benchmark circuits using the TSMC 12 nm FinFET, 28 nm and 40 nm CMOS libraries. Table 3 summarizes the energy dissipation results for AQFP and the three semiconductor technologies in terms of energy per clock cycle (EPC) and energy-delay product (EDP). Comparisons of EPC and EDP between semiconductor and superconductor technologies are further illustrated in Figs 4 and 5. For AQFP circuits, we do not report the end-to-end delay because each gate in AQFP is essentially a pipeline stage, so the end-to-end delay differs greatly from the initiation interval of the input instructions. The initiation interval equals the clock period and is used for the EDP calculation of AQFP circuits. Buffers and splitters are accounted for in the AQFP synthesis results. From the comparison results, one can conclude that the AQFP circuits consistently outperform their semiconductor counterparts in terms of both EPC and EDP. The maximum gains in EPC and EDP are 4.84 × 10⁴ and 1.18 × 10⁵, respectively, compared with the 40 nm TSMC library. The EDP gains are in general higher than the EPC gains because the clock frequency of AQFP can reach as high as 5 GHz, which is higher than that of the CMOS circuits and systems.
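
The two metrics can be illustrated with a short sketch. It assumes that EPC is obtained as average power multiplied by the clock period (an assumption, since the exact formula is not spelled out here), that the CMOS EDP uses the end-to-end delay, and that the AQFP EDP uses the clock period as stated above. All function names and numeric values are hypothetical and chosen only to keep the arithmetic exact.

```python
def energy_per_cycle(power, clock_period):
    """EPC assumed to be average power times clock period."""
    return power * clock_period


def cmos_edp(epc, end_to_end_delay):
    """EDP for a CMOS circuit, using its end-to-end delay."""
    return epc * end_to_end_delay


def aqfp_edp(epc, clock_period):
    """EDP for an AQFP circuit, using the interval between inputs (one clock period)."""
    return epc * clock_period


def gain(reference, candidate):
    """Improvement factor of the candidate over the reference (larger is better)."""
    return reference / candidate


# Toy values in arbitrary units (not results from Tables 2 and 3).
cmos_epc = energy_per_cycle(power=4.0, clock_period=0.5)
cmos = cmos_edp(cmos_epc, end_to_end_delay=0.25)

aqfp_epc = 0.5
aqfp = aqfp_edp(aqfp_epc, clock_period=0.125)

epc_gain = gain(cmos_epc, aqfp_epc)
edp_gain = gain(cmos, aqfp)
```

In this toy case the EDP gain is twice the EPC gain, because the AQFP clock period is half the CMOS delay; this mirrors the qualitative observation that a higher AQFP clock frequency pushes the EDP gain above the EPC gain.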

Table 2 Summary of benchmark synthesis results using different semiconductor (FinFET/CMOS) logic families, in terms of power, delay and area.
Table 3 Summary of benchmark synthesis results using different technologies of semiconductor (FinFET/CMOS) and superconductor (AQFP) logic families.
Figure 4. Comparison of energy per clock cycle on different benchmark circuits using the TSMC 12 nm, 28 nm and 40 nm processes and the 10 kA/cm² AQFP technology.

Figure 5. Comparison of energy-delay product (EDP) on different benchmark circuits using the TSMC 12 nm, 28 nm and 40 nm processes and the 10 kA/cm² AQFP technology.

One common concern about superconducting technology and applications is the energy overhead for cooling. It is widely estimated that the cooling energy is about 400 times the energy dissipation of the superconducting circuits themselves (ref. 28). This value depends on the environmental temperature; for space applications, the cooling energy can be significantly reduced. Even when the cooling energy is accounted for, AQFP technology still enables around two orders of magnitude reduction (improvement) in energy dissipation or EDP compared with its semiconductor counterparts. This is a unique characteristic of AQFP and cannot be achieved using RSFQ logic, which requires DC bias currents. These results demonstrate the potential of AQFP technology and applications for large-scale, high-performance, and energy-efficient computation.
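
As a rough check of the two-orders-of-magnitude statement, assume the cooling overhead simply multiplies the AQFP energy by a factor of 400 and take the maximum EPC and EDP gains over the 40 nm library reported above. This is back-of-the-envelope arithmetic under that assumption, not an additional result:

$$\frac{4.84\times 10^{4}}{400}\approx 121,\qquad \frac{1.18\times 10^{5}}{400}=295$$

Both ratios are on the order of 10², consistent with the estimate for the best-case benchmarks.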

Another concern about superconducting circuits is the overhead of inserted buffers and splitters. Especially for large-scale AQFP circuits, will the inserted buffers and splitters account for a significant portion of the Josephson junction (JJ) count and energy dissipation? We illustrate in Table 4 and Fig. 6 the proportion of different cells, including splitters and buffers. One can observe that splitters account for only a very small portion of the total JJ count. This is because AQFP supports up to 1-to-4 splitters, which is more flexible than the 1-to-2 splitters in RSFQ. Buffers, on the other hand, account for a larger portion of the JJ count in smaller circuit benchmarks than in larger ones. The reason is that larger circuits have more paths and are more delay-balanced than smaller ones. The results demonstrate that the overhead of buffers will be reduced for larger-scale circuits, which is good news for the future development of superconducting technology and applications. For self-path-balanced circuits such as the approximate parallel counters (APCs) and bitonic sorters, which are commonly used in stochastic-computing-based deep-learning accelerators17, the total JJ count can be further optimized. In addition, the experimental results for APCs show that AQFP circuits can benefit from majority-based logic synthesis, as three-input majority gates use the same JJ resources as two-input AND/OR gates in AQFP.
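
The JJ bookkeeping behind such a breakdown can be sketched as below. The per-cell junction counts are inferred from the cell descriptions in the Methods section (a buffer contains two junctions; majority, AND and OR gates are each built from three buffer-like cells) and assume that branches and mergers add no junctions. The cell counts in the example are made up, and the names `JJ_PER_CELL` and `jj_breakdown` are hypothetical.

```python
# Junctions per cell, inferred from the cell structures described in Methods.
# Assumption: branch/merger wiring contributes no Josephson junctions.
JJ_PER_CELL = {
    "buffer": 2,
    "inverter": 2,
    "splitter": 2,
    "and": 6,
    "or": 6,
    "majority": 6,
}


def jj_breakdown(cell_counts):
    """Return the total JJ count and each cell type's share of it."""
    jj = {cell: n * JJ_PER_CELL[cell] for cell, n in cell_counts.items()}
    total = sum(jj.values())
    shares = {cell: count / total for cell, count in jj.items()}
    return total, shares


# Made-up cell counts for illustration (not taken from Table 4).
cell_counts = {"buffer": 10, "splitter": 2, "and": 1, "majority": 1}
total_jj, shares = jj_breakdown(cell_counts)

# Earlier rule of thumb from previous work: about 5 zJ per junction
# (measured at 5 GHz on the 2.5 kA/cm2 process).
legacy_energy_zj = total_jj * 5
```

Here the ten buffers contribute 20 of 36 junctions, which shows how quickly path-balancing buffers can dominate the junction count of a small circuit.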

Table 4 Josephson-junction counts and proportion of different cells in 18 benchmark circuits using AQFP technology.
Figure 6. Proportion of inserted buffers with different total cell counts in 11 benchmark circuits using AQFP technology.

We further estimated the chip area required for the synthesized circuits using the 10 kA/cm² superconducting process with a minimum wire width of 50 nm. Because superconductor fabrication is still at an early development stage, the integration density is about 4× to 26× lower than that of the TSMC 40 nm CMOS technology. In the future, this gap can be further reduced by introducing multi-layer techniques29 and the transformer-free AQFP technique30.

In a nutshell, the synthesis results of AQFP circuits are highly promising for energy-efficient and high-performance computing. As AQFP fabrication technology advances and matures, we anticipate broader applications ranging from space applications to large-scale computing facilities such as data centers.

Methods

AQFP standard cell library

The basic logic structure of AQFP circuits is a buffer consisting of a double-Josephson-junction SQUID31, as shown in the shaded area of Fig. 7(b). An AQFP logic gate is basically driven by AC-power, which serves as both excitation current and power supply.

Figure 7. (a) Symbol of an AQFP majority gate, which consists of 3 buffers and a 3-to-1 merger (branch); (b) Junction-level schematic of an AQFP majority gate.

As shown in the shaded area of Fig. 7(b), excitation fluxes are applied to the superconducting loops via inductors L1, L2, Lx1 and Lx2 by applying the excitation current Ix, which is usually on the order of hundreds of microamperes (e.g. ±800 μA). A single flux quantum is stored in either the left or the right loop, depending on the small input current Iin, which has a typical value of several microamperes (e.g. ±5 μA). As a result, the device can work as a buffer cell, and the logic state can be represented by the direction of the output current Iout. In this way, the presence of a flux quantum in either the left or the right loop can be encoded as logic ‘1’ or ‘0’.

Furthermore, the AQFP inverter and constant cells (constant ‘1’ or ‘0’ cells) are derived from the AQFP buffer. The AQFP inverter is designed by negating the coupling coefficient of the output transformer in an AQFP buffer, whereas the AQFP constant gates are created by introducing asymmetry in the inductors for the excitation fluxes in the standard AQFP buffer design. The key characteristic of AQFP logic is that the different types of gates are all built from the AQFP buffer. This characteristic offers an effective design methodology for a standard cell library and ensures robustness with respect to circuit parameters, as long as the AQFP buffer is carefully designed in terms of the symbolic view and the physical layout.

With the presented basic building blocks (buffer, inverter, constant ‘0’ and constant ‘1’ gates), an AQFP standard cell library can be built effectively using the minimalist design approach15, i.e., designing more complicated gates in a bottom-up manner. For instance, majority gates are designed by merging the outputs of three buffers through a 3-to-1 branch, as shown in Fig. 7. More examples, namely the AND gate, the NAND gate and the splitter, are presented in Fig. 8. As an example, the AQFP AND gate is implemented like the majority gate, with one of the three AQFP buffers replaced by a constant ‘0’ gate.
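
The majority-based construction can be checked with a small Boolean model. This sketch covers only the logic function, not the circuit behavior. The AND composition follows the description above, the NAND composition follows the two-inverter, constant ‘1’ structure listed in the Fig. 8 caption, and the OR composition (majority with a constant ‘1’) is the standard Boolean identity rather than something stated in the text. All function names are illustrative.

```python
from itertools import product


def majority(a, b, c):
    """Three-input majority: 1 when at least two inputs are 1."""
    return int(a + b + c >= 2)


def inverter(a):
    """Logical NOT on a 0/1 value."""
    return 1 - a


def and_gate(a, b):
    """AND as a majority gate with one input tied to constant 0."""
    return majority(a, b, 0)


def or_gate(a, b):
    """OR as a majority gate with one input tied to constant 1."""
    return majority(a, b, 1)


def nand_gate(a, b):
    """NAND as a majority of two inverted inputs and a constant 1."""
    return majority(inverter(a), inverter(b), 1)


# Truth table: (a, b) -> (AND, OR, NAND)
truth = {
    (a, b): (and_gate(a, b), or_gate(a, b), nand_gate(a, b))
    for a, b in product((0, 1), repeat=2)
}
```

Because each of these gates is a three-input majority with one input fixed, they all occupy the same number of buffer-like cells, which is why majority-based synthesis costs no extra junctions.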

Figure 8. (a) Symbol of an AQFP AND gate, which consists of 2 buffers, a constant ‘0’ and a 3-to-1 merger (branch); (b) Symbol of an AQFP NAND gate, which consists of 2 inverters, a constant ‘1’ and a 3-to-1 merger (branch); (c) Symbol of an AQFP 1-to-3 splitter, which consists of a buffer and a 1-to-3 branch.

With our proposed minimalist design methodology, standard cell libraries have been built for two major 10 kA/cm² niobium processes, the high-speed standard process (HSTP)32 and the MIT-LL SFQ process33. Both cell libraries contain the same basic cells: BUFFER, INVERTER, AND, OR, MAJORITY, SPLITTER and off-chip INTERFACE, and they are designed with optimized circuit parameters to achieve the best performance.

Unlike in conventional CMOS technology, both combinational and sequential logic cells in AQFP cell libraries are driven by AC power. The AC power also serves as the clock signal, synchronizing the outputs of all gates in the same clock phase. As a result, data propagation in AQFP circuits is achieved by exploiting the overlap of clock signals in neighboring phases. Figure 9 shows (a) the meander structure of a typical clocking scheme of AQFP circuits and (b) an illustration of how data propagates between clock phases. The hardware description language (HDL) SystemVerilog34 is employed to build logical models for each individual logic cell, which are further used for logic synthesis, timing analysis, and circuit verification35. These HDL models specify the input/output pins, logic functions, timing parameters and fabrication process. The presented HDL models are parameterized and can be easily modified for different fabrication technologies and low-level circuit parameters.
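
The timing consequence of multi-phase clocking can be sketched as follows. The sketch assumes that each logic level occupies exactly one clock phase, so that data advances by one clock period divided by the number of phases per level; this is a simplifying assumption for illustration. The function name `pipeline_timing` and the example numbers are hypothetical.

```python
def pipeline_timing(logic_levels, clock_hz, phases=4):
    """Estimate latency and input interval of a phase-clocked pipeline.

    Assumes one logic level per clock phase, so each level adds
    (clock period / phases) of latency.
    """
    period = 1.0 / clock_hz
    return {
        # A new input can be accepted every clock period.
        "input_interval": period,
        # End-to-end latency grows linearly with the number of logic levels.
        "latency": logic_levels * period / phases,
    }


# Toy numbers chosen to keep the arithmetic exact (not a real design point).
timing = pipeline_timing(logic_levels=8, clock_hz=2.0)
```

With eight logic levels and four phases, the latency is two clock periods while a new input is still accepted every period, which illustrates why the end-to-end delay and the input interval differ for deeply pipelined AQFP circuits.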

Figure 9. (a) A typical meander structure of AQFP circuits using a 4-phase clocking scheme; (b) Illustration of data propagation between clock phases.