Fundamental energy cost of finite-time parallelizable computing

The fundamental energy cost of irreversible computing is given by the Landauer bound of $kT\ln 2$ per bit, where k is the Boltzmann constant and T is the temperature in kelvin. However, this limit is only achievable for infinite-time processes. We here determine the fundamental energy cost of finite-time parallelizable computing within the framework of nonequilibrium thermodynamics. We apply these results to quantify the energetic advantage of parallel computing over serial computing. We find that the energy cost per operation of a parallel computer can be kept close to the Landauer limit even for large problem sizes, whereas that of a serial computer fundamentally diverges. We analyze, in particular, the effects of different degrees of parallelization and amounts of overhead, as well as the influence of non-ideal electronic hardware. We further discuss their implications in the context of current technology. Our findings provide a physical basis for the design of energy-efficient computers.

The energy cost of partially parallelizable problems (see Eq. (7) in the main text) is derived from three observations: (i) the problem can be split into a serial and a parallel part, $N = N_s + N_p = sN + pN$; (ii) the total duration is correspondingly $T = T_s + T_p$, where the serial and parallel time allocations, $T_s$ and $T_p$, can be freely chosen subject to the fixed total time; and (iii) since we are interested in the optimal energy cost, the cost function must be minimized over $T_p$ under this time constraint. The computation times for a single algorithmic operation in the serial and parallel parts, $\tau_s$ and $\tau_p$, are thus related to the time allocations by
$$T_s = \tau_s N_s \quad\text{and}\quad T_p = \frac{\tau_p N_p}{n(N_p)}. \tag{S1}$$
The total energy cost follows from Eq. (1) as
$$W^{\mathrm{com}}_{\mathrm{tot}} = N kT\ln 2 + \frac{a}{\tau_s} N_s + \frac{a}{\tau_p} N_p = N kT\ln 2 + \frac{s^2 N^2 a}{T_s} + \frac{p^2 N^2 a}{T_p\, n(N_p)}. \tag{S2}$$
Minimizing with respect to $T_p$ using (iii), with $T_s = T - T_p$, we obtain for all $n(N_p)$
$$T_p = \frac{p\,T}{p + s\sqrt{n(N_p)}}. \tag{S3}$$
Inserting this result into (S2), we then find
$$W^{\mathrm{com}}_{\mathrm{tot}} = N kT\ln 2 + \frac{a N^2}{T}\left(s + \frac{p}{\sqrt{n(N_p)}}\right)^2. \tag{S4}$$
Equation (7) eventually follows with $n(N_p) = bN_p$.
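As a numerical sanity check of the optimization over $T_p$, the following minimal Python sketch compares a brute-force grid search with the closed-form optimum obtained by setting the derivative of the cost to zero. It assumes a finite-time cost of $a/\tau$ per operation and works in units where $kT\ln 2 = 1$ and $a = 1$; the function names and the example values ($s = 0.1$, $N = 1000$) are hypothetical and chosen only for illustration.

```python
import math

def total_cost(T_p, T, N, s, n, a=1.0, landauer=1.0):
    """Energy cost for a given parallel time allocation T_p (0 < T_p < T)."""
    p = 1.0 - s
    T_s = T - T_p                            # remaining time goes to the serial part
    serial = a * (s * N) ** 2 / T_s          # N_s operations at a / tau_s each
    parallel = a * (p * N) ** 2 / (T_p * n)  # N_p operations at a / tau_p each
    return N * landauer + serial + parallel

def optimal_T_p(T, s, n):
    """Closed-form minimizer of total_cost over T_p (assumed derivation)."""
    p = 1.0 - s
    return p * T / (p + s * math.sqrt(n))

def optimal_cost(T, N, s, n, a=1.0, landauer=1.0):
    """Cost at the optimal time allocation."""
    p = 1.0 - s
    return N * landauer + a * N ** 2 / T * (s + p / math.sqrt(n)) ** 2

# Hypothetical example: 10% serial fraction, one processor per parallel operation (b = 1).
T, N, s = 1.0, 1000.0, 0.1
n = (1.0 - s) * N
steps = 100_000
grid = [T * i / steps for i in range(1, steps)]
T_p_grid = min(grid, key=lambda t: total_cost(t, T, N, s, n))
T_p_star = optimal_T_p(T, s, n)
```

In this example, the grid minimum and the closed-form value agree to within the grid spacing, and the parallel part receives only about 23% of the total time although it contains 90% of the operations.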

S2. PARALLELIZATION OVERHEAD
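There are many possible kinds of overhead. In the main text, we consider a simple form that is linear in the number of processors and thus linear in the problem size for an ideal parallel computer. For simplicity, we assume that all available processors are used, so that $n(N)$ is fixed. The overhead is accordingly taken into account by modulating the computation speed. With $N_g = N + N_{\mathrm{ove}}(n)$ operations in total and an operation time $\tau_g$, the general time needed to solve the problem including the overhead is
$$T_g = \frac{N_g\,\tau_g}{n(N)} = \frac{[N + N_{\mathrm{ove}}(n)]\,\tau_g}{n(N)}. \tag{S5}$$
The given time constraint is $T_g = T = N\tau_p/n(N)$, where $\tau_p = T n(N)/N$ is the operation time of the overhead-free case. We thus have
$$\tau_g = \frac{\tau_p}{1 + N_{\mathrm{ove}}(n)/N}. \tag{S6}$$
For the overhead-limited bound of finite-time parallel computing, the total energy cost, normalized by the problem size, is finally
$$\frac{W^{\mathrm{par}}_{\mathrm{tot}}}{N} = \left(1 + \frac{N_{\mathrm{ove}}(n)}{N}\right) kT\ln 2 + \frac{a}{\tau_p}\left(1 + \frac{N_{\mathrm{ove}}(n)}{N}\right)^2. \tag{S7}$$
Equation (S7) holds for any overhead function $N_{\mathrm{ove}}(n)$.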
The precise form and scaling of parallel overhead depend on the nature of the problem being considered, as well as on its exact implementation. The overhead costs of serial and parallel implementations of various scientific programs have been investigated in Ref. 1. Different scaling behaviors of the overhead function were evaluated for the following examples: (i) the Rabin-Miller test, which checks whether or not a number is prime, corresponds to $N_{\mathrm{ove}}(n) \propto n$ (Fig. S1, green line); (ii) a massively parallel simulation (using up to 250 thousand processors) of fluid flow and mass transport using the lattice-Boltzmann method may be approximated by $N_{\mathrm{ove}}(n) \propto n\ln(n)$ (Fig. S1, red line); (iii) the lower-upper (LU) decomposition of an $n \times n$ Pascal matrix scales as $N_{\mathrm{ove}}(n) \propto n^2$ (Fig. S1, brown line). According to our model of parallel computing with overhead (S7), any overhead that behaves better than $N_{\mathrm{ove}}(n) \propto n^{3/2}$ (Fig. S1, dotted purple line) results in an energetic benefit for parallel computing compared to serial computing. Note, however, that modern-day (non-ideal) electronic computers have a worse finite-time scaling (see discussion in S4), meaning that for electronic computers any overhead that behaves better than $N_{\mathrm{ove}}(n) \propto n^{5/3}$ results in a benefit for parallel computing compared to serial computing.
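To make the comparison between overhead scalings concrete, here is a minimal Python sketch of the per-operation cost with overhead. It assumes that the $N + N_{\mathrm{ove}}(n)$ operations must finish in the same total time as the overhead-free problem, so that each operation is sped up by the factor $1 + N_{\mathrm{ove}}(n)/N$ and costs $a/\tau$ with the shortened $\tau$. Units are chosen such that $kT\ln 2 = 1$ and $a = 1$; the unit prefactors of the overhead functions and all function names are hypothetical.

```python
import math

def serial_cost_per_op(N, total_time, a=1.0, landauer=1.0):
    """Ideal serial computer: all N operations share the total time."""
    return landauer + a * N / total_time

def parallel_cost_per_op(N, total_time, overhead, b=1.0, a=1.0, landauer=1.0):
    """Parallel computer with n = b*N processors and overhead N_ove(n)."""
    n = b * N
    tau_p = total_time * n / N        # overhead-free time per operation
    ratio = 1.0 + overhead(n) / N     # (N + N_ove) / N
    return ratio * landauer + a * ratio ** 2 / tau_p

# Overhead scalings named in the text, with assumed unit prefactors.
overheads = {
    "none": lambda n: 0.0,
    "n": lambda n: n,
    "n ln n": lambda n: n * math.log(n),
    "n^2": lambda n: n ** 2,
}

N, total_time = 100, 1.0
costs = {name: parallel_cost_per_op(N, total_time, f) for name, f in overheads.items()}
serial = serial_cost_per_op(N, total_time)
```

For these example values, a linear overhead keeps the parallel cost per operation well below the serial one, whereas the quadratic overhead makes the parallel computer more expensive than the serial computer.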

S3. MORE OPERATIONS PER PARALLEL PROCESSOR
In the main text, we have defined the number of processors for a parallel computer as $n(N) = bN$ (with $b \in (0, 1]$) and set $b = 1$ throughout. The constant $b$ determines the number of operations performed by each processor, $\nu = 1/b$. To illustrate its effect on the energy consumption, we plot the energy consumption per operation for parallel computers whose processors each perform 1, 1000 and 1000000 operations, that is, $b = 1$, $b = 0.001$ and $b = 0.000001$, respectively, and compare them with a serial processor (Fig. S2). Decreasing $b$ increases the work per operation of the parallel computer by a constant factor, similar to a linear parallelization overhead (see main text Fig. 4b).
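The constant-factor effect of $b$ can be illustrated with a short Python sketch. It assumes that each processor performs $1/b$ operations within the total time, so that the time per operation is $b$ times the total time and the dissipative cost per operation is $a/(b \times \text{total time})$. The value of `a`, the temperature of 300 K and the function names are hypothetical.

```python
import math

BOLTZMANN = 1.380649e-23  # J/K, exact SI value

def landauer_bound(temperature):
    """Landauer bound kT ln 2 in joules per bit."""
    return BOLTZMANN * temperature * math.log(2)

def parallel_excess_per_op(b, total_time, a):
    """Dissipation above the Landauer bound when each processor does 1/b operations."""
    return a / (b * total_time)

def serial_excess_per_op(N, total_time, a):
    """Dissipation above the Landauer bound for a single serial processor."""
    return a * N / total_time

a = 1e-21          # assumed dissipation constant in J*s (illustrative only)
total_time = 1.0   # seconds
for b in (1.0, 1e-3, 1e-6):
    cost = landauer_bound(300.0) + parallel_excess_per_op(b, total_time, a)
    print(f"b = {b:g}: {cost:.3e} J per operation")
```

The excess over the Landauer bound grows as $1/b$ but does not depend on the problem size, and for $b = 1/N$ it coincides with the serial value.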

S4. NON-IDEAL COMPUTER HARDWARE
So far, we have focused on optimal computer hardware, but real computers have additional costs. First, we have considered the optimal $1/T$ order, which implies a linear increase in energy dissipation with operation frequency $f_{\mathrm{op}}$. However, for today's electronic computers, the dynamic energy dissipation per computation grows approximately as $W_{\mathrm{dyn}}(V, f_{\mathrm{op}}) = \gamma V^2$ [5], where $\gamma$ is a circuit-specific constant that we assume to be part of our constant $a$ and $V$ is the supply voltage. Further costs are incurred by leakage currents, $W_{\mathrm{lea}} = \alpha V T$ [5]. The overall energy dissipation for non-ideal hardware and $N$ computations is hence
$$W_{\mathrm{non}}(V, f_{\mathrm{op}}) = W_{\mathrm{dyn}}(V, f_{\mathrm{op}})\,N + W_{\mathrm{lea}}(V) + W_{\mathrm{pro}} = \gamma V^2 N + \alpha V T + \beta.$$
Because the supply voltage depends linearly on the frequency [5], we substitute $V = \mu f_{\mathrm{op}}$ and absorb $\gamma$ and $\mu$ into $a$, giving an overall energy dissipation of $W_{\mathrm{non}}(V, f_{\mathrm{op}}) = a f_{\mathrm{op}}^2 N + \alpha f_{\mathrm{op}} T$. The energetic cost of the serial computer, which operates at $f_{\mathrm{op}} = N/T$, then becomes
$$\frac{W^{\mathrm{ser}}_{\mathrm{tot}}}{N} = kT\ln 2 + \frac{a N^2}{T^2} + \alpha. \tag{S8}$$
For a parallel computer without overhead, each core has a constant operation frequency of $f_{\mathrm{par}} = 1/(bT)$. The dynamic energy consumption per operation is therefore constant in $n$. Further, the leakage currents dissipate additional energy for each core that is added, so that the leakage energy scales linearly with the number of cores, $n W_{\mathrm{lea}}(f_{\mathrm{op}}) = n\alpha f_{\mathrm{op}} T$. In addition, provisioning work $W_{\mathrm{pro}} = \beta$ might be necessary for a parallel computer [5]. As a result, the energetic cost of parallel computers becomes
$$\frac{W^{\mathrm{par}}_{\mathrm{tot}}}{N} = kT\ln 2 + \frac{a}{b^2 T^2} + \alpha + \frac{\beta}{N}. \tag{S9}$$
Interestingly, the costs incurred by the leakage currents are the same for both the serial (S8) and the parallel (S9) case, because the supply voltage on which these currents depend scales linearly with the operation frequency [5]. These costs are also independent of the total computation time $T$, because a change in $T$ is accompanied by a corresponding change of the operation frequencies $f_{\mathrm{op}}$.
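The statement that leakage costs per operation coincide for the serial and the parallel case can be checked with a small Python sketch. It assumes the frequency-dependent model quoted above, with dynamic energy $a f_{\mathrm{op}}^2$ per operation and leakage energy $\alpha f_{\mathrm{op}} T$ per core, and it omits the Landauer term, which is identical in both cases. All numerical values and function names are hypothetical.

```python
import math

def nonideal_energy(ops_per_core, cores, f_op, total_time, a, alpha, beta=0.0):
    """Return (dynamic, leakage, provisioning) energies for non-ideal hardware."""
    dynamic = a * f_op ** 2 * ops_per_core * cores  # a * f^2 per operation
    leakage = cores * alpha * f_op * total_time     # alpha * f * T per core
    return dynamic, leakage, beta

def serial_per_op(N, total_time, a, alpha):
    """One core performing all N operations at frequency N / T."""
    f_op = N / total_time
    dyn, leak, _ = nonideal_energy(N, 1, f_op, total_time, a, alpha)
    return dyn / N, leak / N

def parallel_per_op(N, total_time, a, alpha, beta, b=1.0):
    """n = b*N cores, each performing 1/b operations at frequency 1 / (b*T)."""
    n = b * N
    f_op = 1.0 / (b * total_time)
    dyn, leak, pro = nonideal_energy(1.0 / b, n, f_op, total_time, a, alpha, beta)
    return dyn / N, leak / N, pro / N

# Hypothetical parameter values, in arbitrary units.
ser_dyn, ser_leak = serial_per_op(N=1000, total_time=2.0, a=1.0, alpha=0.5)
par_dyn, par_leak, par_pro = parallel_per_op(N=1000, total_time=2.0, a=1.0, alpha=0.5, beta=10.0)
```

With these values, the leakage cost per operation equals $\alpha$ in both cases, while the dynamic cost per operation is $aN^2/T^2$ for the serial computer and $a/(bT)^2$ for the parallel one.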
The provisioning term $\beta/N$ increases the energetic cost of the parallel computer, but it decreases with the problem size and becomes marginal for sufficiently large $N$. Without algorithmic overhead, the parallel computer has an even better scaling behavior than the serial computer, limited only by the size of the provisioning initially needed. Further interesting behavior is found when algorithmic overhead (S7) is combined with non-ideal computer hardware. For a general finite-time scaling $W_{\mathrm{dyn}} = a/\tau^m$, the work cost takes the general form
$$\frac{W^{\mathrm{ser}}_{\mathrm{tot}}}{N} = kT\ln 2 + \frac{a N^m}{T^m} + \alpha \tag{S10}$$
for the serial computer, and
$$\frac{W^{\mathrm{par}}_{\mathrm{tot}}}{N} = \left(1 + \frac{N_{\mathrm{ove}}(n)}{N}\right) kT\ln 2 + \frac{a}{\tau_p^m}\left(1 + \frac{N_{\mathrm{ove}}(n)}{N}\right)^{m+1} + \alpha\left(1 + \frac{N_{\mathrm{ove}}(n)}{N}\right) + \frac{\beta}{N} \tag{S11}$$
for the parallel computer, taking into account also the algorithmic overhead. The interplay between overhead and non-ideal properties of the computer results in a parallel work cost (S11) that now scales with the order $m$ of the finite-time dynamical work cost $W_{\mathrm{dyn}}$. From (S10) and (S11), one can determine how large the scaling of the algorithmic parallel overhead can be while still allowing for an energetically favorable parallel solution. For large $N$, the leakage currents and provisioning work become insignificant compared to the dynamical part in (S11), as the latter always scales at least quadratically with the overhead $N_{\mathrm{ove}}(n)/N$ (since $m \ge 1$), while the former scale at most linearly.
The relevant order-$m$ behavior for the serial computer is then $W^{\mathrm{ser}}_{\mathrm{tot}}/N \propto 1/\tau_s^m = N^m/T^m$, while for the parallel computer the highest-order behavior is $W^{\mathrm{par}}_{\mathrm{tot}}/N \propto (N_{\mathrm{ove}}/N)^{m+1} \propto (N^k/N)^{m+1}$, for $N_{\mathrm{ove}} \propto n^k \propto N^k$ and $n = bN$. It follows that the serial and parallel $N$ scalings become identical when $N^{2m+1} = N^{k(m+1)}$, or $k = (2m+1)/(m+1)$. Since $(2m+1)/(m+1) < 2$ for every finite $m$, an overhead with $k \ge 2$ cannot satisfy this condition for any finite-time scaling $m$, and the parallel computer then scales energetically worse than the serial realization. For modern-day electronic computers, the finite-time scaling is given by $m = 2$, which corresponds to an energetically favorable scaling of the overhead when $k \le 5/3$.
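The critical overhead exponent $k = (2m+1)/(m+1)$ can be tabulated exactly with a few lines of Python. This is a minimal sketch using rational arithmetic; the function names are hypothetical, and the strict inequality in `parallel_scales_better` is an assumption, since at equality the two scalings are identical.

```python
from fractions import Fraction

def critical_overhead_exponent(m):
    """Exponent k at which serial and parallel costs scale identically with N."""
    m = Fraction(m)
    return (2 * m + 1) / (m + 1)

def parallel_scales_better(k, m):
    """True if an overhead N_ove ~ n^k leaves the parallel computer asymptotically cheaper."""
    return Fraction(k) < critical_overhead_exponent(m)

for m in (1, 2, 3, 10):
    print(f"m = {m}: critical k = {critical_overhead_exponent(m)}")
```

For $m = 1$ this gives $3/2$ and for $m = 2$ it gives $5/3$, matching the thresholds quoted for ideal and electronic computers; the value approaches but never reaches 2 as $m$ grows.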

S5. REVERSIBLE COMPUTING
The Landauer limit of heat dissipation of $kT\ln 2$ per bit only applies to irreversible computing operations. For reversible computing, this bound can in principle be brought down to zero [2][3][4]. In that case, the Landauer term $kT\ln 2$ may be dropped from Eqs. (3) and (4), leaving only the finite-time dissipative terms of the ideal serial and parallel computers,
$$\frac{W^{\mathrm{ser}}_{\mathrm{tot}}}{N} = \frac{aN}{T}, \tag{S12}$$
$$\frac{W^{\mathrm{par}}_{\mathrm{tot}}}{N} = \frac{a}{bT}. \tag{S13}$$
For an ideal serial computer, the difference between reversible (Fig. S4, green line) and irreversible (Fig. S4, blue dotted line) computing quickly becomes negligible as the number of operations per second increases. In this case, the increasing dissipative term dominates over the quasistatic limit, which explains why reversible computing has so far not gained any practical relevance. In contrast, a parallel computer can take full advantage of the lower quasistatic limit of reversible computing (compare Fig. S4, red and orange lines). This indicates that future highly parallel computers may benefit from reversible circuit implementations.