Tree-based machine learning performed in-memory with memristive analog CAM

Tree-based machine learning techniques, such as Decision Trees and Random Forests, are top performers in several domains as they do well with limited training datasets and offer improved interpretability compared to Deep Neural Networks (DNN). However, these models are difficult to optimize for fast inference at scale without accuracy loss in von Neumann architectures due to non-uniform memory access patterns. Recently, we proposed a novel analog content addressable memory (CAM) based on emerging memristor devices for fast look-up table operations. Here, we propose for the first time to use the analog CAM as an in-memory computational primitive to accelerate tree-based model inference. We demonstrate an efficient mapping algorithm leveraging the new analog CAM capabilities such that each root-to-leaf path of a Decision Tree is programmed into a row. This new in-memory compute concept enables few-cycle model inference, dramatically increasing throughput by 10³× over conventional approaches.
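The row-per-path mapping described above can be sketched in plain Python. The toy tree, the node encoding, and the helper names below are illustrative assumptions, not the paper's implementation: each root-to-leaf path is flattened into one CAM row of per-feature acceptance ranges, and a lookup returns the classes of every row the input falls into, emulating the parallel match-line search.

```python
# Sketch: map each root-to-leaf path of a decision tree to an analog CAM row.
# Each row stores, per feature, an acceptance range [lo, hi]; an input vector
# "matches" the row (activating its match line) iff every feature falls in range.
# The tree below is a hypothetical toy example, not the paper's KUL model.

LO, HI = float("-inf"), float("inf")

# Internal node: ("split", feature_index, threshold, left_subtree, right_subtree)
# Leaf: ("leaf", class_label). Convention: left branch means x[f] <= t.
tree = ("split", 0, 0.5,
        ("leaf", "A"),
        ("split", 1, 0.3, ("leaf", "B"), ("leaf", "C")))

def tree_to_cam_rows(node, n_features, ranges=None):
    """Enumerate root-to-leaf paths as (per-feature ranges, class) CAM rows."""
    if ranges is None:
        ranges = [[LO, HI] for _ in range(n_features)]
    if node[0] == "leaf":
        return [([r[:] for r in ranges], node[1])]
    _, f, t, left, right = node
    lo, hi = ranges[f]
    rows = []
    ranges[f] = [lo, min(hi, t)]        # left branch: x[f] <= t
    rows += tree_to_cam_rows(left, n_features, ranges)
    ranges[f] = [max(lo, t), hi]        # right branch: x[f] > t
    rows += tree_to_cam_rows(right, n_features, ranges)
    ranges[f] = [lo, hi]                # restore for sibling paths
    return rows

def cam_search(rows, x):
    """Emulate the parallel CAM match: classes of all matching rows."""
    return [cls for ranges, cls in rows
            if all(lo <= xi <= hi for xi, (lo, hi) in zip(x, ranges))]

rows = tree_to_cam_rows(tree, n_features=2)
print(cam_search(rows, [0.8, 0.2]))   # → ['B'] (x0 > 0.5, x1 <= 0.3 path)
```

In a Random Forest, each tree contributes its own set of rows, and the matched classes feed the majority vote.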

Supplementary Figure 6. Random forest training pre-processing. The same procedure as the reference benchmark 2 was applied. The KUL Belgium traffic sign dataset consists of 32 × 32, 256-level RGB images, which were reshaped to 16 × 16. A histogram equalization technique was then applied, and finally each image was converted to greyscale. After flattening to a 1 × 256 array, the image is applied to the analog CAM DLs, and the matched class activates the corresponding memristors on the RAM array for majority voting.

Supplementary Figure 7. Digital to analog converter (DAC) design and DL power consumption calculation. (a) Schematic of the current steering DAC considered for this work 3 . R_out was optimized by considering the clock time and the parasitic resistance and conductance due to wires. (b) Schematic of a DL circuit, where each analog CAM cell has a parasitic capacitance C_DL and analog CAM cells are connected with each other by a wire of resistance R_w. Elmore's theorem was used to compute the propagation delay, namely

t_DL = R_w · C_DL · H(H + 1)/2,

with H the height of the analog CAM tile, i.e. the number of analog CAM cells in a column.
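The Elmore estimate for this uniform RC chain can be checked numerically. This is a minimal sketch: the R_w and C_DL values are placeholders, not the paper's extracted parasitics, and the closed form R_w · C_DL · H(H + 1)/2 is the standard Elmore result for a chain of H identical segments.

```python
# Elmore-delay estimate for a data line (DL) modeled as an RC chain:
# H analog CAM cells, each contributing a parasitic capacitance C_DL,
# joined by wire segments of resistance R_w. The Elmore delay at the far
# end is the sum over cells of (capacitance x total upstream resistance).

def elmore_delay(H, R_w, C_DL):
    # Cell k sees k wire segments of resistance R_w between it and the driver.
    return sum(k * R_w * C_DL for k in range(1, H + 1))

def elmore_delay_closed_form(H, R_w, C_DL):
    return R_w * C_DL * H * (H + 1) / 2

R_w, C_DL = 1.0, 2e-15          # ohms, farads (placeholder values)
for H in (16, 64, 480):
    assert abs(elmore_delay(H, R_w, C_DL)
               - elmore_delay_closed_form(H, R_w, C_DL)) < 1e-24
print(elmore_delay_closed_form(480, R_w, C_DL))   # delay grows ~ H^2
```

The quadratic growth with H is what motivates optimizing the tile height against the clock period.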

Supplementary Figure 8. Sense amplifier (SA) circuit schematic. V_in divides the current of the input differential stage unevenly between the branches of the second stage. The imbalance causes the second stage to snap (either to V_DD or V_SS) when the clocks are activated. The NAND gates at the output form a latch that holds the value.

Supplementary Figure 9. Energy per decision as a function of the height (H) and width (W) of each analog CAM tile. The trend is opposite to that of the energy per decision per node, which is due to the optimal pipeline (i.e. continuous input data flow) considered.

Supplementary Figure 10. Breakdown of power consumption for the different components, namely pre-charge and sense amplifiers (PC/SA), digital to analog converters (DAC), and analog CAM arrays (aCAM), for the 180 nm (a) and 65 nm (b) technology nodes. As the technology node scales down, the contribution of the analog CAM cells to the overall power becomes dominant, because the conductance range stays approximately fixed. Analog CAM cell transistors are scaled down with the technology node, but T1 and T3 are oversized to allow enough current for programming the memristor device to low conductance. In this case, the array dimensions are kept constant with H = 480 and W = 16.

Supplementary Note 1: Compact model details
Data collected from the post-tape-out simulations were modeled with the following equations:

Supplementary Note 2: Performance calculation

In tree-based algorithm acceleration, the figures of merit are normalized to the node operation 2 . In fact, the basic operation that needs to be performed during inference is the evaluation of each threshold and the decision on which direction to go. This includes a comparison and a memory lookup for the next comparison to make. For KUL Belgium Traffic Signs dataset classification we implemented a RF with 15 trees and a maximum depth of 10, which resulted in N_o = 4000 nodes to be evaluated. Considering that each analog CAM cell can map n_bits = 4, we consider a 4-bit operation as the minimum operation size. We can compute the performance τ for the pipelined architecture with throughput f,

and the energy efficiency η, which correspond to improvements of 461× and 7.64×, respectively, over the state of the art 2 .
Note that this calculation considers the peak performance, which is reached for an optimal usage of the memory, with all the memory elements used at each cycle. This should be considered not as the algorithmic performance but as the hardware performance. In fact, considering the algorithm, a conventional non-parallel processor, given a tree with ν nodes, needs to perform h = log₂ ν operations to traverse the tree and reach a leaf; h is typically referred to as the height of the tree. From this, the total number of algorithmic operations N_1 that the analog CAM performs at each cycle can be obtained, and thus the algorithmic performance can also be computed. For simplicity, the hardware and algorithmic performance of the analog CAM and SRAM based 2 accelerators are summarized for comparison. The calculation highlights that the main benefit of the analog CAM accelerator is the massive parallelism of its operation, which should be taken into account during the mapping procedure.
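The sequential-versus-parallel argument above can be made concrete with a few lines of Python. The per-tree node count below is a rough average derived from the figures quoted in the text (15 trees, ~4000 nodes total); the balanced-tree assumption behind h = log₂ ν is the same one used in the text.

```python
import math

# Hardware vs. algorithmic operation counts for tree inference.
# A sequential processor traverses a balanced tree of nu nodes in
# h = log2(nu) compare+lookup steps; the analog CAM evaluates every
# root-to-leaf path of every tree in a single cycle.

def sequential_ops_per_tree(nu):
    """Compare/lookup operations to reach a leaf in a balanced tree."""
    return math.ceil(math.log2(nu))

# Figures from the text: RF with 15 trees, max depth 10, ~4000 nodes total.
n_trees, n_nodes = 15, 4000
nodes_per_tree = n_nodes // n_trees          # ~266 nodes per tree on average

seq_ops = n_trees * sequential_ops_per_tree(nodes_per_tree)
print(seq_ops)   # sequential compare+lookup steps for one forest inference
print(1)         # analog CAM cycles: all paths are matched in parallel
```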

Supplementary Note 3: Area comparison with SRAM accelerator
Previously 3 , we optimized the cell layout of the analog CAM at the 16 nm technology node along with the peripheral circuits. The resulting 6T2M analog CAM cell area is calculated to be A_6T2M = 0.51 µm². To perform RF inference, the state of each match line (ML) needs to be stored in a register. The register area per bit at 16 nm is A_reg = 5.5 µm². After collecting all the ML results, the outputs need to be aggregated with an AND operation before performing the majority voting. Majority voting can be performed with traditional digital logic or with a matrix vector multiplication on a 1T1R RRAM array, as presented in the main text.
Here, for simplicity, we consider computing the majority voting with traditional digital logic. Each logic gate per bit has an area of A_logic = 0.28 µm² · 2 = 0.56 µm², where the factor 2 accounts for the logic placement utilization due to routing overhead. The area of the current steering digital to analog converter (Supplementary Figure 7) is A_DAC = 10 · A_u · N_ch + 5 · A_mir, where A_u = 0.046 µm² is the area of a transistor with a single finger, A_mir = 0.061 µm² is the area of the current mirror transistors, and N_ch is the total number of desired channels.
Hence, the total area occupation of the analog CAM accelerator in 16 nm technology can be computed by summing the contributions of all components, where N_arrays = 29 is the total number of arrays in the architecture, H = 480 and W = 16 are the number of rows and columns in each array respectively, N_ch = W · N_arrays for the fully pipelined implementation, A_AND is the area of an AND gate, A_MAJ = 3 · A_AND is the area of a majority voting gate, and OH = 0.2 is an extra overhead factor for placement and routing. With a fixed clock frequency of f_clk = 1 GHz, the throughput at 16 nm remains the same as previously calculated at 65 nm (Supplementary Note 2), leading to the same performance τ, from which the area efficiency α can be computed.

To compare this result with the previously reported SRAM based accelerator 2 , which is based on 65 nm technology, we first need to scale it to 16 nm. In the TSMC process, the high density 6T SRAM bit cell has an area of 0.499 µm² at 65 nm and 0.07344 µm² at 16 nm; hence the scaling factor is 0.499/0.07344 ≈ 6.8. Note, however, that in the reference the bit cell area is 2.11 µm × 0.92 µm = 1.94 µm², which is several times larger than the high density bit cell. This is due to up-sizing of the pass-gate and pull-down transistors to reduce the resistance of the bit line discharge path, which helps to increase the bit line voltage discharge swing. It can be expected that the 16 nm bit cell would need to be up-sized in a similar way, keeping the ratio of ≈ 6.8. Thus we can estimate the area of the 6T SRAM bit cell at 16 nm as A_6T = (1.94 µm²)/6.8 = 0.285 µm². Similarly, we can compute the scaling of the logic. An approximate metric for density scaling is the product of the contacted poly pitch (CPP) and the minimum metal pitch (MXP). In the TSMC process, CPP × MXP = 0.16 × 0.18 µm² at 65 nm and CPP × MXP = 0.09 × 0.064 µm² at 16 nm, leading to a scaling factor of (0.16 × 0.18)/(0.09 × 0.064) = 5.
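A rough tally of the 16 nm area can be sketched as follows. The per-component areas are the constants quoted in the text, but how they combine into the total is our assumption (per-array CAM cells, per-match-line registers, and DAC channels, inflated by the placement/routing overhead OH; the voting logic is omitted), so the resulting number is illustrative only.

```python
# Hedged sketch of the 16 nm area tally for the analog CAM accelerator.
# Component areas are the constants from the text; the way they are summed
# into A_tot is an assumption, not the paper's exact expression.

A_6T2M = 0.51      # um^2, analog CAM cell
A_reg  = 5.5       # um^2, register bit per match line
A_u    = 0.046     # um^2, single-finger DAC transistor
A_mir  = 0.061     # um^2, current-mirror transistor
N_arrays, H, W = 29, 480, 16
OH = 0.2           # placement/routing overhead factor

N_ch  = W * N_arrays                     # DAC channels, fully pipelined
A_DAC = 10 * A_u * N_ch + 5 * A_mir      # DAC area formula from the text
A_cam = N_arrays * H * W * A_6T2M        # all analog CAM cells
A_regs = N_arrays * H * A_reg            # one register bit per match line

# Assumed combination; voting logic (A_AND, A_MAJ) omitted for brevity.
A_tot_um2 = (A_cam + A_regs + A_DAC) * (1 + OH)
print(A_tot_um2 / 1e6, "mm^2")
```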
Nevertheless, improvements in cell library design for advanced technology nodes can further improve this scaling ratio. For example, TSMC reports up to a 2.35× density improvement from 65 to 40 nm, a 2× improvement from 40 to 22 nm, and a 1.44× improvement from 22 nm to 16 nm, yielding a cumulative logic density improvement of 6.8× from 65 to 16 nm. To consider a best-case scenario for the SRAM logic design at 16 nm, we take a logic density scaling factor of 6.8×, which matches the scaling factor for the SRAM cell. We can directly measure the memory area (bit cell array) and logic (CTRL) from the layout 2 , which at 65 nm results in a total of A_SRAM65 = 0.56 mm². By applying the aforementioned scaling rules, we can compute the area at 16 nm as A_SRAM16 = A_SRAM65/6.8 = 0.082 mm², which leads to the corresponding area efficiency. While the area occupation of the SRAM based accelerator is significantly lower than that of the analog CAM accelerator, the resulting area efficiency is also drastically lower. This is due to the full parallelism that can be achieved thanks to the analog CAM. It can thus be concluded that the analog CAM accelerator is 142× more area efficient than the SRAM based accelerator 2 .
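The scaling-factor arithmetic above is straightforward to verify numerically; all numbers below are taken directly from the text.

```python
# Technology-scaling arithmetic from the text, checked numerically.

# SRAM bit cell: 65 nm high-density cell vs. 16 nm high-density cell.
sram_scale = 0.499 / 0.07344            # ~6.8
A_6T_16nm = 1.94 / sram_scale           # up-sized 65 nm cell scaled to 16 nm

# Logic density: contacted poly pitch (CPP) x minimum metal pitch (MXP).
logic_scale = (0.16 * 0.18) / (0.09 * 0.064)   # = 5.0

# Cumulative node-to-node library improvements reported by TSMC.
cumulative = 2.35 * 2.0 * 1.44          # ~6.8 from 65 nm to 16 nm

# Scaled SRAM accelerator area at 16 nm.
A_SRAM16 = 0.56 / 6.8                   # mm^2

print(round(sram_scale, 2), round(logic_scale, 2),
      round(cumulative, 2), round(A_SRAM16, 3))
```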