Abstract
In this work a novel architecture, named pseudo-softmax, to compute an approximated form of the softmax function is presented. This architecture can be fruitfully used in the last layer of Neural Networks and Convolutional Neural Networks for classification tasks, and in Reinforcement Learning hardware accelerators to compute the Boltzmann action-selection policy. The proposed pseudo-softmax design, intended for efficient hardware implementation, exploits the typical integer quantization of hardware-based Neural Networks, obtaining an accurate approximation of the result. In the paper, a detailed description of the architecture is given, and an extensive analysis of the approximation error is performed using both custom stimuli and real-world Convolutional Neural Network inputs. The implementation results, based on CMOS standard-cell technology, show reduced approximation errors compared to state-of-the-art architectures.
Introduction
The softmax function is one of the most important operators in the field of Machine Learning^{1}. It is used in the last layer of classification Neural Networks (NNs) and Convolutional Neural Networks (CNNs) to normalize the raw output of such systems.
The softmax function equation is:

$$p_i = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}} \qquad (1)$$

where \(x_i\) are the outputs of a machine learning network and \(i = 1, \ldots, N\). In other words, the outputs of the network \(x_i\) are processed to represent the probability \(p_i\) that the inference output belongs to a certain class (Fig. 1).
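As a reference for the hardware approximations discussed below, Eq. (1) can be sketched in a few lines of Python (the subtraction of the maximum is a standard numerical-stability measure, not part of the hardware design):

```python
import math

def softmax(x):
    """Softmax of Eq. (1): p_i = e^{x_i} / sum_j e^{x_j}."""
    m = max(x)  # subtracting the max avoids overflow; the factor e^{-m} cancels out
    exps = [math.exp(v - m) for v in x]
    s = sum(exps)
    return [e / s for e in exps]
```

The outputs always sum to one, which is the probability-distribution property the approximations discussed below must preserve.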
In recent years, the literature has proposed many hardware architectures for the inference process of NNs and CNNs, both on ASIC and FPGA^{2,3,4}, characterized by high speed and low power consumption. The optimization of these hardware architectures is obtained both by the use of approximation algorithms and by the integer quantization of the arithmetic, usually with 8-bit integers (INT8).
Unfortunately, the softmax function, unlike other operators used in Machine Learning^{5,6,7,8}, cannot be easily implemented in hardware because of the exponential and division operators. Moreover, even off-the-shelf NN and CNN synthesis tools are not able to provide a hardware softmax implementation^{9,10}, and the function is computed using a standard software approach.
In this work, we introduce the pseudo-softmax architecture, with the aim of enabling an efficient hardware implementation of the softmax layer in hardware-implemented NNs and CNNs.
Different solutions for the hardware implementation of the softmax function can be found in the literature but, unfortunately, each work focuses on different aspects of the design process, making comparisons difficult.
In the following, recent relevant work on the softmax function is summarized by highlighting the most innovative aspects of each work.
Yuan^{11} proposed an implementation that uses a logarithmic transformation to avoid the division, but no final conversion back to the original domain is considered. The exponential operations are simply carried out via Look-Up Tables (LUTs). A comparison of the number of operations performed by a standard LUT-based divider and the proposed method is given.
Geng et al.^{12} proposed two architectures that compute the exponential function both via LUT and linear approximation. The division is carried out by finding the closest power of 2, thus only shift operations are needed. The accuracy was tested on real CNNs and an ASIC implementation is presented.
Li et al.^{13} proved that LUT implementations of the exponential function are the best trade-off between precision and speed, compared to Taylor series and CORDIC^{14} implementations. The division is performed by bit-shifting. They presented two serial implementations, both on FPGA and ASIC, giving data about clock speed and resources. No information is provided on the latency of the architecture.
Baptista et al.^{15} proposed a High-Level Synthesis (HLS) FPGA implementation for a specific CNN application. The exponents of the softmax formula are split into integer and fractional parts. The integer parts are evaluated using a ROM approach, while polynomial approximation is used for the fractional parts. The results are given as the global accuracy of the overall Machine Learning system, without focusing on the softmax computation.
Wang et al.^{16} proposed an interesting architecture that exploits the fact that every number can be split into an integer and a fractional part. The implementation avoids any direct division or exponential operation by using Leading One Detectors (LODs), bit shifters, and constant multipliers. The authors considered the output of their system correct if the difference with respect to the original softmax value lies below a given threshold. The architecture was implemented both on ASIC and FPGA, and information about clock frequency, hardware resources, power dissipation, and throughput is provided.
Sun et al.^{17} proposed an FPGA serial implementation that splits every exponential operation into multiple operations to reduce the size of each ROM. The division is carried out via bit-shifting. The authors provided information about clock frequency, hardware resources, and power dissipation, but no data about the precision of the system.
Hu et al.^{18} proposed an Integral Stochastic Computation (ISC) scheme to evaluate the exponential operator. The division is avoided by a logarithmic transformation. No data about the precision of the system is provided. They implemented the architecture on an FPGA, but the maximum achievable clock frequency is not reported.
Du et al.^{19} proposed a tunable precision block for the exponentiation based on a variable number of LUTs. The architecture has been implemented both in ASIC and FPGA and data about clock frequency, hardware resources, and power dissipation are provided.
Kouretas and Paliouras^{20} implemented an approximated equation that takes into account only the exponents of the softmax formula and that replaces the summation with the highest input value. They compute the accuracy by using custom inputs and they show the hardware resources needed for an ASIC implementation of the architecture.
Wang et al.^{21} showed a CNN application that makes use of a precision-tunable softmax layer controlled in software, but no detailed data about the softmax implementation is provided.
Di Franco et al.^{22} proposed a straightforward FPGA implementation of the softmax by using a linear interpolating LUT for the exponential function. No information about the accuracy is provided.
Kouretas and Paliouras^{23} extended their previous work^{20} by improving the accuracy analysis with real-world CNNs and by adding information about the ASIC implementation.
At the time of writing, the work in^{23} can be considered the state of the art in hardware implementations of the softmax function, and it will be used for comparisons in the following sections.
Pseudo-softmax function
In order to simplify the computation of the softmax function in Eq. (1), we introduce a new approximated expression named pseudo-softmax:

$$\widetilde{p}_i = \frac{2^{x_i}}{\sum_{j=1}^{N} 2^{x_j}} \qquad (2)$$

in which the exponent base e is replaced by 2. An extensive analysis of the introduced error is discussed in the following section. As in the case of the softmax function, the summation of the pseudo-softmax outputs is always equal to one. Consequently, the values \(\widetilde{p}_i\) can be interpreted as probabilities.
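A minimal numerical sketch of Eq. (2), useful to compare it against the exact softmax (the input values below are arbitrary INT8 examples):

```python
def pseudo_softmax(x):
    """Pseudo-softmax of Eq. (2): the base e of the softmax is replaced by 2."""
    powers = [2.0 ** v for v in x]
    s = sum(powers)
    return [p / s for p in powers]
```

As with the softmax, the outputs sum to one and preserve the ordering of the inputs, so they can still be read as class probabilities.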
As stated in the Introduction, the hardware implementations of NN systems make use of integer quantization, typically 8-bit integers (INT8). The reason for using powers of two \(2^{x_i}\) in Eq. (2) is that the integer numbers \(x_i\) can be interpreted as the exponents of floating-point (FLP) numbers, allowing for an efficient hardware implementation.
According to the conventional base-2 FLP representation, a positive number a is represented as:

$$a = 2^{b} \cdot 1.c \qquad (3)$$

where b is the integer exponent and c is the fractional mantissa. Consequently, the numerator in Eq. (2) can be considered as a number \(a_i = 2^{x_i}\), where \(b = x_i\) and \(c = 0\), and Eq. (2) can be rewritten as

$$\widetilde{p}_i = \frac{a_i}{\sum_{j=1}^{N} a_j} \qquad (4)$$
Similarly, for the denominator in Eq. (4), the sum can be rewritten as

$$\text{sum} = \sum_{j=1}^{N} a_j = 2^{\text{exp}_{\text{sum}}} \cdot 1.\text{mant}_{\text{sum}} \qquad (5)$$
and the pseudo-softmax function as

$$\widetilde{p}_i = \frac{a_i}{2^{\text{exp}_{\text{sum}}} \cdot 1.\text{mant}_{\text{sum}}} \qquad (6)$$
Substituting back \(a_i = 2^{x_i}\), we obtain

$$\widetilde{p}_i = \frac{2^{x_i}}{2^{\text{exp}_{\text{sum}}} \cdot 1.\text{mant}_{\text{sum}}}$$

that can be rewritten as

$$\widetilde{p}_i = 2^{\,x_i - \text{exp}_{\text{sum}}} \cdot \frac{1}{1.\text{mant}_{\text{sum}}} \qquad (7)$$
The expression in Eq. (7) of the pseudo-softmax function shows that the output \(\widetilde{p}_i\) is a FLP number with exponent \((x_i - \text{exp}_{\text{sum}})\) and with mantissa \(1/(1.\text{mant}_{\text{sum}})\), i.e., the reciprocal of the mantissa of the summation. The mantissa is common (constant) for all \(\widetilde{p}_i\)s, and it is computed only once.
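The factorization of Eq. (7) can be checked numerically: Python's `math.frexp` recovers the exponent and mantissa of the sum, and the reciprocal of the mantissa is computed once and shared by all outputs (a behavioral sketch, not the fixed-point hardware):

```python
import math

def pseudo_softmax_eq7(x):
    """Pseudo-softmax via Eq. (7): p~_i = 2^(x_i - exp_sum) * 1/(1.mant_sum)."""
    s = sum(2.0 ** v for v in x)
    m, e = math.frexp(s)                 # s = m * 2**e, with m in [0.5, 1)
    mant_sum, exp_sum = 2.0 * m, e - 1   # renormalize so mant_sum is in [1, 2)
    recip = 1.0 / mant_sum               # computed once, common to all outputs
    return [2.0 ** (v - exp_sum) * recip for v in x]
```

The result matches the direct evaluation of Eq. (2) up to floating-point rounding.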
Hardware architecture
The pseudo-softmax function in Eq. (7) is implemented by the hardware architecture shown in Fig. 2.
As stated in the Introduction, the 8-bit integer inputs \(x_i\), with range \([-128, 127]\), are interpreted as the exponents of FLP numbers. The denominator of Eq. (7) is the mantissa of the FLP sum. The outputs are unsigned FLP numbers represented using 17 bits: a 9-bit exponent and an 8-bit fractional mantissa with implicit integer bit (always 1). The 9-bit (unbiased) exponent guarantees no overflow for maximum values \(x_i = 127\) and a number of inputs \(N < 128\). There is no representation for zero, which can be detected by comparing \(\widetilde{p}_i\) to a sufficiently small threshold value. A negative exponent makes the floating-point number smaller than 1.0, but all output numbers are positive; therefore, the sign bit is not necessary.
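The 17-bit output format can be illustrated with a small packing helper. The field order (exponent first) is an assumption made here for illustration; the text only specifies the field widths:

```python
def pack_output(exp9, mant_frac8):
    """Pack a pseudo-softmax output into 17 bits.

    exp9: signed 9-bit (unbiased) exponent, stored in two's complement.
    mant_frac8: the 8 fractional mantissa bits; the integer bit of the
    mantissa is implicit (always 1) and no sign bit is stored, since all
    outputs are positive. The field order is assumed, not taken from the paper.
    """
    assert -256 <= exp9 <= 255 and 0 <= mant_frac8 <= 255
    return ((exp9 & 0x1FF) << 8) | mant_frac8

def unpack_output(word17):
    """Inverse of pack_output: returns (exponent, fractional mantissa bits)."""
    exp9 = (word17 >> 8) & 0x1FF
    if exp9 >= 256:           # restore the two's-complement sign
        exp9 -= 512
    return exp9, word17 & 0xFF
```

The represented value is \(2^{\text{exp9}} \cdot (1 + \text{mant\_frac8}/256)\).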
The unit in Fig. 2 is composed of three main blocks: a tree of FLP adders computing \(\text{sum} = 2^{\text{exp}_{\text{sum}}} \cdot 1.\text{mant}_{\text{sum}}\); a piecewise linear (PWL) interpolation block computing the reciprocal; and an array of integer subtractors computing \((x_i - \text{exp}_{\text{sum}})\).
In the following subsections, more detail on the main blocks of Fig. 2 is given. The wordlength sizes in the circuits are represented in gray characters; thin arrows represent 1-bit signals.
Floatingpoint adder tree
We opted for a binary tree of FLP adders, which is modular and easy to design. If the delay (and hence the throughput) is problematic, the binary tree can easily be pipelined after each adder to meet the timing constraints.
The architecture of the FLP adder tree for N = 6 is shown in Fig. 3a.
The \(x_i\) of Eq. (2) are the exponents of FLP numbers whose mantissas are 1.0.
The architecture of the FLP adder^{24} is shown in Fig. 3b. Since it operates on positive FLP numbers, its architecture is simplified.
First, the exponent difference d is computed to find the amount of shifting necessary for the alignment of the mantissas. The larger exponent is selected as the exponent of the result. The alignment is performed by a barrel shifter (block \(\gg\) in Fig. 3b) that shifts the mantissa of the smaller number d positions to the right. If \(d \ge 8\), the mantissa of the smaller number is flushed to zero, and no actual addition is performed. When \(d = 0\) (same exponent), the addition of the normalized mantissas results in an overflow (\(\text{mant} \ge 2.0\)) and the result must be normalized by incrementing the exponent by one and dividing the mantissa by two, i.e., right-shifting the result by 1 position (block \(\gg 1\)).
An additional simplification is applied to the FLP adders in the first level of the adder tree (Fig. 3c). Since the input values \(x_i\) are powers of two and their mantissas are 1.0, there is no need to swap the (identical) mantissas according to d. The barrel shifter is also simplified because its input is the constant 1.0. When \(d = 0\), i.e., \(x_i = x_j\), the result of the addition of the two mantissas is \(\text{mant} = 2.0\) (overflow). However, since the fractional bits are all zero, right-shifting is not necessary, and the normalization is done by incrementing the exponent of the result only.
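The align-add-normalize behavior described above can be sketched as follows (a behavioral model of the positive-only adder of Fig. 3b, with 8 fractional mantissa bits assumed; the real unit works in fixed-point):

```python
def flp_add(exp_a, mant_a, exp_b, mant_b, frac_bits=8):
    """Add two positive FLP numbers given as (exponent, mantissa in [1, 2))."""
    # select the larger-exponent operand; its exponent becomes the result's
    if exp_b > exp_a:
        exp_a, mant_a, exp_b, mant_b = exp_b, mant_b, exp_a, mant_a
    d = exp_a - exp_b
    if d >= frac_bits:
        return exp_a, mant_a  # smaller mantissa flushed to zero, no addition
    mant = mant_a + mant_b / (1 << d)  # barrel-shift right by d, then add
    if mant >= 2.0:
        mant /= 2.0  # normalize the overflow: shift the mantissa right by 1...
        exp_a += 1   # ...and increment the exponent
    return exp_a, mant
```

In the first tree level both mantissas are the constant 1.0, so the \(d = 0\) case collapses to incrementing the exponent only: \(\text{mant} = 2.0\) has no fractional bits to shift.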
Piecewise linear reciprocal block
In the computation of the probabilities \(\widetilde{p}_i\), the mantissa is common to all \(\widetilde{p}_i\)s; consequently, a single reciprocal operation is sufficient. Moreover, because of the normalization, the mantissa lies in the range [1, 2).
For the reciprocal \(y = 1/x\), we opted for a piecewise linear (PWL) polynomial approximation in two intervals:

$$y = \begin{cases} m_1 x + q_1, & 1.0 \le x < 1.5 \\ m_2 x + q_2, & 1.5 \le x < 2.0 \end{cases} \qquad (9)$$

The coefficients of the polynomials were chosen, by incremental refinement, as the values closest to powers of two, to simplify the hardware. By expressing the coefficients of (9) in binary and as powers of two, we obtain Eq. (10).
The resulting reciprocal approximation unit is shown in Fig. 4a.
Since the intervals in Eq. (10) are determined by \(\text{mant}_{\text{sum}}\) being greater or smaller than 1.5, the MSB of the fractional part of the mantissa, i.e., the bit with weight \(2^{-1}\), is used to select the interpolating polynomial.
Figure 4b shows the plots of \(y=1/x\) and of the interpolating polynomials in [1.0, 2.0).
The reciprocal approximation error is obtained by exhaustive simulation in fixed-point. The maximum absolute error is \(0.03125 = 2^{-5}\), obtained for \(x = 1.0\) (Fig. 4b), while the average error is 0.011151. This is a good trade-off between error and hardware complexity (Fig. 4a).
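The segment-selection mechanism can be sketched as follows. Since the exact coefficients of Eq. (10) are not reproduced here, the power-of-two coefficients below are illustrative stand-ins, so the error figures of this sketch differ from the ones reported above:

```python
def pwl_reciprocal(mant):
    """PWL approximation of y = 1/mant on [1.0, 2.0), using two segments.

    The segment is selected by whether mant >= 1.5, i.e. by the MSB of the
    fractional part (the bit with weight 2^-1), as in the hardware unit.
    The coefficients are illustrative powers of two, NOT the exact values
    of Eq. (10) in the paper.
    """
    assert 1.0 <= mant < 2.0
    if mant < 1.5:
        return 1.5 - 0.5 * mant     # segment for [1.0, 1.5)
    return 1.0625 - 0.25 * mant     # segment for [1.5, 2.0)
```

With these stand-in coefficients the maximum error on a fine grid stays below 0.1; the refined coefficients of the paper bring it down to \(2^{-5}\).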
Pseudo-Boltzmann architecture for reinforcement learning hardware accelerators
The proposed pseudo-softmax formula in Eq. (2) can be adapted to implement the Boltzmann action-selection policy^{25} for Reinforcement Learning (RL) systems. An efficient architecture would allow such a policy to be implemented in state-of-the-art RL hardware accelerators^{26,27}.
The formula of the Boltzmann policy is:

$$p_i = \frac{e^{x_i/\tau}}{\sum_{j=1}^{N} e^{x_j/\tau}} \qquad (11)$$

It is straightforward to see that Eq. (1) is a special case of Eq. (11) with temperature coefficient \(\tau = 1\). In order to avoid the division by \(\tau\), we can consider a power-of-two approximation \(\tau = 2^{T}\), obtaining the pseudo-Boltzmann equation:

$$\widetilde{p}_i = \frac{2^{x_i/2^{T}}}{\sum_{j=1}^{N} 2^{x_j/2^{T}}} \qquad (12)$$

where the division by \(2^{T}\) reduces to a right shift of the inputs by T positions.
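A behavioral sketch of Eq. (12) for integer inputs, where the division by \(\tau = 2^T\) becomes an arithmetic right shift (the shift-based reading of the equation is our interpretation):

```python
def pseudo_boltzmann(x, T):
    """Pseudo-Boltzmann policy of Eq. (12) with tau = 2**T.

    For integer inputs x_i, dividing by 2**T is taken as an arithmetic
    right shift by T positions (x >> T), which floors toward -infinity.
    """
    shifted = [v >> T for v in x]
    powers = [2.0 ** v for v in shifted]
    s = sum(powers)
    return [p / s for p in powers]
```

With T = 0 (\(\tau = 1\)) the policy reduces to the pseudo-softmax of Eq. (2); larger T flattens the distribution, increasing exploration.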
The corresponding hardware architecture is obtained with minor modifications of the pseudo-softmax architecture shown in Fig. 2.
Results
In this section, we provide extensive tests to analyze the precision of the proposed pseudo-softmax. An analysis of the quantization of the architecture is also provided. The pseudo-softmax operator is compared to the hardware-based softmax design illustrated in^{23}. Then, we show the pseudo-softmax ASIC implementation results based on a 90 nm standard-cell CMOS technology. The results are given in terms of propagation delay, silicon area, and power dissipation, and are compared with the implementation in^{23}.
Approximation error analysis
In order to validate the pseudo-softmax function, we performed extensive simulations using both custom inputs and real data from Convolutional Neural Networks (CNNs). We also compared our results with those obtained by the hardware-based softmax implementation described in^{23}.
To easily compare the simulation results with those of^{23}, the performance is evaluated by computing the Mean Square Error (MSE)

$$\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( p_i - \widetilde{p}_i \right)^2$$
Moreover, to be consistent with^{23}, the tests were performed in floatingpoint by assuming quantized integer inputs.
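The metric can be sketched directly, applied here to toy reference/approximation pairs rather than the paper's test vectors:

```python
def mse(p_ref, p_approx):
    """Mean Square Error between reference softmax outputs and an approximation."""
    assert len(p_ref) == len(p_approx)
    return sum((a - b) ** 2 for a, b in zip(p_ref, p_approx)) / len(p_ref)
```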
Custom generated inputs
One test consisted in applying uniformly distributed random inputs in the range \([-2^{7}, 2^{7}-1]\) (INT8) to the pseudo-softmax module. The number of inputs N, corresponding to the number of classes of a neural network, was varied in the range \([2, 1000]\). For every chosen N, we applied 10,000 random patterns to the system inputs \(x_i\).
The MSE as a function of the number of inputs N is shown in Fig. 5a.
The plot shows that the MSE decreases as the number of inputs increases.
We now compare our pseudo-softmax function with the design shown in^{23}, using the same network parameters: 10-bit inputs and \(N = 30\). The input values in^{23} are chosen so as to push the softmax hardware into appropriate “corner cases”.
For the first test case, the input values \(x_i\) are close to 5; these are shown in the second row of Fig. 5b. The other rows in Fig. 5b display the values \(p_i\) for the softmax function, \(\hat{p}_i\) for the softmax approximation of^{23}, and \(\widetilde{p}_i\) for the pseudo-softmax.
Since the input values \(x_i\) in Fig. 5b are close to each other, the softmax values \(p_i\) are also rather close. The outputs \(\hat{p}_i\) are much larger and their sum exceeds 1, violating the principle that the softmax output is a probability density function (PDF). In contrast, the outputs \(\widetilde{p}_i\) preserve the features of a PDF, and all values are close to \(p_i\).
The MSE value for \(\widetilde{p}_i\) is \(\text{MSE}_{\widetilde{p}_i} = 2.7082 \times 10^{-4}\), while \(\text{MSE}_{\hat{p}_i} = 0.8502\).
Figure 5c reports the results of the second test case using the same organization of Fig. 5b.
In this case, the first four inputs \(x_i\) are significantly larger than the remaining \(x_i < 1\). Also in this case, the sum of the \(\hat{p}_i\) outputs is larger than 1, while \(\widetilde{p}_i\) maintains a PDF behavior. However, the MSEs of the two approximations are almost the same: \(\text{MSE}_{\widetilde{p}_i} = 0.0019\) and \(\text{MSE}_{\hat{p}_i} = 0.0018\).
Convolutional Neural Networks benchmarks
The performance of the pseudo-softmax approximation algorithm is also evaluated with real data, using the set of tests performed in^{23} on standard CNNs.
The test is based on the ImageNet dataset^{28}, a 1000-class image classification task. The test is performed with 10,000 inferences on the following networks:

1. ResNet-50^{29},

2. VGG-16^{30},

3. VGG-19^{30},

4. Inception-V3^{31},

5. MobileNet-V2^{32}.
In Fig. 6, the histograms of the MSE are shown, overlapped with the error values obtained by^{23} on the same test sets. For a consistent comparison, all the networks have been quantized to 10 bits. The histograms for^{23} are derived from the figure in that paper and may not be very accurate.
All the MSE values lie in the range \([0, 10 \times 10^{-4}]\). The results of the pseudo-softmax inference show that this method achieves an approximation error one order of magnitude smaller than the method used in^{23}.
Inputs quantization analysis
As stated in the Introduction, typical hardware implementations of NNs are based on INT8 quantization. To assess the impact of the NN quantization in the ImageNet test, the softmax MSE histogram is evaluated while reducing the wordlength of the input values \(x_i\). In Fig. 7a, the MSE values for the 8- and 10-bit quantized VGG-16 networks are very similar; therefore, the pseudo-softmax architecture is quite insensitive to the quantization in that range of bit widths.
Similar results are obtained for the other tested networks.
By further reducing the input wordlength, we reach the minimum MSE achieved in^{23} (\(10^{-3}\) for the 10-bit quantization) when the inputs to our pseudo-softmax unit are 3 bits wide. The comparison of the MSE for the 10,000 patterns of the two methods applied to VGG-16 is illustrated in Fig. 7b. The histogram for^{23} is derived from the figure in that paper and may not be very accurate.
Implementation results
We implemented the pseudo-softmax architecture using a 90 nm, 1.0 V CMOS standard-cell library. Since this standard-cell library has the same feature size as the one used in^{23}, although the vendors may differ, the comparison of the implementation results is sufficiently fair. The synthesis was performed with Synopsys Design Compiler. We considered architectures with 10 classes (N = 10).
The first implementation of the pseudo-softmax unit is for INT8 inputs and N = 10. The results are reported in Fig. 8a. The input-to-output delay is 3.22 ns (the unit is not pipelined). The power dissipation is evaluated at the maximum operating frequency of 310 MHz.
Based on the results of the quantization analysis, the second implementation is a pseudo-softmax unit with 3-bit inputs. This unit gives an \(\text{MSE} \approx 10^{-3}\), similar to the unit in^{23}.
The results of the comparison are displayed in Fig. 8b. For “Architecture in^{23}”, we report the values from that paper for the fastest architecture, identified there as “Figure 2a”. The power dissipation was evaluated at 300 MHz.
Comparing the results in Fig. 8b, the delay is the same, the area of the pseudo-softmax is about 30% larger than the unit in^{23}, and the power figures are not really comparable because no information is given on the clock frequency used to evaluate the power dissipation in^{23}.
However, since the pseudo-softmax unit requires only 3-bit inputs for the same MSE, it is reasonable to assume that the neural network driving it can be quantized to a narrower bit width and be significantly smaller than a network producing 10-bit values.
In Fig. 8c,d we provide the area and power dissipation of different INT8 and INT3 implementations, varying the number of inputs. We set the synthesis tool to a timing constraint of 100 MHz, which is the maximum achievable frequency of the largest architecture (INT8, 32 inputs). The power dissipation was evaluated at this frequency.
Except for the PWL reciprocal block, the hardware resources are strictly related to the number of inputs and to the quantization. Moreover, it can be observed that the area required by the I/O registers, the FLP adder tree, and the array of subtractors doubles when the number of inputs is doubled.
Discussion
In this paper, we proposed a pseudo-softmax approximation of the softmax function and its hardware architecture. The approximation error, measured by the MSE, is smaller than that of other recently presented softmax approximations.
Moreover, the pseudo-softmax function preserves the properties of probability distributions, and its output values can be interpreted as probabilities.
Besides artificial NNs, the pseudo-softmax approximation method can be adapted to implement the Boltzmann action-selection policy used in Reinforcement Learning.
The pseudo-softmax architecture has been implemented in VHDL and synthesized in standard cells. The implementation results show that, although the area of the pseudo-softmax unit is larger than the unit in^{23}, its reduced input bit width can lead to an area reduction of the driving NN.
In a future extension of this work, the pseudosoftmax architecture could be rearranged to work with serial or mixed parallel–serial inputs. This would allow its hardware implementation in networks with a high number of output classes.
Change history
01 September 2021
A Correction to this paper has been published: https://doi.org/10.1038/s41598-021-97079-9
References
Bishop, C. M. Pattern Recognition and Machine Learning, Chapter 2 (Springer, 2006).
Capra, M. et al. An updated survey of efficient hardware architectures for accelerating deep convolutional neural networks. Future Internet 12, 113 (2020).
Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R. & Bengio, Y. Quantized neural networks: Training neural networks with low precision weights and activations. J. Mach. Learn. Res. 18, 6869–6898 (2017).
Guo, K., Zeng, S., Yu, J., Wang, Y. & Yang, H. [DL] A survey of FPGA-based neural network inference accelerators. ACM Trans. Reconfigurable Technol. Syst. (TRETS) 12, 1–26 (2019).
Alwani, M., Chen, H., Ferdman, M. & Milder, P. Fused-layer CNN accelerators. in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 1–12 (IEEE, 2016).
Cardarilli, G. C. et al. A parallel hardware implementation for 2D hierarchical clustering based on fuzzy logic. IEEE Trans. Circ. Syst. II Express Briefs. 68, 1428–1432 (2021).
Sze, V., Chen, Y.H., Yang, T.J. & Emer, J. S. Efficient processing of deep neural networks. Synth. Lect. Comput. Arch. 15, 1–341 (2020).
Anguita, D., Boni, A. & Ridella, S. A digital architecture for support vector machines: Theory, algorithm, and FPGA implementation. IEEE Trans. Neural Netw. 14, 993–1009 (2003).
Xilinx, Inc. Vitis AI User Guide—UG1414. Chapter 4, Version 1.3. (2021). https://www.xilinx.com/support/documentation/sw_manuals/vitis_ai/1_3/ug1414vitisai.pdf. Accessed 29 April 2021.
The MathWorks, Inc. Deep Learning HDL ToolboxUser’s Guide, 1.1 edn. Chapter 7, Version 1.1. (2021). https://it.mathworks.com/help/pdf_doc/deeplearninghdl/dlhdl_ug.pdf. Accessed 29 April 2021.
Yuan, B. Efficient hardware architecture of softmax layer in deep neural network. in 2016 29th IEEE International SystemonChip Conference (SOCC), 323–326 (IEEE, 2016).
Geng, X. et al. Hardwareaware softmax approximation for deep neural networks. in Asian Conference on Computer Vision, 107–122 (Springer, 2018).
Li, Z. et al. Efficient FPGA implementation of softmax function for DNN applications. in 2018 12th IEEE International Conference on Anti-counterfeiting, Security, and Identification (ASID), 212–216 (IEEE, 2018).
Volder, J. The CORDIC computing technique. in Papers Presented at the March 3–5, 1959, Western Joint Computer Conference, 257–261 (1959).
Baptista, D., Mostafa, S. S., Pereira, L., Sousa, L. & MorgadoDias, F. Implementation strategy of convolution neural networks on field programmable gate arrays for appliance classification using the voltage and current (VI) trajectory. Energies 11, 2460 (2018).
Wang, M., Lu, S., Zhu, D., Lin, J. & Wang, Z. A highspeed and lowcomplexity architecture for softmax function in deep learning. in 2018 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS), 223–226 (IEEE, 2018).
Sun, Q. et al. A high speed softmax VLSI architecture based on basic-split. in 2018 14th IEEE International Conference on Solid-State and Integrated Circuit Technology (ICSICT), 1–3 (IEEE, 2018).
Hu, R., Tian, B., Yin, S. & Wei, S. Efficient hardware architecture of softmax layer in deep neural network. in 2018 IEEE 23rd International Conference on Digital Signal Processing (DSP), 1–5 (IEEE, 2018).
Du, G. et al. Efficient softmax hardware architecture for deep neural networks. in Proceedings of the 2019 on Great Lakes Symposium on VLSI, 75–80 (2019).
Kouretas, I. & Paliouras, V. Simplified hardware implementation of the softmax activation function. in 2019 8th International Conference on Modern Circuits and Systems Technologies (MOCAST), 1–4 (IEEE, 2019).
Wang, K.Y., Huang, Y.D., Ho, Y.L. & Fang, W.C. A customized convolutional neural network design using improved softmax layer for realtime human emotion recognition. in 2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), 102–106 (IEEE, 2019).
Di Franco, F., Rametta, C., Russo, M. & Vaccaro, M. A hardware implementation of the softmax function on FPGA. In Proceedings of the International Conference for Young Researchers in Informatics, Mathematics, and Engineering 2020 Vol. 2768 (ed. Wozniak, M. C. G.) 21–25 (CEUR-WS, 2020).
Kouretas, I. & Paliouras, V. Hardware implementation of a softmaxlike function for deep learning. Technologies 8, 46 (2020).
Patterson, D. A. & Hennessy, J. L. Computer Organization and Design MIPS Edition: The Hardware/Software Interface. Chapter 3 6th edn. (Morgan Kaufmann, 2020).
Sutton, R. S. & Barto, A. G. Reinforcement Learning: An introduction (MIT Press, 2018).
Cardarilli, G. et al. An actionselection policy generator for reinforcement learning hardware accelerators. Lect. Notes Electr. Eng. 738, 267–272 (2021).
Spano, S. et al. An efficient hardware implementation of reinforcement learning: The Q-learning algorithm. IEEE Access 7, 186340–186351 (2019).
Russakovsky, O. et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. (IJCV) 115, 211–252. https://doi.org/10.1007/s11263-015-0816-y (2015).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778 (2016).
Simonyan, K. & Zisserman, A. Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556 (2014).
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the inception architecture for computer vision. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2818–2826 (2016).
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A. & Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4510–4520 (2018).
Author information
Contributions
G.C.C. supervised the research, the other authors contributed equally to this work.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
The original online version of this Article was revised: The original version of this Article contained an error in Figure 3 where panels (b) and (c) were incorrectly captured.
About this article
Cite this article
Cardarilli, G.C., Di Nunzio, L., Fazzolari, R. et al. A pseudo-softmax function for hardware-based high speed image classification. Sci Rep 11, 15307 (2021). https://doi.org/10.1038/s41598-021-94691-7