## Introduction

With the recent advances in machine vision and its applications, there is a growing demand for sensor hardware that is faster, more energy-efficient, and more sensitive than frame-based cameras, such as charge-coupled devices (CCDs) or complementary metal–oxide–semiconductor (CMOS) imagers1,2. Beyond event-based cameras (silicon retinas)3,4, which rely on conventional CMOS technology and have reached a high level of maturity, there is now increasing research on novel types of image acquisition and data pre-processing techniques5,6,7,8,9,10,11,12,13,14,15,16,17,18, with many of them emulating certain neurobiological functions of the human visual system.

One image pre-processing technique, that is being used since decades, is pixel binning. Binning is the process of combining the electric signals from $$K$$ adjacent detector elements into one larger pixel. This offers benefits such as (1) increased frame rate due to a $$K$$-fold reduction in the amount of output data, and (2) an up to $$K^{1/2}$$-fold improvement in signal-to-noise ratio (SNR) at low light levels or short exposure times19. The latter can be understood from the fact that dark noise is collected in normal mode for every detector element, but in binned mode only once per $$K$$ elements. Binning, however, comes at the expense of reduced spatial resolution or, in more general terms, loss of information. In pattern recognition applications this reduces the accuracy of the results even if the SNR is high.

Here, we push the concept of binning to its limit by combining a large fraction of the sensor elements into a “superpixel” whose optimal shape is determined from training data using a machine learning algorithm. We demonstrate the classification of optically projected images on an ultrashort timescale, with enhanced dynamic range and without loss of classification accuracy.

## Results and discussion

### Pixel binning

In Fig. 1 we schematically depict different types of binning and its impact on the classification accuracy of an artificial neural network (ANN). Besides the aforementioned conventional approach (orange lines), we also illustrate our concept of data-driven binning (green line). There, a substantial fraction of pixels are combined into a “superpixel” that extends over the whole face of the chip, thus forming a large-area photodetector with a complex geometrical structure that is determined from training data. For multi-class classification with one-hot encoding, one such superpixel is required for each class. As for conventional binning, the system becomes more resilient towards noise and its dynamic range increases. However, for large light intensities there is no loss of classification accuracy and hence no compromise in performance, in contrast to the conventional case. These benefits come at the cost of less flexibility, as a custom configuration/design is required for each specific application.

### Photosensor implementation

Figure 2a shows a schematic of our photosensor, employing data-driven binning. A microscope photograph of the actual device implementation is shown in Fig. 2b. For details regarding the fabrication, we refer to the “Methods” section. The device consists of $$N$$ pixels, arranged in a two-dimensional array. Each pixel is divided into at most $$M$$ subpixels that are connected–binned–together to form the $$M$$ superpixels, whose output currents are measured. Each detector element is composed of a GaAs Schottky photodiode (Fig. 2c) that is operated under short-circuit conditions (Fig. 2d) and exhibits a photoresponsivity of $$R = I_{{{\text{SC}}}} /P \approx$$ 0.1 A/W, where $$I_{{{\text{SC}}}}$$ is the photocurrent and $$P$$ the incident optical power. GaAs was chosen because of its short absorption and diffusion lengths, which both reduce undesired cross-talk between adjacent pixels; with some minor modifications the sensor can also be realized using Si instead of GaAs. The design parameters, that depend on the specific classification task and are determined from training data, are the geometrical fill factors $$f_{{{\text{mn}}}} = A_{{{\text{mn}}}} /A$$ for each of the subpixels, where $$A_{{{\text{mn}}}}$$ denotes the subpixel area and $$A$$ is the total area of each pixel. From Fig. 2a, we find for the $$m$$ output currents $$I_{{\text{m}}} = R\mathop \sum \limits_{{{\text{n}} = 1}}^{{\text{N}}} f_{{{\text{mn}}}} P_{{\text{n}}}$$, or

$${\mathbf{i}} = R{\mathbf{Fp}},$$
(1)

with $${\mathbf{p}} = \left( {P_{1} ,P_{2} , \ldots ,P_{{\text{N}}} } \right)^{T}$$ being a vector that represents the optical image projected onto the chip, $${\mathbf{i}} = \left( {I_{1} ,I_{2} , \ldots ,I_{{\text{M}}} } \right)^{T}$$ the output current vector, and $${\mathbf{F}} = \left( {f_{{{\text{mn}}}} } \right)_{{{\text{M}} \times {\text{N}}}}$$ a fill factor matrix that depends on the specific application. The $$m$$-th row of $${\mathbf{F}}$$ is a vector $${\mathbf{f}}_{{\text{m}}} = \left( {f_{{{\text{m}}1}} ,f_{{{\text{m}}2}} , \ldots ,f_{{{\text{m}}N}} } \right)^{T}$$ that represents the geometrical shape of the $$m$$-th superpixel.

### Naïve Bayes photosensor

Let us now discuss how to design the fill factor matrix for a specific image recognition problem. As an instructive example, we present the classification of handwritten digits (‘0’, ‘1’, …, ’9’) from the MNIST dataset20 by evaluating the posterior $${\mathbb{P}}\left( {y_{{\text{m}}} {|}{\mathbf{p}}} \right)$$ (the probability $${\mathbb{P}}$$ of an image $${\mathbf{p}}$$ being a particular digit $$y_{{\text{m}}}$$) for all classes and selecting the most probable outcome. By applying Bayes' theorem and further assuming that the features (pixels) are conditionally independent, one can derive a predictor of the form $$\hat{y}_{{\text{m}}} = {\text{arg max}}_{{{\text{m}} \in \left\{ {1 \ldots {\text{M}}} \right\}}} {\mathbb{P}}\left( {y_{{\text{m}}} } \right) \mathop \prod \limits_{{{\text{n}} = 1}}^{{\text{N}}} {\mathbb{P}}\left( {P_{{\text{n}}} {|}y_{{\text{m}}} } \right)$$, known as Naïve Bayes (NB) classifier21,22. We use a multinomial event model $${\mathbb{P}}\left( {P_{{\text{n}}} {|}y_{{\text{m}}} } \right) = \pi_{{{\text{mn}}}}^{{P_{{\text{n}}} }}$$, where $$\pi_{{{\text{mn}}}}$$ is the probability that the $$n$$-th pixel for a given class $$y_{{\text{m}}}$$ exhibits a certain brightness and express the result in log-space to obtain a linear discriminant function

$$\hat{y}_{{\text{m}}} = \mathop {\text{arg max}}\limits_{{{\text{m}} \in \left\{ {1 \ldots {\text{M}}} \right\}}} \left( {{\mathbf{Wp}} + {\mathbf{b}}} \right)_{{\text{m}}}$$
(2)

with weights $$w_{{{\text{mn}}}} = \log \pi_{{{\text{mn}}}}$$. The bias terms $$b_{{\text{m}}} = \log {\mathbb{P}}\left( {y_{{\text{m}}} } \right)$$ can be omitted ($${\mathbf{b}} = 0$$), as all classes are equiprobable. The similarity to Eq. (1) allows us to map the algorithm onto our device architecture: $${\mathbf{F}} \propto {\mathbf{W}}$$. To match the calculated $$w_{{{\text{mn}}}}$$-value range to the physical constraints of the hardware implementation,

$$0 \le f_{{{\text{mn}}}} \le 1 \ {\text{ and }} \ \mathop \sum \limits_{{\text{m}}} f_{{{\text{mn}}}} \le 1,$$
(3)

we normalize the weights according to

$$f_{{{\text{mn}}}} = \frac{{w_{{{\text{mn}}}} - \min w_{{{\text{mn}}}} }}{{\max \mathop \sum \nolimits_{{\text{m}}} (w_{{{\text{mn}}}} - \min w_{{{\text{mn}}}} )}}.$$
(4)

In Fig. 3a we exemplify the working principle of the photosensor. A sample $${\mathbf{p}}$$ from the MNIST dataset is optically projected onto the chip using the measurement setup shown in Fig. 3b (see “Methods” section for experimental details). Each of the $$M$$ superpixels generates a photocurrent $$I_{{\text{m}}}$$ proportional to the inner product $${\mathbf{f}}_{{\text{m}}}^{T} {\mathbf{p}}$$. If we visualize $${\mathbf{f}}_{{\text{m}}}$$ for each class (Fig. 3c), we obtain an intuitive result: The shape of each superpixel resembles that of the average-looking digit for the respective class. It is apparent that the superpixel with the largest spatial overlap with the image delivers the highest photocurrent.

Figure 3e shows experimental photocurrent maps for the device in Fig. 2b. Here, each pixel of the sensor is illuminated individually and the output currents are recorded. The currents are proportional to the designed fill factors in Fig. 3c (apart from device imperfections such as broken lithographic connections), confirming negligible cross-talk between neighbouring subpixels. To evaluate the performance, we projected all 104 digits from the MNIST test dataset and recorded the sensor’s predictions. The classification results are presented as a confusion matrix in Fig. 3f. The chip is able to classify digits with an accuracy that closely matches the theoretical result in Fig. 3d.

### Artificial neural network photosensor

Beyond the instructive example of NB, the same device structure also allows the implementation of other, more accurate, classifiers. Specifically, we present the design and simulation results for a single-layer ANN21 for the same MNIST classification task as discussed before. In Fig. 4a the architecture of the network is shown. It makes its predictions according to

$$\hat{y}_{{\text{m}}} = \mathop {\text{arg max}}\limits_{{{\text{m}} \in \left\{ {1 \ldots {\text{M}}} \right\}}} \sigma \left( {{\mathbf{Wp}} + {\mathbf{b}}} \right)_{{\text{m}}}$$
(5)

Note the similarity to Eq. (2), apart from a nonlinearity $$\sigma$$ which can be readily implemented, either in the analogue or the digital domain, using external electronics. We choose a softmax activation function for $$\sigma$$. Again, due to the physical constraints of the sensor hardware, we train the network with bias $${\mathbf{b}} = 0$$ using categorical cross-entropy loss. In order to obey Eq. (3), we further introduce a constraint that enforces a non-negative weight matrix $${\mathbf{W}}$$ by performing the following regularization after each training step:

$${\mathbf{W}} \leftarrow {\mathbf{W}} \odot \theta \left( {\mathbf{W}} \right),$$
(6)

with $$\odot$$ denoting the Hadamard product and $$\theta$$ the Heaviside step function. This leads to a < 1% penalty in accuracy.

The fill factor matrix $${\mathbf{F}}$$, plotted in Fig. 4d, is directly related to $${\mathbf{W}}$$ by a geometrical scaling factor. Although the superpixel shapes do not clearly resemble the handwritten digits, the ANN shows better performance than the NB classifier, as demonstrated by the confusion matrix in Fig. 4b. In addition, the ANN shows a larger spread between the highest and all other output currents (Fig. 4c), which makes it more robust against noise (Supplementary Figure S2). A number of other machine learning algorithms can be described by an equation of the form (5) and can be implemented in a similar fashion. Also the realization of an all-analogue deep-learning network is feasible by feeding the sensor output into a memristor crossbar array24,25.

### Benefits of data-driven binning

In Fig. 5 we demonstrate the benefits of data-driven binning. It is evident that the readout of $$M$$ photodetector signals requires less time, resources, and energy than the readout of the whole image in a conventional image sensor. In fact, the photodiode array itself does not consume any energy at all; energy is only consumed by the electronic circuit that selects the highest photocurrent. Pattern recognition and classification occur in real-time and are only limited by the physics of the photocurrent generation and/or the electrical bandwidth of the data acquisition system. This is demonstrated in Fig. 5a, where we show the correct classification of an image on a nanosecond timescale, limited by the bandwidth of the used amplifier.

Furthermore, it is known that binning can offer an $$K^{1/2}$$-fold improvement in SNR19. In our case, a substantial fraction $$\xi$$ ($$\sim$$ 0.6 for NB) of all sensor pixels are binned together ($$K = \xi N$$), with each pixel being split into $$M$$ elements. Together, this results in a $$\left( {\xi N} \right)^{1/2} /M$$-fold SNR gain over the unbinned case. To characterize the noise performance, we performed binary image classification (NB, MNIST, ‘0’ versus ‘1’) at different light intensities. For the reference measurements, we projected the images sequentially, pixel by pixel, onto a single GaAs Schottky photodetector (fabricated on the same wafer and with an area identical to that of two subpixels), recorded the photocurrents, and performed the classification task in a computer. In the simulations, Gaussian noise was added by drawing random samples from a normal distribution $${\mathcal{N}}\left( {0,\sigma^{2} } \right)$$ with zero mean value. The noise was added once per superpixel in the data-driven case, and per each pixel in the reference case. $$\sigma$$ was used as a single fitting parameter to reproduce all experimental results. The results are presented in Fig. 5b. The classification accuracy is affected by the amplifier noise. For large intensities, the system operates with its designed accuracy. As the intensity is decreased, the classification accuracy drops and eventually, when the noise dominates over the signal, reaches the baseline of random guessing. Our device, employing data-driven binning, can perform this task at lower light intensities than the reference device without binning.

## Conclusions

We conclude with proposed routes for future research. The main limitation of our current device implementation is its lack of reconfigurability. While this may be appropriate in some cases (e.g. a dedicated spectroscopic application), reconfigurability of the sensor would in general be preferred. This may, for example, be achieved by employing photodetectors with tunable responsivities, or a programmable network based on a nonvolatile memory material26,27,28 to bin individual pixels together. Other schemes than standard one-hot encoding may allow to save hardware resources and extend the dynamic range further. Possible applications of our technology include industrial image recognition systems that require high-speed identification of simple objects or patterns, as well as optical spectroscopy, where the incoming light is dispersed into its different colors and the sensor is trained to recognize certain spectral features. In both cases classical machine learning algorithms will provide sufficient complexity and sophistication for the approximation of the dataset.

## Methods

### Device fabrication

Device fabrication started with the growth of a 400 nm thick $${\mathrm{n}}^{-}$$-doped ($${10}^{16}$$ $${\mathrm{cm}}^{-3}$$) GaAs epilayer by molecular beam epitaxy on a highly $${\mathrm{n}}^{+}$$-doped GaAs substrate. An ohmic contact on the $${\mathrm{n}}^{+}$$-side was defined by evaporation of Ge/Au/Ni/Au (15 nm/30 nm/14 nm/300 nm) and sample heating at 440 °C for 30 s. On the $${\mathrm{n}}^{-}$$-GaAs epilayer we deposited a 20 nm thick Al2O3 insulating layer by atomic layer deposition (ALD). We then defined a first metal layer (M1) by electron-beam lithography (EBL) and Ti/Au (3 nm/25 nm) evaporation. In the next step we deposited a 30 nm thick Al2O3 layer by ALD. We then defined an etch mask for the via holes, which connect metal layers M1 and M2, by EBL and etched the Al2O3 with 30% potassium hydroxide (KOH) aqueous solution. We then wrote an etch mask for the pixel windows via EBL and etched the aggregated 50 nm thick Al2O3 with a 30% KOH aqueous solution in two steps. Inside the pixel windows, we defined the subpixels with EBL by removing the naturally formed oxide on the GaAs substrate with a 37% hydrochloric acid (HCl) aqueous solution and evaporating 7 nm thick semitransparent Au. Finally, we defined the M2 metal layer with EBL and Ti/Au (5 nm/80 nm) evaporation. The continuity and solidity of the device was confirmed by scanning electron microscopy and electrical measurements.

### Experimental setup

A schematic of the experimental setup is shown in Fig. 3b. A light-emitting diode (LED) source (625 nm wavelength) illuminates, through a linear polarizer, a spatial light modulator (SLM). The SLM is operated in intensity-modulation mode and changes the polarization of the reflected light according to the displayed image. The reflected light is then filtered using a second linear polarizer, and the image is projected onto the chip. The photocurrents generated by the sensor are probed with a needle array, selected by a Keithley switch matrix and measured with a Keithley source measurement unit. For time-resolved measurements a pulsed laser source (522 nm wavelength, 40 ns) is used. Here, the output signals are amplified with a high-bandwidth (20 MHz) transimpedance amplifier. The pulsed laser source is triggered with a signal generator and an oscilloscope is used to record the time trace.