Spiking neurons with short-term synaptic plasticity form superior generative networks

Spiking networks that perform probabilistic inference have been proposed both as models of cortical computation and as candidates for solving problems in machine learning. However, the evidence for spike-based computation being in any way superior to non-spiking alternatives remains scarce. We propose that short-term synaptic plasticity can provide spiking networks with distinct computational advantages compared to their classical counterparts. When learning from high-dimensional, diverse datasets, deep attractors in the energy landscape often cause mixing problems to the sampling process. Classical algorithms solve this problem by employing various tempering techniques, which are both computationally demanding and require global state updates. We demonstrate how similar results can be achieved in spiking networks endowed with local short-term synaptic plasticity. Additionally, we discuss how these networks can even outperform tempering-based approaches when the training data is imbalanced. We thereby uncover a powerful computational property of the biologically inspired, local, spike-triggered synaptic dynamics based simply on a limited pool of synaptic resources, which enables them to deal with complex sensory data.

Boltzmann machines and spiking networks. Among the neural networks proposed as generative models for high-dimensional input, Boltzmann machines (BMs) 23 are arguably the most prominent [24][25][26][27] . Neurons in BMs are binary units with states z k ∈ {0, 1}. These states are typically updated in a sequential schedule in a way that implements Gibbs sampling from a target Boltzmann distribution with the inverse temperature β ∈ (0, 1], partition function Z and the energy function E(z) = −z T Wz/2 − z T b parametrized by the weight matrix W and bias vector b. This is achieved by having each neuron compute a local "membrane potential" as the log-odds of its conditional firing probability, which for the Boltzmann distribution is equal to a weighted sum over input activities: Consequently, state updates are computed using a logistic activation function p(z k = 1) = [1 + exp(u k )] −1 =: σ(u k ).
In a restricted Boltzmann machine (RBM), the units are subdivided into a visible and a hidden layer, with no within-layer connections (Fig. 1a). To enable classification, an additional label layer can be added, which for training purposes can be treated as part of the visible layer. During training, weights and biases are iteratively updated in order to optimize the marginal distribution p(v, l|h) as the underlying distribution for the set of training samples.
Recently, it has been shown how networks of spiking neurons can perform equivalent computations 28 , which we briefly outline in the following. The building blocks for our spiking networks are conductance-based leaky integrate-and-fire (LIF) neurons, with membrane potential dynamics governed by where C m is the membrane capacitance, g l and E l the leak conductance and potential, and I syn the synaptic current. Neurons fire upon reaching a threshold voltage, which causes the membrane to be clamped to a reset potential for a duration equal to the refractory period τ ref . The synaptic current is modeled as a sum of exponential kernels triggered by presynaptic spikes s with a synaptic time constant τ syn and weighted by synaptic weights W i (t) and reversal potentials E i rev : (b) Interpretation of states as samples in a spiking network. A neuron with a freely evolving membrane potential is said to be in the state z k = 0 and switches to the state z k = 1 upon firing, where it stays for the duration of the refractory period. (c) In order to correctly sample from a posterior distribution, a network needs to be able to mix, i.e., traverse barriers between low-energy basins. To facilitate mixing, tempering methods globally rescale the energy landscape with an inverse temperature (top). In contrast, STP can be viewed as only modulating the energy landscape locally, thereby only affecting the currently active attractor (bottom The temporal dependence of the synaptic weights accounts for the STP mechanism we discuss later. Each neuron receives both functional synaptic input from other neurons within the network and diffuse background input from external neurons that can be modeled as Poisson spike trains. The latter causes the neuron to fire stochastically. Since, at the level of spikes, the output of a neuron can be considered binary, we associate a binary random variable z k to each neuron. As a neuron never fires within the refractory period, it is natural to set and 0 otherwise (Fig. 1b). For constant functional synaptic input as defined above, the mean firing rate of such a neuron is proportional to its activation function p(z k = 1). By applying strong background input, we lift neurons into a high-conductance state (HCS) 29 , which molds their activation function into an approximately logistic shape 30 : with scaling parameters α and β, where u k f represents the functional, i.e., background-free, membrane potential. Similarly to Gibbs sampling, the functional membrane potential thereby fulfills the local computability condition (equation 2), which is a sufficient computational prerequisite for sampling in neural networks 23,31 . The scaling parameters can be derived analytically and allow a direct translation of the BM parameters W and b to the corresponding parameters in the biological domain (Methods, Sec. 1).

Training.
To speed up training, we used RBMs with binary units. As a learning algorithm, we used the coupled adaptive simulated tempering (CAST) method 16 , which is a version of the wake-sleep algorithm. In CAST, two instances of the RBM are simulated in parallel, with one of them staying at a constant inverse temperature β = 1 for parameter update using persistent contrastive divergence 32 (slow chain) and the other one using adaptive simulated tempering (AST) 16 for mixing (fast chain). The states of the two RBMs are swapped constantly to help the slow chain jump out of local minima during parameter updating. In AST, states z (t+1) are updated by Gibbs sampling from β | p z ( ) T t ( ) . After each state update, the temperature is itself updated by an adaptive rule that ensures the algorithm spends a roughly equal amount of time at each value β T . Details of the AST algorithm are described in Table 1. The training hyperparameters for different experiments can be found in Methods Sec. 2.
The trained RBM parameters are then mapped to the spiking-network domain as described below 28 (see Methods Sec. 1 for more details): where W kj denotes the peak synaptic conductance (see equation 4), C m the membrane capacitance, w kj the abstract Boltzmann weight, E kj rev the corresponding reversal potential, μ the mean free membrane potential, τ syn the synaptic time constant and τ eff = C m /〈g tot 〉 the (mean) effective membrane time constant.
Tempering vs. short-term plasticity. When trained from data, the energy landscape E(z) is shaped in a way that assigns low energy values (modes) to the samples in the training data. If this dataset is composed of very dissimilar classes, training algorithms tend to separate them by high energy barriers 16,33 . As their height grows during training, Gibbs sampling becomes increasingly ineffective at covering the entire relevant state space, as reflected by a high correlation between consecutive samples caused by the component-wise update of states 16,17,33,34 . Consequently, a BM would need longer to converge towards its underlying distribution. This problem becomes particularly inconvenient when dealing with complex, real-world data, or when an agent must rely on the prediction of the network to make a fast decision.
The ability of a sampling-based generative model to jump across energy barriers, also known as mixing, has therefore received significant attention 16,17,35,36 . Many of these methods rely on some version of simulated 1: Given adaptive weights = g { } k k K 1 and the initial configuration of the state z 1 at temperature 1, k = 1:

3:
Given z t , sample a new state z t+1 from p(z|k t ) by Gibbs sampling.

4:
Given k t , sample k t+1 from proposal distribution q(k t+1 ←k t ). Accept with probability: min 1, p z t k t q k t k t g k t p z t k n q k t k t g k t tempering, which modifies the temperature parameter β T in order to globally flatten the network's energy landscape (Fig. 1c). Therefore, in addition to conventional Gibbs sampling, we use the AST algorithm 16 as a benchmark for our spiking networks (Methods, Sec. 2). While greatly increasing the mixing capabilities of generative networks, it is important to note that all tempering schedules come with a cost of their own, both because they require additional computations and because they only gather valid samples at low temperatures (β ≈ 1), thereby effectively slowing down the sampling process. Furthermore, they require parameter changes that assume knowledge about the global state of the network, which is difficult to reconcile with biology. This motivates the search for a local update rule that has biological relevance, improves mixing and can be embedded in spiking networks.
In biological neural networks, the momentary synaptic interaction strength is reflected in the size of the elicited postsynaptic potential (PSP). In dynamic synapses, this value may change over time depending on the presynaptic activity. To model this dependence, we use the Tsodyks-Markram model of STP 37 : Here, w represents the (static) synaptic weight and U ∈ [0, 1] the utilized fraction of available synaptic resources R ∈ [0, 1]. Upon arrival of a presynaptic spike at time t s , the synapse is depressed by subtracting U from R, which recovers exponentially with the time constant τ rec . Facilitation is modeled by a simultaneous increase in U, followed by an exponential decay with time constant τ fac .
Since both tempering and STP effectively modify the energy landscape by changing network parameters during sampling, they clearly bear some conceptual resemblance. However, while tempering simultaneously affects all synaptic weights, STP only affects the efferent connections of those neurons that are simultaneously active at a given moment in time. Therefore, in contrast to the global modifications of the energy landscape incurred by tempering, STP has a more local effect, as sketched in Fig. 1c. Note that this effect is not equivalent to neuronal adaptation, because it does not prohibit neurons from remaining active over extended periods, which is essential for generating consecutive patterns with significant overlap.

Results
We study the effects of STP on the performance of spiking networks trained for different tasks. We start by discussing how STP can improve the sampling accuracy of small networks configured to sample from a fully specified target distribution, even when the energy landscape is shallow enough to not cause mixing problems. This is no longer the case for hierarchical networks trained directly on data, for which we study the influence of STP on both their generative and their discriminative properties. Furthermore, we show how STP can aid pattern completion in a network trained on a highly imbalanced dataset. For all spiking network simulations, we used NEST 38 with PyNN 39 as a front-end.
Sampling from a fully specified target distribution. By modulating synaptic interactions, STP shapes the sampled distribution. This can be helpful when a spiking network needs to approximate a distribution that is otherwise incompatible with biological neuro-synaptic dynamics, as we discuss in the following.
Consider the case where the target distribution of the spiking network is a Boltzmann distribution. When a neuron needs to continuously represent a state z k (t) = 1 for an extended period, it fires a sequence of n spikes at maximum frequency 1/τ ref . Following equation 2, the resulting PSPs should increase a postsynaptic neuron's membrane by a constant Δu i = w ik , which implies a rectangular PSP shape. However, this is not a realistic shape for a more biologically plausible scenario, where PSPs have an exponentially shaped decay. This causes them to accumulate (Fig. 1d), such that the average increment 〈Δu i 〉 n becomes a function of the burst length n, thereby distorting the sampled distribution.
Synaptic depression can mitigate this effect (Fig. 2a) by causing a gradual decrease in the amplitude of consecutive PSPs. Indeed, when sweeping over the (U 0 , τ rec , τ fac ) parameter space (Fig. 2b), we find that an optimal reproduction of the target distribution is achieved for 15 ms rec τ ≈ , which is close to the synaptic time constant of τ syn = 10 ms. This affords an intuitive explanation: In the HCS, the effective membrane time constant becomes small and τ syn dominates the PSP decay. If the recovery of synaptic resources R (equation 8) happens at the same speed as the PSP decay, the STP mechanism essentially emulates a renewing synapse with an approximately constant running average (Fig. 1d). The slightly larger optimal recovery time constant further compensates for the long tails of exponential PSPs, which potentiate interaction strengths compared to the ideal case of rectangular PSPs (Methods, Sec. 1). Note that the manifold for which the target distribution is close-to-optimally reproduced contains many different STP configurations, including the range of biologically observed parameters 40 , but not the (u, τ rec , τ fac ) = (1, 0, 0) triplet for static synapses (Fig. 2b).
For the example in Fig. 2a, we used a fully specified target distribution p B (z|W, b); training was not needed, as synaptic weights can be computed directly from the parameters W and b (Eqn. 10). Here, we used a target Boltzmann distribution with randomly drawn parameters that produce a diverse energy landscape, but not so rough as to create problems with mixing. This changes when the network parameters are learned from data, as we discuss in the following.
Mixing in a simple learning scenario. Borrowing from observations in the early visual system, we generated images of oriented bars. The bars were positioned in a way that gave rise to an "easy" (overlapping) and a SCIentIfIC REPORTS | (2018) 8:10651 | DOI:10.1038/s41598-018-28999-2 "hard" (non-overlapping) dataset (Fig. 2c). We then trained a two-layer hierarchical network (400 visible, 30 hidden units) on each of these datasets using CAST 16 . Intuitively, the difficulty of learning a generative model of this data increases when the bars have little or no overlap: in this case, training gives rise to three nearly disjoint populations that have, on average, excitatory connections within and inhibitory connections between them. The emergence of such a population-based winner-take-all structure can be characterized by the mean interaction between two population activity vectors z i and z j 〈 〉, which represent the average network activity during the presentation of the i th and j th input pattern, respectively. For the easy dataset, learning gave rise to a mean within-population interaction strength of = . ij i j for the hard dataset, reflecting the increased competition and disjointedness between the three emerging populations. STP, however, can weaken active synapses, temporarily reducing w to enable switching between attractors.
The learned parameter set was used to compare the performance of classical Gibbs sampling and STP-endowed spiking networks (Fig. 2c). For the easy dataset, both the Gibbs sampler and the spiking network were able to mix, although the former spent on average 100 times longer in the same mode before switching, thereby requiring more time to converge to the target distribution. For the hard dataset, the spiking networks retained their ability to mix, whereas Gibbs sampling was unable to leave the (randomly initialized) local mode. These observations mirror those found in studies of cortical attractor networks 41 . While this simple experimental setup was specifically designed to illustrate the potential problems of sampling-based generative models and the ability of STP-endowed spiking networks to circumvent them, we show in the following that these properties are preserved in more complex scenarios.
Generation and classification of handwritten digits. The problem of mixing becomes even more pronounced when dealing with larger, more complex datasets. Here, we trained a hierarchical 3-layer network with 784 visible, 600 hidden and 10 label units on handwritten digits from the MNIST dataset 42 . By treating the label units as part of the visible layer during training, we simultaneously trained a generative and a discriminative model of the data. This objective is particularly challenging, because mechanisms that improve mixing tend to disrupt classification and vice-versa 17 .
To evaluate the quality of generated samples, we computed a log-likelihood estimation of 2000 test images (not used during training) using the indirect sampling likelihood (ISL) method 33,34 (see also Methods). Due to the size of the network, a full scan of the parameter space for finding optimal STP parameters was no longer feasible. Therefore, starting from a good parameter set found by trial and error, we performed two 2D-scans of the (U 0 τ rec , τ fac ) parameter space (Fig. 3a). As in the previous examples, we found short-term depression to be essential for achieving high ISL values. Furthermore, a small value of U 0 combined with short-term facilitation was also beneficial, allowing an initial strengthening followed by a weakening of the active attractor, as sketched in Fig. 1c,d. Similar observations have been made in cortex, where STP can promote the enhancement of transients 43 .
We used one of the optimal STP parameter sets (U 0 = 0.01, τ rec = 280 ms) to compare the generative performance of spiking networks to classical Gibbs sampling. Due to its improved mixing capability, the spiking network was able to quickly cover a large portion of the relevant state space, as reflected by a faster ISL gain during sampling (Fig. 3b). This is a systematic effect and only weakly dependent on initial conditions, as can be seen in Fig. 3c, which shows a histogram over 100 random seeds. For this comparison, we chose a sampling duration of 10 s as a conservative estimate for the maximum duration for a biological agent to experience stable stimulus conditions and therefore sample from a stable target distribution. The faster mixing is the result of the spiking network's ability to jump out of local attractors, which is reflected in a much shorter time spent on average within Depressing synapses (bottom) allow the network to come much closer to the target distribution (blue) than non-plastic ones (top). (b) Kullback-Leibler divergence between sampled (p N ) and target (p B ) distribution of a spiking network with 10 neurons (5 hidden, 5 visible) for different STP parameters (U 0 , τ rec , τ fac ). Note that many different parameter combinations lead to close to optimal (white cross) sampling, but static synapses (black cross) are not among them. (c) Left: Training data for the easy (top) and hard (bottom) learning scenario. The 3 images from the training set, each containing a single oriented bar, are superimposed to highlight the overlap of the oriented bars (or lack thereof). Right: Sequence of images generated by a Gibbs sampler and an STP-endowed spiking network with equivalent parameters W and b. For each method, 20 samples are taken from 5000 consecutively generated images with equal interval. the same mode (Fig. 3d). Here, we defined a mode as the dominant class of the currently represented image; a mode was therefore defined by the identity of the neuron in the label layer with the highest firing rate.
It is important to note that, due to the STP-modulated interaction, the spiking network does not sample from the exact same distribution as the Gibbs sampler, despite using an equivalent (W, b) parameter set. However, for a very large number of samples (>10 5 ), the two methods converge towards the same ISL (Fig. 3b), indicating that the discrepancy in performance for shorter sampling durations is not due to a fundamental difference in their respective ground truths.
While the ISL, as an abstract quantity, provides a useful numerical gauge of the quality of a generative model, a direct depiction of the produced images is particularly instructive. Here, we used the t-distributed stochastic neighbor embedding (tSNE) method 44 (see also Methods) for a 2D-embedding of the high-dimensional sampled distribution. The similarity between samples is largely reflected by their 2D distance and a large jump can be interpreted as a switch between attractors. As seen in Fig. 3e, the spiking network produces a significantly more diverse set of samples compared to the Gibbs sampler.
When the visible layer is clamped to a particular input, the same network can be used as a discriminative model of the learned data. Using the same parameters as for the generative task, the benchmark Gibbs sampler obtained a classification accuracy of 93.4% on the MNIST test data. The spiking network with STP performed only slightly worse, at 93.2%. The additional generative capabilites gained by the spiking networks through STP were therefore not strongly detrimental to their classification accuracy.
Modeling imbalanced datasets. In many real-world scenarios, the available data is imbalanced, with much of the data belonging to one class and significantly less samples being distributed over others. It is well-known that imbalanced data can cause severe problems for data mining and classification 45,46 . One solution is to create a more balanced dataset from the imbalanced one, which can be achieved by methods such as under-or over-sampling 46,47 . However, such an a-priori modification of the input data does not seem biologically plausible. Still, cognitive biological agents appear to easily overcome this problem: humans will have little difficulty imagining a platypus from seeing only its bill, despite having likely seen many more ducks throughout their lifetime. Spiking networks with STP provide a simple solution to the problem of imbalanced training data, without any need for preprocessing.
We generated an imbalanced dataset of 1000 images by randomly selecting 820 digits of class "1" and 45 from the "0", "2", "3" and "8" classes. After training, we compared the generative output of a Gibbs sampler, an AST sampler and a spiking network with STP. Note that the effective sampling speed of AST is roughly 20 times slower compared to Gibbs sampling, since most of the produced samples are not considered valid. In this scenario, it becomes particularly useful that the spiking network transiently modifies the learned data distribution (Fig. 4a). The STP-induced weakening of active attractors balances out their activity, thereby negating the inherent imbalance induced by the training data. Furthermore, as observed before, the spiking network switches faster between modes (Fig. 4b,c).
These abilities become particularly useful in a scenario of inference based on incomplete information, for which pattern completion is a prime example. Here, we used a training set with 6 majority classes ("0", "1", "2", "3", "4", "6", 800 samples each) and one minority class (200 samples of "5"). We generated an ambiguous image by clamping the lower half of the visible layer to a configuration compatible with both a "3" and a "5" (Fig. 4d). While Gibbs and AST strongly undersample the minority class, the spiking network produces a much more balanced set of images, with swift transitions between modes (Fig. 4e,f). The estimate of the possible realities underlying the incomplete observation is therefore improved both on long and on short time scales. This can be particularly useful for an agent in need of a quick reaction, as, for example, often required in nature in a fight-or-flight scenario.

Discussion
We have shown how a combination of spike-based communication and short-term plasticity can enhance the ability of neural networks to perform probabilistic inference in high-dimensional data spaces. Here, a spike-triggered plasticity rule played a similar role to simulated tempering methods used for classical neural networks, but without requiring complex computations on the global network state or long waiting times between valid samples. The spiking networks outperformed their classical counterparts as generative models of real-world data, with little disturbance to their classification capability, which we expect to be largely remediable by additional fine-tuning of the network parameters. Furthermore, they were also able to cope with imbalanced training data, as demonstrated by their superior performance in a pattern completion task on ambiguous input. Intriguingly, the synaptic parameters used to achieve this performance are compatible to experimental data 40 .
As a potential downside of the functional gains discussed in the manuscript, the inclusion of more complex membrane and synapse dynamics are likely to increase the computational cost of applying our paradigm to classical neural networks such as Boltzmann machines. However, we expect a simple, local synaptic update rule to be overall more efficient than global updates required by, e.g, tempering schedules. Furthermore, the increased performance should be largely independent of the particular shape of a PSP and thereby of the used neuron model. Moreover, in physical neuromorphic emulation, added complexity in neural dynamics incurs no runtime penalty 21,22 .
Depending on the nature of the sampled distribution and the associated optimal parameters, STP can play a dual role. For low-dimensional spaces in which networks only rarely have mixing problems, STP can narrow the gap between the sampled and the target distribution. On the other hand, in large networks trained from data, both the improved mixing and the balancing effect represent functionally advantageous distortions of the network's underlying Boltzmann distribution. The deviations are a consequence of both the dynamic nature of plastic weights as well as due to emerging asymmetries in the connectivity matrix.
The STP parameters themselves require only little tuning, as evidenced by the comparatively large volume in parameter space that enhances performance, especially for high-dimensional problems. However, the optimal parameter set may vary, depending on the nature of the learned problem. In a machine learning context, various algorithms for meta-paremeter optimization have been proposed and could be applied to STP as well [48][49][50] . With respect to biology, as the function and location of individual brain areas remains largely conserved both within and among species, we speculate evolution to have played a key role in parameter optimization.
In fact, it was suggested that during a working memory task studied in vivo with rats, short-term synaptic depression in the medial prefrontal cortex sets the "life time" of high-dimensional neuronal assemblies that code for the integrated representation of position and sensory inputs 51 . While the rat navigates in a maze, the representation moves from one assembly to another on a time scale that roughly corresponds to the one of synaptic depression. Short-term synaptic plasticity, originally found in rat sensory cortices 52 , has also been found in the prefrontal cortex using paired recordings in vitro 53 .
In a physical system such as a biological brain, the studied plasticity mechanism essentially comes for free, as it only requires a limited pool of synaptic resources. Together with other activity-modulating mechanisms such as neuronal adaptation, it could be a key contributor to the ability of the brain to navigate efficiently in a very-high-dimensional stimulus space. Importantly, these mechanisms provide immediate computational advantages for spike-based neuromorphic devices, facilitating the development of efficient artificial agents that replicate the inferential capabilities of their biological archetypes.

Methods
Spiking networks. To generate our spiking sampling networks, we follow 28 . We emulate an HCS by stimulating the LIF neurons (Conductance-based (COBA), Table 2) with balanced excitatory and inhibitory Poisson noise This produces an approximately logistic activation function (Fig. 5a), parametrized by a shift β and a scaling parameter α (equation 5). These parameters can be used to translate synaptic interaction strengths from the Boltzmann domain to synaptic conductances: where W kj denotes the peak synaptic conductance (see equation 4), C m the membrane capacitance, w kj the abstract Boltzmann weight, E kj rev the corresponding reversal potential, μ the mean free membrane potential, τ syn the synaptic time constant and τ = C g / eff m tot the (mean) effective membrane time constant. This corresponds to a match of the average interaction during the refractory period of the presynaptic neuron (Fig. 5b). This setup allows an accurate sampling from target Boltzmann distributions (Fig. 5c,d).
To speed up simulations, we used an effective current-based (CUBA) model to replace the COBA one ( Table 2). Figure 5e shows a comparison between the two models. Under appropriate parametrization, we could reduce the background input rates from ν = 5 kHz to ν = 0.4 kHz.

Indirect sampling likelihood.
To have a quantitative comparison of mixing between different sampling procedures, we used the indirect sampling likelihood (ISL) method 33,34 . The method constructs a non-parametric density estimator to evaluate how close each test example is from any of the generated examples. The likelihood of a test sample y given a series of generated sample {x i } is defined as: where N is the number of generated samples, d is the dimension of y or x i and β is a hyperparameter which controls the gain (β) and punishment (1−β) to the likelihood when comparing the test sample with the generated sample. We set β = 0.95 to optimize the likelihood; other values (β∈(0.5,1]) would rescale the likelihood but without causing qualitative differences. In Fig. 3b, we plot the mean log-likelihood of 2000 samples from the test set against the number of generated samples. The faster increase of the ISL curve for the spiking network is due to better mixing, as the generated  samples cover the main modes of the test samples faster (Fig. 3d,e). To provide a frame of reference, we also plotted two additional ISL curves. The POM (product of marginals) sampler generated images by sampling each pixel individually from its intensity distribution over the entire training set. This sampler preserves the marginal probability distributions for each pixel, but discards any further structure of the image (encoded in correlations between pixel intensities). The OPT (optimal) sampler started out with a base set of 10 5 images generated with AST, from which it randomly picked images sequantially. This guarantees optimal mixing for the underlying model, because the base set covers all main modes of the state space, but consecutive samples have no correlation.
t-distributed stochastic neighbor embedding. The tSNE method 44 finds a low-dimensional map for a high-dimensional data set, in which the similarity between samples is reflected by their distances in the low-dimensional map. Here, we projected the generated digits to a plane to provide an intuitive understanding of the network dynamics and the mixing between different modes (digit classes). The Euclidean distances between high-dimensional samples {x i } are converted into symmetric pairwise similarities with variance σ i , which is determined by first defining a so-called perplexity value as the effective number of neighbors of a data point, and then running a binary search. For the low-dimensional points y i and y j mapped from the high-dimensional data points x i and x j , the similarity is defined using a t-distribution with one degree of freedom: (1 ) If the mapped points correctly model the similarity between the high-dimensional data points, the similarities p ij and q ij will be equal.
With this motivation, tSNE minimizes the sum of Kullback-Leibler divergences over all data points using a gradient descent method. The cost function C is given by  Table 2).