Partial Scan Electron Microscopy with Deep Learning

We present a multi-scale conditional generative adversarial network that completes 512$\times$512 electron micrographs from partial scans. This allows electron beam exposure and scan time to be reduced by 20$\times$ with a 2.6% intensity error. Our network is trained end-to-end on partial scans created from a new dataset of 16227 scanning transmission electron micrographs. High performance is achieved with adaptive learning rate clipping of outlier losses and an auxiliary trainer network. Source code and links to our new dataset and trained network have been made publicly available at https://github.com/Jeffrey-Ede/partial-STEM


Introduction
Scanning transmission electron microscopy (STEM) can achieve 1-2 pm precision [1] and is able to resolve atom columns. Nonetheless, beam damage [2,3] limits materials that can be studied to inorganic crystals and select organic structures. High-resolution STEM scans also take time, taxing experimenters and allowing microscope settings to drift. In response, we have developed a generative adversarial network [4] (GAN) to reduce electron beam exposure by completing realistic electron micrographs from partial scans. Examples are shown in fig. 1.
Conditional GANs [5] consist of sets of generators and discriminators that play an adversarial game. Generators learn to produce outputs that look realistic to discriminators. Meanwhile, discriminators learn to distinguish between real and generated examples. However, discriminators only assess whether outputs look realistic; not whether they are correct. This can lead to mode collapse [6], where generators only produce a subset of outputs. To lift the degeneracy, generator learning is conditioned on a distance between generated and correct outputs that is added to the adversarial loss. Meaningful distances can be learned automatically by comparing the features that discriminators compute for real and generated images [7,8].
Deep learning has a history of successful applications to image infilling, including image completion [9], irregular gap infilling [10] and supersampling [11]. This motivates the application of deep learning to the completion of partial scans. Examples of spiral and grid-like partial scans with 1/10, 1/20, 1/40 and 1/100 coverage are in fig. 2. Most infilling networks use non-adversarial mean squared errors (MSEs) for training. However, this results in blurry and unnatural infilling for large gaps. Non-neural methods have the same issues and higher errors e.g. [12]. In contrast, a conditional GAN can ensure realistic completions for partial scans with arbitrary coverage. This paper presents a multi-scale conditional GAN that completes 512×512 STEM images from partial scans. Our network configuration, new 16227 STEM image training dataset and learning policy are described in section 2. Performance and example applications are presented in section 3. Architecture and learning policy experiments are detailed in section 4. Finally, adaptive learning rate clipping to stabilize low batch size training is presented in section 7, followed by detailed network architecture in section 8.

Training
In this section, we discuss training with the TensorFlow [13] deep learning framework. Training was performed using ADAM [14] optimized stochastic gradient descent and takes over a week on an Nvidia GTX 1080 Ti GPU with an i7-6700 CPU.

Data Pipeline
A new dataset of 16227 32-bit floating point STEM images saved to University of Warwick data servers was collated for training. The dataset consists of individual micrographs made by dozens of scientists working on hundreds of projects and therefore has a diverse constitution. The dataset is available by request and will be published as part of the 226862 labelled micrograph Warwick Large Electron Microscopy Dataset (WLEMD) in a following publication.
The dataset was split into 12170 training, 1622 validation and 2435 test micrographs. Micrographs were not shuffled before splitting to reduce intermixing of the training, validation and test sets. This meant that training, validation and testing were performed with scans collected by different sets of scientists. Each micrograph was split into non-overlapping 512×512 crops from its top-left, producing 110933 training, 21259 validation and 28877 test crops. The difference between the dataset training-validation-test split of 0.75:0.10:0.15 and the crop split of 0.69:0.13:0.18 is a result of the lack of shuffling.

Figure 3: Simplified multi-scale generative adversarial network. An inner generator produces large-scale features from inputs. These are translated to half-size completions by a trainer network and recombined with the input to generate full-size completions by an outer generator. Multiple discriminators assess multi-scale crops from input images and full-size completions.
Image crops, $I$, were preprocessed by replacing non-finite counts; NaN and ±∞, with zeroes. Next, crops were linearly transformed to have intensities $I_N \in [-1, 1]$, except for uniform crops satisfying $\max(I) - \min(I) < 10^{-6}$, where we set $I_N = 0$ everywhere. Finally, each crop was subject to a random combination of flips and 90° rotations to augment the dataset by a factor of 8.
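A minimal NumPy sketch of this preprocessing; the function name is illustrative and the exact implementation is in our source code:

```python
import numpy as np

def preprocess_crop(crop, rng=None):
    """Sketch of the crop preprocessing described above.

    Replaces non-finite counts with zero, linearly rescales intensities to
    [-1, 1] (uniform crops are set to zero everywhere), and applies a random
    flip/90-degree-rotation augmentation.
    """
    if rng is None:
        rng = np.random.default_rng()
    crop = np.where(np.isfinite(crop), crop, 0.0)  # NaN, +/-inf -> 0
    lo, hi = crop.min(), crop.max()
    if hi - lo < 1e-6:                  # uniform crop: set I_N = 0 everywhere
        crop = np.zeros_like(crop)
    else:                               # linear transform to [-1, 1]
        crop = 2.0 * (crop - lo) / (hi - lo) - 1.0
    if rng.random() < 0.5:              # random horizontal flip
        crop = np.flip(crop, axis=1)
    crop = np.rot90(crop, k=int(rng.integers(4)))  # random 90-degree rotation
    return crop
```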
To reduce noise, normalized crops were low-pass filtered to $I_{\mathrm{blur}}$ by a 5×5 symmetric Gaussian kernel with a 2.5 px standard deviation. Low-pass filtering also reduces MSE variance due to varying noise levels, allowing $I_{\mathrm{blur}}$ to be used as the ground truth for non-adversarial training. Next, noisy partial scans were simulated. We experimented with two types of partial scans: randomly perturbed rectangular grids and spirals. Spirals are used in [12] and are a natural choice as a scanning electron beam can be made to spiral by oscillating its controlling magnetic fields.
Partial scan paths, $I_{\mathrm{path}}$, were drawn by adjusting the signal-to-noise ratios of traversed pixels. Partial scan signal-to-noise was reduced where pixels were traversed partially or quickly. Full details are in our source code. Partial scans, $I_{\mathrm{scan}}$, were simulated by combining scan paths with low-pass filtered micrographs,

$I_{\mathrm{scan}} = I_{\mathrm{path}} \odot N(I_{\mathrm{blur}})$, (1)

where $N$ is a function that simulates STEM noise [15]. For simplicity, we chose $N(I) = U \odot I$, where $U$ is a uniform random variate distributed in [0, 2).
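Assuming an elementwise product between the path mask and the noised micrograph, this simulation can be sketched as follows; `simulate_partial_scan` is an illustrative name and the full noise model is in our source code:

```python
import numpy as np

def simulate_partial_scan(blurred, path_mask, rng=None):
    """Sketch of the noisy partial-scan simulation described above.

    `blurred` is the low-pass filtered ground truth and `path_mask` marks
    traversed pixels. STEM noise is approximated by the paper's simple
    choice: multiplication by a uniform variate U in [0, 2), which has
    unit mean, so expected intensities on the path are unchanged.
    """
    if rng is None:
        rng = np.random.default_rng()
    U = rng.uniform(0.0, 2.0, size=blurred.shape)  # unit-mean multiplicative noise
    return path_mask * (U * blurred)               # zero away from the beam path
```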

Network Configuration
To generate realistic images, we employ a multi-scale conditional GAN. This can be partitioned into the six subnetworks shown in fig. 3: an inner generator, $G_{\mathrm{inner}}$, outer generator, $G_{\mathrm{outer}}$, inner generator trainer, $T$, and small, medium and large scale discriminators, $D_1$, $D_2$ and $D_3$. We refer to the compound network $G(I_{\mathrm{scan}}) = G_{\mathrm{outer}}(G_{\mathrm{inner}}(I_{\mathrm{scan}}), I_{\mathrm{scan}})$ as the generator. The generator is the only network needed for inference. Multi-scale discriminators refer to the collection $D = \{D_1, D_2, D_3\}$. Detailed architecture is in section 8.
Discriminators: Multi-scale discriminators examine real and generated STEM images to predict whether they are real or generated. Essentially, discriminators adapt to the generator as it learns. Each discriminator assesses a different-sized crop, with size 70×70, 140×140 or 280×280, from 512×512 images. Typically, discriminators are applied to fractions of the full image size e.g. $1/2^0$, $1/2^1$ and $1/2^2$ times the output side length in [7]. However, we found that larger-scale discriminators have difficulty restoring realistic high-frequency STEM noise characteristics.
Using multiple discriminators at a single scale is proposed in [16] and extended to multiple scales in [7]. Following the assumption that images can be modelled as Markovian random fields, discriminators are applied to an array of non-overlapping image patches in [17]. However, discriminator arrays produce periodic artefacts [7] that have to be corrected by larger-scale discriminators. Instead, we prevent artefacts by applying multiple discriminators to random, possibly overlapping, regions at each scale.
If regions are selected using uniform random variates, pixels towards the edges of images will be examined less frequently. For a region of size $u \times v$ in an image of size $h \times w$, the number of regions covering a pixel at $(i, j)$, with $i \in \{0, \ldots, h-1\}$ and $j \in \{0, \ldots, w-1\}$, is

$c_{i,j} = \min(i+1, u, h-i, h-u+1)\,\min(j+1, v, w-j, w-v+1)$.

Uniform random region selection would therefore scale an effective generator learning rate at $(i, j)$ in proportion to $c_{i,j}$. However, each output pixel has similar importance.
To impose isotropy, we set the probability of each pixel being covered to $c_{i,j}^{-1}/\mathrm{mean}(c^{-1})$. Another option is to directly scale losses by $c_{i,j}^{-1}/\mathrm{mean}(c^{-1})$. However, this would increase gradient variance, potentially destabilizing learning. Reflection or other padding can also be used to adjust coverage; however, it would introduce discriminators to unnatural artefacts.
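Under zero-indexed integer crop offsets, the coverage counts and the resulting isotropic sampling weights can be sketched as follows; function names are illustrative:

```python
import numpy as np

def coverage_counts(h, w, u, v):
    """Number of u x v crop positions covering each pixel of an h x w image,
    assuming crops are placed at integer offsets chosen uniformly at random."""
    ci = np.minimum.reduce([np.arange(h) + 1, np.full(h, u),
                            h - np.arange(h), np.full(h, h - u + 1)])
    cj = np.minimum.reduce([np.arange(w) + 1, np.full(w, v),
                            w - np.arange(w), np.full(w, w - v + 1)])
    return np.outer(ci, cj).astype(float)

def isotropic_pixel_weights(h, w, u, v):
    """Per-pixel weights proportional to 1/c, normalized to unit mean,
    as in the isotropy correction described above."""
    inv = 1.0 / coverage_counts(h, w, u, v)
    return inv / inv.mean()
```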
For $N = 3$ discriminator scales with numbers, $N_1$, $N_2$ and $N_3$, of discriminators, $D_1$, $D_2$ and $D_3$, respectively, the total discriminator loss is

$L_{\mathrm{disc}} = \sum_{i=1}^{N} \sum_{j=1}^{N_i} \left[ \big(D_i^{(j)}(I_N) - 1\big)^2 + D_i^{(j)}\big(G(I_{\mathrm{scan}})\big)^2 \right].$

Here, discriminators learn to predict 1 and 0 labels for real and generated images, respectively. Following [18], losses are squared differences from labels; rather than the binomial cross entropy introduced in [4], as logarithms can increase gradient variance. We found that $N_1 = N_2 = N_3 = 1$ is sufficient to produce realistic images. However, higher performance might be achieved with more discriminators e.g. 2 large, 8 medium and 32 small discriminators.

Auxiliary trainer: Following Inception, we introduce an auxiliary trainer network [19,20] to provide a more direct path for gradients to back-propagate to the inner generator. Our auxiliary trainer learns to generate half-size completions, $T(G_{\mathrm{inner}}(I_{\mathrm{scan}}))$, that minimize Huberised [21] MSEs from bilinearly downsampled, half-size blurred ground truths, $I_{\mathrm{half,blur}}$,

$L_{\mathrm{aux}} = \lambda_{\mathrm{trainer}} H\big(\mathrm{MSE}(T(G_{\mathrm{inner}}(I_{\mathrm{scan}})), I_{\mathrm{half,blur}})\big),$

where $H(x) := \min(x, x^{1/2})$ and $\lambda_{\mathrm{trainer}} = 200$.

Generator: Our generator consists of two subnetworks, similar to [7]. An inner generator generates large-scale features from a half-size partial scan that are combined with input embedded by an outer generator to generate a full-size completion. The generator subnetworks are cooperative as they try to generate realistic completions that minimize the adversarial loss

$L_{\mathrm{adv}} = -\sum_{i=1}^{N} \sum_{j=1}^{N_i} D_i^{(j)}\big(G(I_{\mathrm{scan}})\big).$

We chose a hinge loss [22][23][24], which has no $D_i(I_N)$ term, to improve stability in the early stages of training.
Discriminators only assess the realism of generated micrographs; not whether they are correct. To lift the degeneracy and prevent mode collapse, we condition adversarial training on a Huberised mean squared error between generated and blurred ground truth images,

$L_{\mathrm{cond}} = \lambda_{\mathrm{cond}} H\big(\mathrm{MSE}(G(I_{\mathrm{scan}}), I_{\mathrm{blur}})\big),$

where $\lambda_{\mathrm{cond}} = 200$. To compensate for varying noise levels, ground truth images were blurred by a 5×5 symmetric Gaussian kernel with a 2.5 px standard deviation. We also tried natural statistics losses, similar to [7,8]. However, we found that non-adversarial MSE guidance converges to slightly lower MSEs and similar structural similarity indexes [25] for greyscale STEM images.
In addition, the inner generator subnetwork cooperates with the auxiliary trainer subnetwork to minimize $L_{\mathrm{aux}}$. Added together, the total generator loss is

$L_{\mathrm{gen}} = L_{\mathrm{cond}} + \lambda_{\mathrm{adv}} L_{\mathrm{adv}} + \lambda_{\mathrm{aux}} L_{\mathrm{aux}},$

where $\lambda_{\mathrm{adv}}$ and $\lambda_{\mathrm{aux}}$ control the contributions of the adversarial and auxiliary losses, respectively. In our experiments, we chose $\lambda_{\mathrm{adv}} = 5$ and $\lambda_{\mathrm{aux}} = 0.5$.
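Assuming the hinge generator term is the negated sum of discriminator scores for the generated completion, the loss composition can be sketched as follows; names are illustrative:

```python
def generator_loss(l_cond, disc_scores, l_aux, lam_adv=5.0, lam_aux=0.5):
    """Sketch of the total generator loss described above.

    `l_cond` is the Huberised conditional MSE, `disc_scores` are the
    discriminator outputs for the generated completion, and `l_aux` is the
    auxiliary trainer loss. The hinge adversarial term for the generator is
    the negated sum of discriminator scores (no term for real images).
    """
    l_adv = -sum(disc_scores)  # hinge generator loss
    return l_cond + lam_adv * l_adv + lam_aux * l_aux
```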

Learning Policy
In this subsection, we discuss our training hyperparameters and learning protocol for the multi-scale conditional GAN summarized in fig. 3. Experiments can be found in section 4 and detailed architecture is in section 8.
Optimizer: Training is ADAM optimized [14] and has two stages. In the first stage, the generator and auxiliary trainer learn to minimize mean squared errors between their outputs and ground truth images. For the first 250000 iterations, we use a constant learning rate $\eta_0 = 0.0003$ and a decay rate for the first moment of the momentum $\beta_1 = 0.9$. The learning rate is then stepwise decayed to zero in eight steps over the next 250000 iterations i.e. $\eta = \eta_0\big(1 - \mathrm{floor}(8(\mathrm{iter} - 250000)/250000)/8\big)$. Similarly, $\beta_1$ is stepwise linearly decayed to 0.5 in eight steps.
In the second stage, the generator and discriminators play an adversarial game conditioned on non-adversarial MSE guidance. For the next 250000 iterations, we use $\eta = 0.0001$ and $\beta_1 = 0.9$ for the generator and discriminators. In the final 250000 iterations, the generator learning rate is decayed to zero in eight steps while the discriminator learning rate remains constant. Similarly, generator and discriminator $\beta_1$ is stepwise decayed to 0.5 in eight steps.
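The per-stage schedule, constant and then stepwise decayed to zero in eight steps, can be sketched as follows; the stage-1 constants are used as defaults and the function name is illustrative:

```python
def stage_lr(iteration, eta0=0.0003, start_decay=250_000,
             decay_iters=250_000, steps=8):
    """Sketch of the stepwise learning-rate decay used in each stage:
    constant at eta0, then decayed to zero in `steps` equal steps."""
    if iteration < start_decay:
        return eta0
    frac = min((iteration - start_decay) / decay_iters, 1.0)
    return eta0 * (1.0 - int(steps * frac) / steps)
```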
The advantage of β 1 = 0.5 for adversarial training is demonstrated in [23]. Our decision to start at β 1 = 0.9 aims to improve the initial rate of convergence. In the first stage, generator and auxiliary trainer parameters are both updated once per training step. In the second stage, all parameters are updated once per training step.
Our $10^6$ iteration learning policy is in line with other GANs, which reuse data for 200 epochs e.g. [7]. However, we note that validation errors do not plateau even if training is increased to $2 \times 10^6$ iterations. This suggests that performance may be substantially improved by further training. All training is performed with batch size 1 due to the large model size needed to complete 512×512 scans.

Adaptive learning rate clipping: To stabilize batch size 1 training, adaptive learning rate clipping (ALRC) was applied to limit outlier MSEs. Details are in section 7.

Input normalization: Partial scans, $I_{\mathrm{scan}}$, input to the generator are linearly transformed to $(I_{\mathrm{scan}} + 1)/2 \in [0, 1]$. The generator is trained to output ground truth crops in [0, 1], which are linearly transformed to [−1, 1]. Generator outputs and ground truth crops in [−1, 1] are directly input to the discriminators.
Weight normalization: All generator weights are weight normalized [26]. Following [26,27], running mean-only batch normalization is applied to the output channels of every convolutional layer except the last. Channel means are tracked by exponential moving averages with decay rates of 0.99. Similar to [28], running mean-only batch normalization is frozen in the second half of training to improve stability.

Spectral normalization: Spectral normalization [23] is applied to the weights of each convolutional layer in the discriminators to control their Lipschitz constants. We use the power iteration method with one iteration per training step to enforce a spectral norm of 1 for each weight matrix.
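A NumPy sketch of power-iteration spectral normalization, assuming convolution weights reshaped to a 2-D matrix and a persistent left singular vector estimate `u` carried between training steps; names are illustrative:

```python
import numpy as np

def spectral_normalize(W, u, n_iters=1):
    """Estimate the largest singular value of W by power iteration and
    divide W by it, as in spectral normalization (sketch). In training,
    a single iteration per step suffices because W changes slowly."""
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = W @ v
        u /= np.linalg.norm(u) + 1e-12
    sigma = u @ W @ v  # spectral norm estimate
    return W / sigma, u
```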
Spectral normalization stabilizes training, reduces susceptibility to mode collapse and is independent of rank, encouraging discriminators to use more input features to inform decisions [23]. In contrast, weight normalization [26] and Wasserstein weight clipping [29] impose more arbitrary model distributions that may only partially match the target distribution.

Activation: In the generator, ReLU [30] non-linearities are applied after running mean-only batch normalization. In the discriminators, slope 0.2 leaky ReLU [31] non-linearities are applied after every convolution layer. Rectifier leakage encourages discriminators to use more of their inputs to inform decisions. Our choice of generator and discriminator non-linearities follows [7].

Initialization: Generator weights were initialized from a normal distribution with mean 0.00 and standard deviation 0.05. To apply weight normalization, an example scan is then propagated through the network. Each layer's output is divided by its L2 norm, and the layer's weights are divided by the square root of the standard deviation of the L2-normalized output. There are no biases in the generator as running mean-only batch normalization would allow biases to grow unbounded, cf. batch normalization [32].
Discriminator weights were initialized from a normal distribution with mean 0.00 and standard deviation 0.03. Discriminator biases were zero initialized.

Experience replay: To reduce destabilizing discriminator oscillations [33], we used an experience replay [34,35] with 50 examples. Prioritizing the replay of difficult experiences improves reinforcement learning [36], so we only replay hard examples. We define hard examples to be those for which the generator has the highest conditional losses. Examples were swapped into the experience replay on average $p_{\mathrm{in}} = 0.2$ times per iteration and independently sampled without removal with probability $p_{\mathrm{use}} = 0.2$.
In detail, examples were added to the experience replay if their conditional losses were higher than a threshold, $L_t$. This threshold was calculated from the first and second raw moments of the conditional losses, $L_{\mathrm{cond},1}$ and $L_{\mathrm{cond},2}$, tracked by exponential moving averages,

$L_{\mathrm{cond},1} \leftarrow \beta_{\mathrm{cond},1} L_{\mathrm{cond},1} + (1 - \beta_{\mathrm{cond},1}) L_{\mathrm{cond}},$
$L_{\mathrm{cond},2} \leftarrow \beta_{\mathrm{cond},2} L_{\mathrm{cond},2} + (1 - \beta_{\mathrm{cond},2}) L_{\mathrm{cond}}^2,$

where we chose $\beta_{\mathrm{cond},1} = \beta_{\mathrm{cond},2} = 0.99$.

To calculate $L_t$, we also use a moving average to monitor the rate, $r_{\mathrm{in}}$, that new examples were added to the replay,

$r_{\mathrm{in}} \leftarrow \beta_{\mathrm{in}} r_{\mathrm{in}} + (1 - \beta_{\mathrm{in}}) a,$

where $a = 1$ if an example was added at the current iteration and $a = 0$ otherwise, and we chose $\beta_{\mathrm{in}} = 0.97$. The raw moments and the rate that new examples are being added to the replay are combined to inform the update

$L_t \leftarrow \beta_t L_t + (1 - \beta_t)(L_t + \Delta L),$

where $\Delta L = \Delta L_\uparrow$ if $r_{\mathrm{in}} > p_{\mathrm{in}}$ and $\Delta L = \Delta L_\downarrow$ otherwise. We chose $\beta_t = 0.99$ and incremented the buffer threshold by $\Delta L_\uparrow = -\Delta L_\downarrow = 5\sigma_{\mathrm{cond}}$, where $\sigma_{\mathrm{cond}} = (L_{\mathrm{cond},2} - L_{\mathrm{cond},1}^2)^{1/2}$. The calculation of $L_t$ might be improved by using small geometric, rather than arithmetic, adjustments; by restricting it to being positive with the update $L_t \to \max(L_t, 0)$; and by accounting for asymmetric conditional losses, $\Delta L_\uparrow \ne -\Delta L_\downarrow$. However, further improvements to $L_t$ accuracy are unlikely to have a significant effect on training.
Our method can be extended to the sampling probabilities of individual examples in the experience replay. This would allow the replay probabilities to be increased for examples with higher conditional losses. However, we did not feel this was necessary as we already select hard examples for the experience replay. In addition, high losses may be caused by momentary quirks in the early stages of training.
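A simplified sketch of such a hard-example replay, with the threshold update reduced to fixed ±5σ nudges toward the target admission rate; class and parameter names are illustrative and the exact update rule is described above:

```python
import random

class HardExampleReplay:
    """Sketch of a hard-example experience replay.

    Examples whose conditional loss exceeds an adaptive threshold are
    stored; the threshold is nudged up or down so that examples enter the
    buffer about `p_in` times per iteration on average. Stored examples
    are independently sampled for replay with probability `p_use`.
    """
    def __init__(self, capacity=50, p_in=0.2, p_use=0.2, beta_in=0.97):
        self.buffer, self.capacity = [], capacity
        self.p_in, self.p_use, self.beta_in = p_in, p_use, beta_in
        self.threshold, self.rate = 0.0, p_in

    def step(self, example, loss, sigma):
        added = loss > self.threshold
        if added:
            if len(self.buffer) >= self.capacity:  # evict to make room
                self.buffer.pop(random.randrange(len(self.buffer)))
            self.buffer.append(example)
        # track how often examples are being added...
        self.rate = self.beta_in * self.rate + (1 - self.beta_in) * float(added)
        # ...and nudge the threshold toward the target admission rate
        self.threshold += 5 * sigma if self.rate > self.p_in else -5 * sigma
        # sample a stored example for replay with probability p_use
        if self.buffer and random.random() < self.p_use:
            return random.choice(self.buffer)
        return None
```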

Performance
To characterize our generator's performance, we map its per-pixel MSEs for adversarial and non-adversarial training. We also present example applications of our adversarial and non-adversarial networks to grid-like and spiral partial scans. Inference time is 50 ms on our GTX 1080 Ti GPU, enabling live partial scan completion.

Network Error
Mean squared errors for each pixel of our generator's output for 20000 test set images are shown in fig. 6 and tabulated in table 1. Non-adversarial training with ALRC, weighting pixel errors by their running means, ALRC of running means, or ALRC weighted by running means all produce similarly structured errors: errors increase away from the electron beam path and are especially high at the output edges. In contrast, adversarial errors are higher, as images have realistic noise characteristics. Adversarial errors are also less structured: the variance of the Laplacian for unit variance adversarial errors is 70× higher than for non-adversarial errors.

Example Scans
Example applications of a non-adversarial network to spiral and grid-like partial scans are shown in fig. 4. In practice, 1/20 scan coverage is sufficient to complete most micrographs from spiral partial scans. However, our network cannot reliably complete images with unpredictable structure in regions where there is no coverage. Performance is higher for spiral scans than grid-like scans as they have smaller gaps. Adversarial and non-adversarial test set completions are compared in fig. 5 (and fig. 1). Adversarial completions have realistic noise characteristics whereas non-adversarial completions are blurry. Adversarial completions also have more accurate colouration and less structured spatial error variation.

Experiments
In this section, we present learning curves for some of our non-adversarial architecture and learning policy experiments. All learning curves are 2500 iteration boxcar averaged. For clarity, the first $10^4$ iterations before the dashed lines, where losses rapidly decrease, are not shown.
Following [7], we used a multi-stage training protocol for our initial experiments. Inner and outer generator subnetworks were trained separately, then together. An alternative approach uses an auxiliary loss network for end-to-end training, similar to Inception [19,20]. This can provide a more direct path for gradients to back-propagate to the start of the network and introduces an additional regularization mechanism. Experimenting, we connected an auxiliary trainer to the inner generator and trained the network in a single stage. As shown by fig. 7, auxiliary-network-supported end-to-end training is more stable and converges to lower errors. For multi-stage learning curves, the first losses are reported for the inner generator. These are followed by an error spike where losses are reported for outer generator training while the inner generator is frozen. In the final stage, outer generator losses are reported as the inner and outer generator are fine-tuned together.
In encoder-decoders, residual connections [37] between strided convolutions and symmetric strided transpositional convolutions can be used to reduce information loss. This is common in noise removal networks where the output is similar to the input e.g. [38,39]. However, symmetric residual connections are also used in encoder-decoder networks for semantic image segmentation [40], where the input and output are different. Consequently, we tried adding symmetric residual connections between strided and transpositional inner generator convolutions. As shown by fig. 8, extra residuals accelerate initial inner generator training. However, final errors are slightly higher and initial inner generator training converged to similar errors with and without symmetric residuals. Taken together, this suggests that symmetric residuals initially accelerate training by enabling the final inner generator layers to generate crude outputs through their direct connections to the first inner generator layers. However, the symmetric connections also provide a direct path for low-information outputs of the first layers to reach the final layers, obscuring the contribution of the inner generator's skip-3 residual blocks (section 8) and lowering performance in the final stages of training.

Figure 9: Performance is higher for small first convolution kernels; 3×3 for the inner generator and 7×7 for the outer generator, or both 3×3, than for large first convolution kernels; 7×7 for the inner generator and 17×17 for the outer generator.
Path information is concatenated to the partial scan input to the generator. In principle, the generator can infer electron beam paths from partial scans. However, the input signal is attenuated as it travels through the network [41]. In addition, path information would have to be deduced; rather than informing calculations in the first inner generator layers, decreasing efficiency. To compensate, paths used to generate partial scans from full scans are concatenated to inputs. As shown by fig. 8, concatenating path information reduces errors throughout training. Performance might be further improved by explicitly building sparsity into the network [42].
Large kernels are often used at the start of neural networks to increase their receptive field. This allows their first convolutions to be used more efficiently. The receptive field can also be increased by increasing network depth, enabling the more efficient representation of some functions [43]. However, increasing network depth can also increase information loss [41] and representation efficiency may not be limiting. As shown by fig. 9, errors are lower for small first convolution kernels; 3×3 for the inner generator and 7×7 for the outer generator, or both 3×3, than for large first convolution kernels; 7×7 for the inner generator and 17×17 for the outer generator. This suggests that the generator does not make effective use of the larger 17×17 kernel receptive field and that the variability of the extra kernel parameters harms learning.
Learning curves for different learning rate schedules are shown in fig. 10. Increasing training iterations and doubling the learning rate from 0.0002 to 0.0004 lowers errors. Validation errors do not plateau for 10 6 iterations in fig. 11, suggesting that continued training would improve performance. Validation errors were calculated once every 50 training iterations for all experiments.
The choice of output domain can affect performance. Training with a [0, 1] output domain is compared against [−1, 1] for slope 0.01 leaky ReLU activation after every generator convolution in fig. 12. Although [−1, 1] is supported by leaky ReLUs, requiring orders of magnitude differences in scale for [−1, 0) and (0, 1] hinders learning. To limit dependence on the choice of output domain, we do not apply batch normalization or activation after the last generator convolutions in our final architecture.
The [0, 1] outputs of fig. 12 were linearly transformed to [−1, 1] and passed through a tanh non-linearity. This ensured that [0, 1] output errors were on the same scale as [−1, 1] output errors, maintaining the same effective learning rate. Initially, outputs were clipped by a tanh non-linearity to prevent outputs far from the target domain from perturbing training. However, fig. 13 shows that errors are similar without end non-linearities, so they were removed. Fig. 13 also shows that replacing slope 0.01 leaky ReLUs with ReLUs and changing all kernel sizes to 3×3 has little effect. Swapping to ReLUs and 3×3 kernels is therefore an option to reduce computation. Nevertheless, we continue to use larger kernels throughout as we think they would usefully increase the receptive field with more stable, larger batch size training.
To more efficiently use the first generator convolutions, we nearest neighbour infilled noiseless partial scans. As shown by fig. 14, infilling reduces error. However, infilling is expected to be of limited use for low-dose applications as scans can be noisy, making meaningful infilling difficult. Nevertheless, nearest neighbour partial scan infilling is a computationally inexpensive method to improve generator performance for high-dose applications.
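Nearest-neighbour infilling assigns each unscanned pixel the value of its closest scanned pixel. A brute-force NumPy sketch for clarity; in practice `scipy.ndimage.distance_transform_edt` with `return_indices=True` does this efficiently:

```python
import numpy as np

def nn_infill(scan, path_mask):
    """Sketch of nearest-neighbour infilling of a noiseless partial scan.

    Each unvisited pixel takes the value of the nearest scanned pixel
    (Euclidean distance). Brute force: O(pixels x path length).
    """
    ys, xs = np.nonzero(path_mask)
    scanned = scan[ys, xs]
    out = scan.copy()
    for i in range(scan.shape[0]):
        for j in range(scan.shape[1]):
            if not path_mask[i, j]:
                d2 = (ys - i) ** 2 + (xs - j) ** 2
                out[i, j] = scanned[np.argmin(d2)]
    return out
```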
To investigate our generator's ability to handle STEM noise [15], we combined uniform noise with partial scans of Gaussian blurred STEM images, as described by eqn. 1. More noise was added to low-intensity path segments and low-intensity pixels. As shown by fig. 15, ablating the noise associated with low-duration path segments increases performance. Fig. 16 shows that spiral path training is more stable and reaches lower errors at lower learning rates. At the same learning rate, spiral paths converge to lower errors than grid-like paths as spirals have more uniform coverage. Errors are much lower for spiral paths when both intensity- and duration-dependent noise is ablated.

Figure 15: Errors are similar with and without adding uniform noise to low-duration path segments.

Figure 16: Learning is more stable and converges to lower errors at lower learning rates. Errors are lower for spirals than grid-like paths and lowest when no noise is added to low-intensity path segments.
To choose a training optimizer, we completed training with stochastic gradient descent, momentum, Nesterov-accelerated momentum [44,45], RMSProp [46] and ADAM [14]. Adaptive momentum optimizers, ADAM and RMSProp, outperform the non-adaptive optimizers, and non-adaptive momentum-based optimizers outperform momentumless stochastic gradient descent. ADAM slightly outperforms RMSProp; however, architecture and learning policy were tuned for ADAM. This suggests that RMSProp optimization may also be a good choice.
Learning curves for 1/10, 1/20, 1/40 and 1/100 coverage spiral scans are shown in fig. 18. In practice, 1/20 coverage is sufficient for most STEM images. A non-adversarial generator can complete test set 1/20 coverage partial scans with a 2.6% root mean squared intensity error. Nevertheless, higher coverage is needed to resolve fine detail in some images. Likewise, lower coverage may be appropriate for images without fine detail. Consequently, we are considering the development of an intelligent scan system that adjusts coverage based on micrograph content.
Training is performed with batch size 1 due to the large network size needed for 512×512 partial scans. However, large error spikes destabilize batch size 1 MSE training. To stabilize learning, we developed adaptive learning rate clipping (ALRC, section 7) to limit the magnitudes of outlier losses while preserving their distributions. ALRC is compared against MSEs, Huberised MSEs, and weighting each pixel's error by its Huberised running mean or fixed final errors in fig. 19. ALRC results in more stable training with the fastest convergence and lowest errors. Similar improvements are confirmed for CIFAR-10 supersampling in section 7.

Conclusions
We have demonstrated that adversarial deep learning can realistically complete partial scans with less than 1/20 coverage. This will enable faster imaging and new beam-sensitive applications. High performance is achieved by the introduction of an auxiliary trainer network and adaptive learning rate clipping of outlier losses.

Acknowledgements
This research was funded by EPSRC grant EP/N035437/1.

Adaptive Learning Rate Clipping
To stabilize small batch size training, we developed adaptive learning rate clipping (ALRC, algorithm 1) as a computationally inexpensive method to limit outlier losses while preserving their distributions. In section 4, we showed that ALRC-stabilized partial scan completion training converges faster and achieves lower final errors than other methods. To validate ALRC, we investigate its ability to stabilize the training of supersampling networks that upsample CIFAR-10 [47,48] images to 32×32×3 after downsampling to 16×16×3.
Data pipeline: In order, images were randomly flipped left or right, had their brightness and contrast distorted, were linearly transformed to have zero mean and unit variance, and were bilinearly downsampled to 16×16×3.
Architecture: Images were upsampled and passed through the convolutional network in fig. 20. Each convolution is followed by ReLU activation, except the last. All weights were Xavier [49] initialized. Biases were zero initialized.
Learning policy: ADAM optimization was used with the hyperparameters recommended in [14] and a base learning rate of 1/1280 for 100000 iterations. The learning rate was constant in batch size 1, 4, 16 experiments and decreased to 1/12800 after 54687 iterations in batch size 64 experiments. Networks were trained to minimize mean squared or quartic errors between restored and ground truth images. Adaptive learning rate clipping was applied to limit the magnitudes of losses to either 2, 3, 4 or ∞ standard deviations above their running means. For batch sizes above 1, ALRC was applied to each loss individually.
Experiments: Example learning curves for mean squared and quartic error training are shown in fig. 21. Training is more stable and converges to lower errors for larger batch sizes. Training is less stable for quartic errors than squared errors, allowing ALRC to be examined for loss functions with different stability.
Training was repeated 10 times for each combination of adaptive learning rate threshold and batch size. Means and standard deviations of the means of the last 5000 training losses for each experiment are tabulated in table 2. Adaptive learning rate clipping has no effect on squared error training, even for batch size 1. However, it decreases errors for batch sizes 1, 4 and 16 for quartic error training.
Putting the results together, ALRC is only effective if there are large error spikes that would destabilize convergence. This situation is often encountered when using a high learning rate. However, we are using a moderate learning rate so squared errors are not spiking high enough to destabilize convergence. ALRC is less effective for large batch sizes because averaging decreases the gradients of outlier losses.
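A minimal Python sketch of ALRC, assuming outlier losses are rescaled so that their gradients, not their values, are limited to the running mean plus $n$ running standard deviations; in an autograd framework the scale factor would be applied with gradients stopped, and the initial moment estimates below are illustrative overestimates:

```python
class ALRC:
    """Sketch of adaptive learning rate clipping (algorithm 1).

    Running estimates of the first two raw moments of the loss give a
    ceiling mu1 + n*sigma; losses above it are scaled down towards it.
    """
    def __init__(self, n=3.0, beta1=0.999, beta2=0.999, mu1=10.0, mu2=200.0):
        self.n, self.beta1, self.beta2 = n, beta1, beta2
        self.mu1, self.mu2 = mu1, mu2  # overestimates decay toward true moments

    def clip(self, loss):
        sigma = max(self.mu2 - self.mu1 ** 2, 0.0) ** 0.5
        l_max = self.mu1 + self.n * sigma
        # in TensorFlow: l_dyn = loss * tf.stop_gradient(l_max / loss)
        l_dyn = loss if loss <= l_max else loss * (l_max / loss)
        # update running raw moments with the unclipped loss
        self.mu1 = self.beta1 * self.mu1 + (1 - self.beta1) * loss
        self.mu2 = self.beta2 * self.mu2 + (1 - self.beta2) * loss ** 2
        return l_dyn
```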
Algorithm 1: Adaptive learning rate clipping.
Initialize running means, µ1 and µ2, with decay rates, β1 and β2. Choose number, n, of standard deviations to clip to.
while training is not finished do
    Infer forward-propagation loss, L.
    L_max ← µ1 + n(µ2 − µ1²)^{1/2}
    if L > L_max then L_dyn ← (L_max/L)L, treating L_max/L as a constant with no gradient; else L_dyn ← L
    µ1 ← β1µ1 + (1 − β1)L
    µ2 ← β2µ2 + (1 − β2)L²
    Optimize network by back-propagating L_dyn.
end while

ALRC is easy to implement for arbitrary losses and batch sizes. An implementation is included in https://github.com/Jeffrey-Ede/partial-STEM. In addition, ALRC can be extended to other properties of loss distributions. We also experimented with transformations of error distributions to constant distributions. However, we found that this made networks numerically unstable partway through training.

Figure 21: Unclipped learning curves for 2× CIFAR-10 upsampling with batch sizes 1, 4, 16 and 64 with and without adaptive learning rate clipping of losses to 3 standard deviations above their running means. Training is more stable for squared errors than quartic errors. Learning curves are 500 iteration boxcar averaged.

Table 2: Adaptive learning rate clipping for losses 2, 3, 4 and ∞ running standard deviations above their running means for batch sizes 1, 4, 16 and 64. Each squared and quartic error mean and standard deviation is for the means of the final 5000 training errors of 10 experiments. Adaptive learning rate clipping lowers errors for unstable quartic error training at low batch sizes and otherwise has little effect. Means and standard deviations are multiplied by 100.

Network Architecture
Generator and inner generator trainer architecture is shown in fig. 22. Discriminator architecture is shown in fig. 23. The components in our networks are:
Bilinear Downsamp, w×w: An extension of linear interpolation in one dimension to two dimensions, used to downsample images to w×w.
Bilinear Upsamp, ×s: An extension of linear interpolation in one dimension to two dimensions, used to upsample images by a factor of s.
Conv d, w×w, Stride, x: Convolution with a square kernel of width, w, that outputs d feature channels. If the stride is specified, convolutions are only applied to every xth spatial element of their input, rather than to every element. Striding is not applied depthwise.
Linear, d: Flatten input and fully connect it to d feature channels.
Random Crop, w×w: Randomly sample a w×w spatial location using an external probability distribution.
⊕: Circled plus signs indicate residual connections where incoming tensors are added together. These help reduce signal attenuation and allow the network to learn perturbative transformations more easily.
All generator convolutions are followed by running mean-only batch normalization then ReLU activation, except output convolutions. All discriminator convolutions are followed by slope 0.2 leaky ReLU activation.