Precise atom manipulation through deep reinforcement learning

Atomic-scale manipulation in scanning tunneling microscopy has enabled the creation of quantum states of matter based on artificial structures and extreme miniaturization of computational circuitry based on individual atoms. The ability to autonomously arrange atomic structures with precision will enable the scaling up of nanoscale fabrication and expand the range of artificial structures hosting exotic quantum states. However, the a priori unknown manipulation parameters, the possibility of spontaneous tip apex changes, and the difficulty of modeling tip-atom interactions make it challenging to select manipulation parameters that can achieve atomic precision throughout extended operations. Here we use deep reinforcement learning (DRL) to control the real-world atom manipulation process. Several state-of-the-art reinforcement learning (RL) techniques are used jointly to boost data efficiency. The DRL agent learns to manipulate Ag adatoms on Ag(111) surfaces with optimal precision and is integrated with path planning algorithms to complete an autonomous atomic assembly system. The results demonstrate that state-of-the-art DRL can offer effective solutions to real-world challenges in nanofabrication and powerful approaches to increasingly complex scientific experiments at the atomic scale.


INTRODUCTION
Since its first demonstration in the 1990s 1 , atom manipulation using a scanning tunneling microscope (STM) has been the only experimental technique capable of realizing atomically precise structures for research on exotic quantum states in artificial lattices and atomic-scale miniaturization of computational devices. Artificial structures on metal surfaces allow tuning electronic and spin interactions to fabricate designer quantum states of matter [2][3][4][5][6][7][8] .
Recently, atom manipulation has been extended to platforms including superconductors 9,10 , 2D materials [11][12][13] , semiconductors 14,15 , and topological insulators 16 to create topological and many-body effects not found in naturally occurring materials. In addition, atom manipulation is used to build and operate computational devices scaled to the limit of individual atoms, including quantum and classical logic gates [17][18][19][20] , memory 21,22 , and Boltzmann machines 23 .
Arranging adatoms with atomic precision requires tuning tip-adatom interactions to overcome energetic barriers for vertical or lateral adsorbate motion. These interactions are carefully controlled via the tip position, bias, and tunneling conductance set in the manipulation process [24][25][26] . These values are not known a priori and must be established separately for each new adatom/surface and tip apex combination. When the manipulation parameters are not chosen correctly, the adatom movement may not be precisely controlled, the tip can crash unexpectedly into the substrate, and neighboring adatoms can be rearranged unintentionally. In addition, fixed manipulation parameters may become inefficient following spontaneous tip apex structure changes. In such events, human experts generally need to search for a new set of manipulation parameters and/or reshape the tip apex.
In recent years, deep reinforcement learning (DRL) has emerged as a paradigmatic method for solving nonlinear stochastic control problems. In DRL, a decision-making agent based on deep neural networks learns through trial and error to accomplish a task in dynamic environments 27 . Besides achieving super-human performance in games 28,29 and simulated environments [30][31][32] , state-of-the-art DRL algorithms' improved data efficiency and stability also open up possibilities for real-world adoption in automation [33][34][35][36] . In scanning probe microscopy, machine learning approaches have been integrated to address a wide variety of issues 37,38 , and DRL with discrete action spaces has been adopted to automate tip preparation 39 and vertical manipulation of molecules 40 .
In this work, we show that a state-of-the-art DRL algorithm combined with replay memory techniques can efficiently learn to manipulate atoms with atomic precision. The DRL agent, trained only on real-world atom manipulation data, can place atoms with optimal precision over 100 episodes after approximately 2000 training episodes. Additionally, the agent is more robust against tip apex changes than a baseline algorithm with fixed manipulation parameters. When combined with a path-planning algorithm, the trained DRL agent forms a fully autonomous atomic assembly system, which we use to construct a 42-atom artificial lattice with atomic precision. We expect our method to be applicable to surface/adsorbate combinations where stable manipulation parameters are not yet known.

DRL implementation
We first formulate the atom manipulation control problem as a reinforcement learning problem to solve it with DRL methods (Fig. 1(a)). Reinforcement learning problems are usually formalized as Markov decision processes, in which a decision-making agent interacts sequentially with its environment and is given goal-defining rewards. The Markov decision process can be broken into episodes, with each episode starting from an initial state s_0 and terminating when the agent accomplishes the goal or when the maximum episode length is reached. Here the goal of the DRL agent is to move an adatom to a target position as precisely and efficiently as possible. In each episode, a new random target position 0.288 nm (one lattice constant a) to 2.000 nm away from the starting adatom position is given, and the agent can perform up to N manipulations to accomplish the task. Here the episode length is set to an intermediate value N = 5 that allows the agent to attempt different ways to accomplish the goal without getting stuck in overly challenging episodes. The state s_t at each discrete time step t contains the relevant information about the environment. Here s_t is a four-dimensional vector consisting of the XY-coordinates of the target position x_target and the current adatom position x_adatom extracted from STM images (Fig. 1(c)). Based on s_t, the agent selects an action a_t ∼ π(s_t) with its current policy π. Here a_t is a six-dimensional vector comprising the bias V = 5-15 mV (a predefined range), the tip-substrate tunneling conductance G = 3-6 µA/V, and the XY-coordinates of the start and end positions of the tip during the manipulation, x_tip,start and x_tip,end. Upon executing the action in the STM, a method combining a convolutional neural network and an empirical formula is used to classify, from the tunneling current measured during manipulation, whether the adatom has likely moved (see Methods). If the method determines the adatom has likely moved, a scan is taken to update the adatom position and form the new state s_t+1. Otherwise, the scan is often skipped to save time and the state is considered unchanged, s_t+1 = s_t. The agent then receives a reward r_t(s_t, a_t, s_t+1). The reward signal defines the goal of the reinforcement learning problem. It is arguably the most important design factor, as the agent's objective is to maximize its total expected future reward. The experience at each t is stored in the replay memory buffer as a tuple (s_t, a_t, r_t, s_t+1) and used for training the DRL algorithm.
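The interaction loop described above can be sketched as follows; the `env`, `agent`, and `replay_buffer` interfaces are hypothetical stand-ins for the experimental control stack, not the authors' code:

```python
import numpy as np

def run_episode(env, agent, replay_buffer, max_steps=5):
    """One manipulation episode. State s: [x_target, y_target, x_adatom,
    y_adatom]; action a: [V, G, x_tip_start, y_tip_start, x_tip_end, y_tip_end]."""
    s = env.reset()                          # new random target 0.288-2.000 nm away
    for _ in range(max_steps):               # up to N = 5 manipulations
        a = agent.sample_action(s)           # a_t ~ pi(s_t)
        s_next, r, done = env.step(a)        # manipulate; rescan only if needed
        replay_buffer.append((s, a, r, s_next))
        s = s_next
        if done:                             # error fell below a/sqrt(3)
            break
    return replay_buffer
```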
In this study, we use a widely adopted approach for assembling atom arrangements: lateral manipulation of adatoms on (111) metal surfaces. A silver-coated PtIr tip is used to manipulate Ag adatoms on an Ag(111) surface at a temperature of ∼5 K. The adatoms are deposited on the surface by crashing the tip into the substrate in a controlled manner (see Methods). To assess the versatility of our method, the DRL agent is also successfully trained to manipulate Co adatoms on a Ag(111) surface (see Methods).
Due to difficulties in resolving the lattice of the close-packed metal (111) surface in STM topographs 43 , target positions are sampled from a uniform distribution regardless of the underlying Ag(111) lattice orientation. As a result, the optimal atom manipulation error ε, defined as the distance between the adatom and the target position ε := ||x_adatom − x_target||, ranges from 0 nm to a/√3 = 0.166 nm, as shown in Fig. 1(b) and Methods, where a = 0.288 nm is the lattice constant of the Ag(111) surface. Therefore, in the DRL problem, the manipulation is considered successful and the episode terminates if ε is smaller than a/√3. The reward is defined as r_t = r_goal − (ε_{t+1} − ε_t)/a, where the agent receives r_goal = +1 for a successful manipulation and −1 otherwise; the potential-based reward shaping term 44 −(ε_{t+1} − ε_t)/a increases reward signals and guides the training process without misleading the agent into learning sub-optimal policies.
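In code, this reward takes the following form (a direct transcription of the definition above; the constant names are ours):

```python
import numpy as np

A = 0.288                     # Ag(111) lattice constant (nm)
EPS_MAX = A / np.sqrt(3)      # ~0.166 nm, the optimal-error bound

def reward(eps_t, eps_next):
    """+1/-1 goal reward plus the potential-based shaping term
    -(eps_{t+1} - eps_t)/a, which rewards any reduction of the error."""
    goal = 1.0 if eps_next < EPS_MAX else -1.0
    return goal - (eps_next - eps_t) / A
```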
Here, we implement the soft actor-critic (SAC) algorithm 45 , a model-free and off-policy reinforcement learning algorithm for continuous state and action spaces. The algorithm aims to maximize the entropy of the policy as well as the expected reward. The state-action value function Q (modeled with the critic network) is augmented with an entropy term. Therefore, the policy π (also referred to as the actor) is trained to succeed at the task while acting as randomly as possible, and the agent is encouraged to take different actions that are similarly attractive with regard to expected reward. These designs make the SAC algorithm robust and sample-efficient. Here the policy π and the Q-functions are represented by multilayer perceptrons with parameters described in Methods. The algorithm trains the neural networks using stochastic gradient descent, in which the gradient is computed using experiences sampled from the replay buffer together with extra fictitious experiences generated by Hindsight Experience Replay (HER) 46 . HER further improves data efficiency by allowing the agent to learn from experiences in which the achieved goal differs from the intended goal. We also implement the Emphasizing Recent Experience sampling technique 47 to sample recent experience more frequently without neglecting past experience, which helps the agent adapt more efficiently when the environment changes.
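The entropy-augmented backup at the core of SAC can be illustrated with the critic target below (a minimal sketch of one ingredient, not the full algorithm; the γ and α values are common defaults, not necessarily those used here):

```python
def soft_td_target(r, q1_next, q2_next, logp_next, gamma=0.99, alpha=0.2, done=False):
    """SAC critic target: reward plus discounted soft value of the next state.
    The min over two target critics curbs overestimation; -alpha*logp_next is
    the entropy bonus that rewards acting as randomly as possible."""
    soft_value = min(q1_next, q2_next) - alpha * logp_next
    return r + (0.0 if done else gamma) * soft_value
```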

Agent training and performance
The agent's performance improves along the training process, as reflected in the reward, error, success rate, and episode length shown in Fig. 2(a,b). The agent minimizes the manipulation error and achieves a 100 % success rate over 100 episodes after approximately 2000 training episodes, or equivalently 6000 manipulations, which is comparable to the number of manipulations carried out in previous large-scale atom-assembly experiments 21,25 . In addition, the agent continues to learn to manipulate the adatom more efficiently with further training, as shown by the decreasing mean episode length. Major tip changes (marked by arrows in Fig. 2(a,b)) lead to clear yet limited deterioration of the agent's performance, which recovers within a few hundred additional training episodes.
The training is ended when the DRL agent reaches near-optimal performance after each of the several tip changes. At its best, the agent achieves a 100 % mean success rate and a 0.089 nm mean error over 100 episodes, significantly lower than one lattice constant (0.288 nm); the error distribution is shown in Fig. 2(c). Even though we cannot determine whether the adatoms are placed at the adsorption sites nearest to the targets without knowing the exact site positions, we can perform probabilistic estimates based on the geometry of the sites. For a given manipulation error ε, we can numerically compute the probability P(x_adatom = x_nearest |ε) that an adatom is placed at the site nearest to the target for two cases: assuming that only fcc sites are reachable (the blue curve in Fig. 2(c)), or assuming that fcc and hcp sites are equally reachable (the red curve in Fig. 2(c)) (see Methods). Then, using the obtained distribution p(ε) of the manipulation errors (the grey histogram in Fig. 2(c)), we estimate the probability that an adatom is placed at the nearest site to be between 61 % (if both fcc and hcp sites are reachable) and 93 % (if only fcc sites are reachable).
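For the fcc-only case, P(x_adatom = x_nearest |ε) can be reproduced numerically, for example by Monte Carlo sampling of targets over a triangular lattice of sites; this is our own sketch of such an estimate, not the authors' calculation:

```python
import numpy as np

def p_nearest_given_eps(a=0.288, n=50000, nbins=40, seed=0):
    """Estimate P(adatom placed at the nearest site | error eps), assuming
    only fcc sites, i.e. a triangular lattice with constant a (in nm)."""
    rng = np.random.default_rng(seed)
    i, j = np.mgrid[-3:4, -3:4]                      # lattice sites around origin
    sites = np.stack([a * (i + 0.5 * j),
                      a * (np.sqrt(3) / 2) * j], -1).reshape(-1, 2)
    u, v = rng.random(n), rng.random(n)              # targets uniform in one cell
    targets = np.stack([a * (u + 0.5 * v), a * (np.sqrt(3) / 2) * v], -1)
    d = np.linalg.norm(targets[:, None, :] - sites[None, :, :], axis=-1)
    is_min = d == d.min(axis=1, keepdims=True)       # is this site the nearest?
    eps_max = a / np.sqrt(3)
    bins = np.linspace(0.0, 1.2 * eps_max, nbins + 1)
    keep = d < bins[-1]
    idx = np.digitize(d[keep], bins) - 1
    hits = np.bincount(idx, weights=is_min[keep].astype(float), minlength=nbins)
    tot = np.bincount(idx, minlength=nbins)
    centers = 0.5 * (bins[:-1] + bins[1:])
    return centers, hits / np.maximum(tot, 1)
```

The estimate approaches 1 for small ε and drops to 0 beyond a/√3 ≈ 0.166 nm, consistent with the geometric bound discussed above.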

Baseline performance comparison
Next, we compare the performance of the trained DRL algorithm with a set of manually tuned baseline manipulation parameters (bias V = 10 mV, conductance G = 6 µA/V, and the tip movements shown in Fig. 2(f)) under three different tip conditions (Fig. 2(d,e)). While the baseline achieves optimal performance under tip condition 2 (100 % success rate over 100 episodes), its performance is significantly lower under the other two tip conditions, with 92 % and 68 % success rates, respectively. In contrast, the RL agent maintains relatively good performance within the first 100 episodes of continued training and eventually reaches success rates > 95 % after more training under the new tip conditions. The results show that, with continued training, the RL algorithm is more robust and adaptable to tip changes than fixed manipulation parameters.

Adsorption site statistics
The data collected during training also yields statistical insight into the adatom adsorption process and lattice orientation without atomically resolved imaging. For metal adatoms on close-packed metal (111) surfaces, the fcc and hcp hollow sites are generally the most energetically favorable adsorption sites 41,42,48 . For Ag adatoms on the Ag(111) surface, the energy of fcc sites is found to be <10 meV lower than that of hcp sites in theory 41 and in STM manipulation experiments 42 . Here the distribution of manipulation-induced adatom movements from the training data shows that Ag adatoms can occupy both fcc and hcp sites, evidenced by the six peaks ∼a/√3 = 0.166 nm from the origin (Fig. 3(a)). We also note that the adsorption energy landscape can be modulated by neighboring atoms and long-range interactions 49 . The lattice orientation revealed by the atom movements is in good agreement with the atomically resolved point contact scan in Fig. 3(b).

Artificial lattice construction
Finally, the trained RL agent is used to create an artificial kagome lattice 50 with 42 adatoms, shown in Fig. 3(c). The Hungarian algorithm 51 and the rapidly-exploring random tree (RRT) search algorithm 52 break the construction down into single-adatom manipulation tasks with manipulation distances < 2 nm, which the DRL agent is trained to handle. The Hungarian algorithm assigns adatoms to their final positions so as to minimize the total required movement. The RRT algorithm plans the paths between the start and final positions of each adatom while avoiding collisions between adatoms; note that the structure in Fig. 3(c) may contain 1 or 2 dimers, but these were likely formed before the manipulation started, as the agent avoids atomic collisions. Combining these path planning algorithms with the DRL agent results in a complete software toolkit for robust, autonomous assembly of artificial structures with atomic precision.
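The assignment step can be sketched with an off-the-shelf Hungarian solver; `scipy.optimize.linear_sum_assignment` implements it (this reflects the role described above, not necessarily the authors' exact cost function):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_adatoms(adatoms, targets):
    """Match each adatom (n x 2 array) to a target position (n x 2 array),
    minimizing the total travel distance with the Hungarian algorithm."""
    cost = np.linalg.norm(adatoms[:, None, :] - targets[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return cols, cost[rows, cols].sum()
```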
The success in training a DRL model to manipulate matter with atomic precision proves that DRL can be used to tackle problems at the atomic level, where challenges arise due to mesoscopic and quantum effects. Our method can serve as a robust and efficient technique to automate the creation of artificial structures as well as the assembly and operation of atomic-scale computational devices. Furthermore, RL by design learns directly from its interaction with the environment without needing supervision or a model of the environment, making it a promising approach for discovering stable manipulation parameters that are not straightforward for human experts to find in novel systems.
In conclusion, we demonstrate that by combining several state-of-the-art RL algorithms and thoughtfully formalizing atom manipulation into the RL framework, the DRL algorithm can be trained to manipulate adatoms with atomic precision with excellent data efficiency.
The RL algorithm is also shown to be more adaptable to tip changes than fixed manipulation parameters, thanks to its capability to continuously learn from new experiences.
We believe this study is a milestone in adopting artificial intelligence to solve automation problems in nanofabrication.
METHODS

Atom manipulation is performed at a temperature of ∼5 K in a Createc LT-STM/AFM system equipped with Createc DSP electronics and Createc STM/AFM control software (version 4.4). Individual Ag adatoms are deposited from the tip by gently indenting the apex into the surface 53 . For the baseline data and before training, we verify that adatoms can be manipulated in the up, down, left, and right directions with V = 10 mV and G = 6 µA/V following significant tip changes, and reshape the tip until stable manipulation is achieved. The Gwyddion 54 and WSxM 55 software packages were used to visualize the scan data.

Manipulating Co atoms on Ag(111) with deep reinforcement learning
In addition to Ag adatoms, deep reinforcement learning (DRL) agents are also trained to manipulate Co adatoms on Ag(111). The Co atoms are deposited directly into the STM at 5 K from a thoroughly degassed Co wire (purity > 99.99 %) wrapped around a W filament.
Two separate DRL agents are trained to manipulate Co adatoms precisely and efficiently in two distinct parameter regimes: the standard close-proximity range 56 with the same bias and tunneling conductance ranges as for Ag (bias 5-15 mV, tunneling conductance 3-6 µA/V), shown in Fig. S1, and a high-bias range 57 (bias 1.5-3 V, tunneling conductance 8-24 nA/V), shown in Fig. S2. In the high-bias regime, a significantly lower tunneling conductance is sufficient to manipulate Co atoms due to a different manipulation mechanism. In addition, a high bias (∼V) combined with a higher tunneling conductance (∼µA/V) might lead to tip and substrate damage.

Atom movement classification
STM scans following the manipulations constitute the most time-consuming part of the DRL training process. In order to reduce the STM scan frequency, we developed an algorithm to classify whether the atom has likely moved based on the tunneling current traces obtained during manipulations. Tunneling current traces during manipulations contain detailed information about the distances and directions of atom movements with respect to the underlying lattice 25 , as shown in Fig. S3. Here we join a one-dimensional convolutional neural network (CNN) classifier and an empirical formula to evaluate whether atoms have likely moved during manipulations and whether further STM scans should be taken to update their new positions.
With this algorithm, STM scans are taken after only ∼90 % of the manipulations in the training shown in Fig. 2(a,b).

CNN classifier
The current traces are standardized and repeated/truncated to match the CNN input dimension of 2048. The CNN classifier has two convolutional layers with kernel size 64 and stride 2, each followed by a max pooling layer with kernel size 4 and stride 2 and a dropout layer with probability 0.1, followed by a fully connected layer with a sigmoid activation function.

Empirical formula for atom movement prediction
We establish the empirical formula based on the observation that current traces often exhibit spikes when atoms move, as shown in Fig. S3. The empirical formula classifies an atom movement as True if the trace deviates from its mean by more than c standard deviations, i.e. if max_τ |I(τ) − Ī| > cσ, and as False otherwise, where I(τ) is the current trace as a function of the manipulation step τ, Ī and σ are its mean and standard deviation, and c is a tuning parameter set to 2-5.
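A minimal version of such a spike test might look as follows (our reading of the criterion above; the published formula may differ in detail):

```python
import numpy as np

def likely_moved(current_trace, c=3.5):
    """Flag a likely atom movement when the current trace deviates from its
    mean by more than c standard deviations (c chosen in the 2-5 range)."""
    I = np.asarray(current_trace, dtype=float)
    return bool(np.max(np.abs(I - I.mean())) > c * I.std())
```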
In the RL training, an STM scan is performed:
• when the CNN prediction is positive;
• when the empirical formula prediction is positive;
• at random with probability ∼20-40 %; and
• when an episode terminates.
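The four criteria combine into a simple decision rule (a sketch; the random-scan probability is a free parameter within the stated 20-40 % range):

```python
import random

def should_scan(cnn_positive, formula_positive, episode_done, p_random=0.3,
                rng=random):
    """Trigger an STM scan if either movement classifier fires, at random
    with probability p_random, or when the episode terminates."""
    return bool(cnn_positive or formula_positive or episode_done
                or rng.random() < p_random)
```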

Actions of trained agent
Here we analyze the mean and stochastic actions output by the trained DRL agent at the end of the training shown in Fig. 2.

Tip changes
During training, significant tip changes occurred when the tip crashed deeply into the substrate surface and the tip apex needed to be reshaped to perform manipulations with the baseline parameters. These events led to an abrupt decrease in the DRL agent's performance (shown in Fig. 2(a,b)) and to changes in the tip height and in the topographic contrast of the STM scans (shown in Fig. S7). After continued training, the DRL agent learns to adapt to the new tip conditions by manipulating with slightly different parameters, as shown in Fig. S8.

Kagome lattice assembly
We built the kagome lattice in Fig. 3(c) by repeatedly building the 8-atom units shown in Fig. S9.

Soft actor-critic
We implement the soft actor-critic algorithm with hyperparameters based on the original implementation 45 with small changes as shown in Table 1.

Emphasizing recent experience replay
In the training, the gradient descent updates are performed at the end of each episode. We perform K updates with K = episode length. For update step k = 0 ... K − 1, we uniformly sample from the most recent c_k data points according to the Emphasizing Recent Experience replay sampling technique 47 , where c_k = max(N·η^(k·1000/K), c_min), N is the length of the replay buffer, and η and c_min are hyperparameters that tune how much recent experiences are emphasized, set to 0.994 and 500, respectively.
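The sampling-range schedule can be written directly from the formula above (a sketch; the `min(c_min, N)` guard for buffers smaller than c_min is our addition):

```python
def ere_ranges(N, K, eta=0.994, c_min=500):
    """Emphasizing Recent Experience: for update step k of K, sample uniformly
    from the most recent c_k points, with c_k = max(N * eta**(k*1000/K), c_min)."""
    return [int(max(N * eta ** (k * 1000.0 / K), min(c_min, N)))
            for k in range(K)]
```

Early update steps draw from the whole buffer, while later steps concentrate on the most recent experiences, which is what lets the agent adapt quickly after a tip change without forgetting older data.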

Hindsight experience replay
We use the 'future' strategy to sample up to three goals for replay 46 .
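A sketch of the 'future' relabeling strategy, assuming a state layout of [goal (2), adatom position (2)] and a reward function that can be recomputed from a relabeled state; the names and layout are ours:

```python
import numpy as np

def her_relabel(episode, reward_fn, k=3, seed=0):
    """For each transition, add up to k extra transitions whose goals are
    adatom positions actually reached later in the same episode.
    episode: list of (s, a, r, s_next), with s = [goal_xy, adatom_xy]."""
    rng = np.random.default_rng(seed)
    extra = []
    for t, (s, a, _, s_next) in enumerate(episode):
        n_goals = min(len(episode) - t, k)
        for i in rng.choice(np.arange(t, len(episode)), size=n_goals):
            new_goal = episode[i][3][2:]          # a future achieved position
            s2, s2_next = s.copy(), s_next.copy()
            s2[:2] = s2_next[:2] = new_goal       # relabel the goal
            extra.append((s2, a, reward_fn(s2_next), s2_next))
    return extra
```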

FIG. 1. Atom manipulation with a DRL agent. (a) The DRL agent learns to manipulate atoms precisely and efficiently by interacting with the STM environment. At each t, an action command a_t ∼ π(s_t) is sampled from the DRL agent's current policy π based on the current state s_t. The policy π is modeled as a multivariate Gaussian distribution with mean and covariance given by the policy neural network. The action a_t includes the conductance G, the bias V, and the two-dimensional tip position at the start (end) of the manipulation x_tip,start (x_tip,end), which are used to move the STM tip to try to move the adatom to the target position. (b) The atom manipulation goal is to bring the adatom as close to the target position as possible. For Ag on Ag(111) surfaces, the fcc (face-centred cubic) and hcp (hexagonal close-packed) hollow sites are the most energetically favorable adsorption sites 41,42 . From the geometry of the adsorption sites, the error ε ranges from 0 nm to a/√3 depending on the target position. Therefore, the episode is considered successful and terminates if ε is lower than a/√3. (c) STM image of an Ag adatom on the Ag substrate. Bias voltage 1 V, current setpoint 500 pA.

FIG. 2. RL training results. (a, b) The rolling mean (solid lines) and standard deviation (shaded areas) of the episode reward, success rate, error, and episode length over 100 episodes showcase the training progress. The arrows indicate significant tip changes, which occurred when the tip crashed deeply into the substrate and the tip apex needed to be reshaped to perform manipulations with the baseline parameters (see Methods). The changes can be observed in the scans (see SI). (c) The probability that an atom is placed at the adsorption site nearest to the target at a given error, P(x_adatom = x_nearest |ε), is calculated considering either only fcc sites or both fcc and hcp sites (see Methods). With the error distribution of the 100 consecutive successful training episodes, we estimate the atoms are placed at the nearest site ∼93 % (only fcc sites) or ∼61 % (both fcc and hcp sites) of the time. (d, e) The RL agent, which is continually trained, and the baseline are compared under the three tip conditions that resulted from the tip changes indicated in (a, b). Under the three tip conditions, the baseline manipulation parameters lead to varying performances, whereas the RL agent converges to near-optimal performance after sufficient continued training. (f) With the baseline manipulation parameters, the tip moves from the adatom position to the target position extended by 0.1 nm.

FIG. 3. Atom manipulation statistics and autonomous construction of an artificial lattice. (a) Top: adatom movement distribution following manipulations, visualized in a Gaussian kernel density estimation plot. Adatoms are shown to reside on both fcc and hcp hollow sites. Line-cuts in the two directions r_1 and r_2 (indicated by the blue and red arrows) are shown in the bottom panel. (b) Atomically resolved point contact scan obtained by manipulating an Ag atom. Bias voltage 2 mV, current 74.5 nA. The lattice orientation is in good agreement with (a). (c) Together with the assignment and path-planning algorithms, the trained DRL agent is used to construct an artificial 42-atom kagome lattice with atomic precision. Bias voltage 100 mV, current setpoint 500 pA.
The CNN classifier is trained with the Adam optimizer with a learning rate of 10^−3 and a batch size of 64. The CNN classifier is first trained on ∼10 000 current traces from a previous experiment, reaching ∼80 % accuracy, true positive rate, and true negative rate on the test data. The CNN classifier is then continuously trained with new current traces during RL training.
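Under the architecture described above, the classifier can be sketched in PyTorch as follows; the channel widths (8, 16) and ReLU nonlinearities are our assumptions, as the text does not specify them:

```python
import torch
import torch.nn as nn

class MovementCNN(nn.Module):
    """1-D CNN per the description: two conv layers (kernel 64, stride 2),
    each followed by max pooling (kernel 4, stride 2) and dropout (p=0.1),
    then a fully connected layer with a sigmoid output."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=64, stride=2), nn.ReLU(),
            nn.MaxPool1d(4, stride=2), nn.Dropout(0.1),
            nn.Conv1d(8, 16, kernel_size=64, stride=2), nn.ReLU(),
            nn.MaxPool1d(4, stride=2), nn.Dropout(0.1),
        )
        # trace length 2048 -> 993 -> 495 -> 216 -> 107 through the stack above
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(16 * 107, 1), nn.Sigmoid())

    def forward(self, x):          # x: (batch, 1, 2048) standardized trace
        return self.head(self.features(x))
```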
First, the evolution of the actions during training (Fig. S6) shows that the agent uses larger biases and conductances with increasing training episodes. Second, as with the baseline manipulation parameters, the agent also moves the tip slightly beyond the target position. However, unlike the baseline tip movements (in which the tip moves to the target position extended by a constant length of 0.1 nm), the DRL agent moves the tip to the target position extended by a span that scales with the distance between the origin and the target. Fitting x_end (y_end) as a function of x_target (y_target) with a linear model yields x_end = 1.02 x_target + 0.08 and y_end = 1.04 y_target + 0.03 (indicated by the black lines in Fig. S5(b, c)). Third, the agent also learns the variance each action variable can have while maximizing the reward. Finally, x_start, y_start, the conductance, and the bias show weak dependences on x_target and y_target, which are however more difficult to interpret.

Fig. S9. 8 to 15 manipulations were performed to build each unit, depending on the initial positions of the adatoms, the optimality of the path planning algorithm, and the performance of the DRL agent. Overall, 66 manipulations were performed to build the 42-atom kagome lattice with atomic precision. One manipulation, together with the required STM scan, takes roughly one minute. Therefore, the construction of the 42-atom kagome lattice took around an hour, excluding the deposition of the Ag adatoms. The building time can be reduced by selecting a more efficient path planning algorithm and reducing the STM scan time.


For a transition (s_t, a_t, r_t, s_t+1) sampled from the replay buffer, min(episode length − t, 3) goals are sampled, depending on the number of future steps remaining in the episode. For each sampled goal, a new transition with the relabeled goal and the correspondingly recomputed reward is added to the minibatch and used to estimate the gradient descent updates of the critic and actor neural networks in the SAC algorithm.