Molecular Identification with Atomic Force Microscopy and Conditional Generative Adversarial Networks

Frequency modulation (FM) Atomic Force Microscopy (AFM) with metal tips functionalized with a CO molecule at the tip apex has provided access to the internal structure of molecules with unprecedented resolution. We propose a model to extract the chemical information from those AFM images in order to achieve a complete identification of the imaged molecule. Our Conditional Generative Adversarial Network (CGAN) converts a stack of AFM images at various tip-sample distances into a ball-and-stick depiction, where balls of different color and size represent the chemical species and sticks represent the bonds, providing complete information on the structure and chemical composition. The CGAN has been trained and tested with the QUAM-AFM data set, which contains simulated AFM images for a collection of 686,000 molecules that include all the chemical species relevant in organic chemistry. Tests with a large set of theoretical images and a few experimental examples demonstrate the accuracy and potential of our approach for molecular identification.


Introduction
Atomic Force Microscopy (AFM) (1) in combination with dynamic operation modes (2,3) has become one of the key tools for imaging and manipulation of materials and biological systems at the nanoscale. Operated in the frequency-modulation (FM) mode (commonly known as non-contact AFM), AFM achieves true atomic-scale resolution (2,3). The use of metal tips functionalized with a CO molecule at the tip apex has provided access to the internal structure of molecules with unprecedented resolution (4,5). The main contrast mechanism for AFM with inert tips like CO is Pauli repulsion (4), which is due to the overlap of the electron densities of tip and sample. This repulsive force produces positive frequency shifts (changes in the oscillation frequency of the cantilever holding the tip due to the tip-sample interaction) that are observed as bright features in the constant-height AFM images above atom positions and bonds, reflecting the molecular structure. Increasingly accurate AFM simulation models (6,7,8,9,10) have been developed to explain the observed image contrast. They have helped to elucidate the role of the CO tilting (7), the influence of other contributions to the tip-sample interaction, like the electrostatic force (11,12), the role of the CO-metal tip charge distribution (10,13), and the interplay of the short-range chemical interaction and electrostatics in bond order discrimination and the imaging of intermolecular bonds (14).
Experimental high-resolution (HR) AFM images, together with the ability to address individual molecules, have paved the way for the identification of natural products such as breitfussin A, where the structure of some of the fragments was known but methods like NMR failed to provide the global structure (15). HR-AFM is also key in the imaging of the intermediates (including radicals) and final products generated in on-surface reactions, shedding light on the formation processes and reaction pathways (16,17,18,19). The technique has been able to resolve more than a hundred different types of molecules in asphaltenes, the solid component of crude oil (20). Molecular identification in all of the previous cases was supported by significant prior information about the nature of the molecules involved, as in the case of asphaltenes, where one is dealing essentially with polycyclic aromatic hydrocarbons based on C and H atoms. In spite of the wealth of information provided by HR-AFM experiments and these advances in the interpretation of the observed contrast, the complete identification of molecular systems, i.e. the determination of the structure and composition, solely based on HR-AFM images, without any prior information, remains an open problem.
A few works have tried to tackle this problem using Artificial Intelligence (AI) techniques (21,22) to process AFM images. Deep Learning (DL) is nowadays routinely used to classify, interpret, describe and analyze images (23,24,25,26,27,28), providing machines with capabilities that surpass those of human beings (29). The ability of DL to recognize patterns could in principle be exploited to characterize the structure of molecular systems. Gordon et al. (30) implemented a model to automate the detection of spatially correlated patterns in varied sets of AFM images of self-organised nanoparticles. However, the complete atom-by-atom identification poses a significant challenge, as both geometry and chemical composition contribute to the 3D molecular charge density that is ultimately responsible for the AFM contrast. Alldritt et al. (21) developed a Convolutional Neural Network (CNN) whose aim was to determine the molecular geometry from AFM images. The performance was excellent for the structure of quasi-planar molecules, even when applying the algorithm directly to experimental results. For 3D structures, they were able to recover information for the positions of the atoms closer to the tip, in a height range of 1.5 Å. However, the discrimination of functional groups produced inconclusive results.
In our previous work (22), we showed the feasibility of performing a very accurate automatic molecular classification with DL techniques for a set of 60 planar molecules, which include the most common atomic species in organic chemistry, using their theoretically simulated AFM images. Furthermore, we proposed a method based on a Variational Autoencoder (VAE) (31,32) to include the characteristic features of the experimental AFM images in the dataset, significantly increasing the accuracy of the model tested with experimental images. However, although this approach shows the potential to recognise both the structure and composition of molecules through AFM images, it does not come close to solving the global classification problem, since (i) classification in the usual sense with a CNN just allows a finite-length output, i.e., only a finite number of structures can be classified, and (ii) we need to consider molecules with a non-planar adsorption configuration.
In this work, we address the problem of molecular identification from a completely new perspective, using visualisation techniques that map images onto images. Image translation has been widely applied for various purposes, such as image denoising (33), data compression (34,35), synthetic data generation (36) or image segmentation (37). One of the most widely accepted methods in the community for these tasks is the CGAN. This enhancement of the original Generative Adversarial Network (GAN) (38) has demonstrated an outstanding ability to colorize images, reconstruct objects from edge maps, and synthesize photos from labelled maps, among other tasks (39). In particular, the CGAN has played a key role in problems such as the fully convolutional translation from aerial photos to maps (39), which can be considered analogous to our specific goal of molecular identification through ball-and-stick molecular depictions produced from AFM images.
The architecture of a CGAN includes two neural networks: the generator and the discriminator. The generator is responsible for converting the input images into the output ones, whereas the discriminator tries to predict whether the output image is the real one (ground truth) or has been produced by the generator. The competition between these two networks forces them to improve their performance significantly during the training. For its prediction, the discriminator compares patches of the generator's input image with its output and with the real image.
Thus, these networks specialise in translating and detecting local environments of the images respectively, making the CGAN particularly suitable for molecular identification through AFM imaging, since the contrast features induced by each atom in the images depend strongly on its chemical environment and very weakly on more distant atoms.
In our CGAN implementation, the input for the generator is a stack of 10 constant-height HR-AFM images covering the range of tip-sample distances commonly used for AFM imaging, spanning a distance variation of 1 Å. To this end, we have modified the original CGAN architecture, replacing the 2D convolutions in the first layers of the generator by 3D convolutions that allow processing multiple images. Our CGAN turns the stack of AFM images into a graphical representation, the ball-and-stick depiction, where balls of different color and size represent the different chemical species and sticks represent the bonds between the atoms, providing complete information on the structure and chemical composition. The CGAN has been trained and tested with the Quasar Science Resources - Autonomous University of Madrid Atomic Force Microscopy Image Dataset (QUAM-AFM) (40), an open-access dataset that includes simulations of theoretical AFM images for a collection of 686,000 molecules that include all the chemical species relevant in organic chemistry. Thus, the model has the ability to identify the structure and composition of any organic molecule, achieving the complete generalisation of the molecular identification problem. Below, we discuss the main points of our implementation and test its performance with a large set of theoretical images and a few experimental examples taken from the literature, in order to demonstrate the accuracy and high potential of this approach for molecular identification.

A CGAN model to identify molecules through their ball-and-stick depictions
We use a CGAN (39) to identify the molecules through ball-and-stick depictions. They represent each atomic species with balls of different colours and sizes centered at the position of the atoms, and define the structure through sticks, joining the balls, that represent the chemical bonds. Our proposal is based on the fact that this representation carries chemical information not only in the balls but also through the length of the sticks, since interatomic distances depend on the chemical species and the order of the bond (e.g. single, double and triple C-C bonds have different lengths).
The model applied for the identification is based on the implementation of the CGAN proposed in ref. (39). The CGAN model is composed of two networks, known as generator and discriminator. Figure 1 shows the structure and layers of each network. We define the stack of 10 AFM images at different tip-sample distances as input to the generator and the corresponding ball-and-stick depiction as output. Our proposal differs from the original implementation in the first layers of the generator: a dropout layer with a rate of 0.5 and two 3D convolutional layers (replacing the original 2D convolutional layers) to process the image stack. A dropout layer with such a high rate is important for the model to be able to generalize and make accurate predictions when dealing with experimental images.
During the training, the networks are confronted against each other in a zero-sum game consisting of two steps. Firstly, the generator is fed with a stack of AFM images and tries to generate the ball-and-stick representation corresponding to the molecule from which the input AFM images have been simulated. Secondly, we feed the discriminator with the AFM image stack (the same used for the generator) and also with the ball-and-stick depiction. With this data, the discriminator predicts whether the ball-and-stick depiction is the ground truth or the image generated with the generator network. In this way, we train the two networks together in an end-to-end process in which the first network learns both to fool the discriminator and to generate images as close as possible to the ball-and-stick depiction, and the discriminator learns to guess whether the second input image is real or fake. From a practical point of view, the discriminator is a network that is only useful to force the generator to improve. Therefore, once this objective has been achieved, we discard the discriminator network. The generator is in charge of generating the ball-and-stick depiction representing the atoms and bonds, providing a complete identification of the molecule.
Figure 1: Structure and layers of the generator and discriminator networks. Firstly, the generator is fed with a stack of AFM images and tries to generate the ball-and-stick representation. Secondly, we feed the discriminator with the AFM image stack (the same used for the generator) and also with the ball-and-stick depiction. With this data, the discriminator has to predict whether the ball-and-stick depiction is the ground truth or the image generated with the generator network. The models include 3D convolutional layers (red boxes), dropout layers (blue), blocks of 2D convolutional layers (yellow) and blocks of 2D transposed convolutional layers (green). For a detailed description of each block and their corresponding layers, including the activation functions, see Methods.
While most of the model details are presented in the Methods section, there are two technical points that we want to highlight, as they are important in order to explain the remarkable performance of our approach. The first one is related to how the discriminator makes its prediction. This is not achieved by a global assessment of the inputs but by comparing them segmented into patches of 16×16 pixels. This local analysis based on small patches of the images makes the CGAN especially powerful in AFM image analysis, as the features induced by the structure and composition on the AFM images depend strongly on the local chemical environment and only weakly on the global molecular configuration. The second one exploits the freedom to incorporate additional terms into the loss function used during the training. As suggested in the original CGAN implementation (39), an L1 distance (defined as the sum of the absolute differences of the components of a vector) has been added to the loss function. This distance, an alternative to the usual Euclidean L2 norm, forces the generator not only to fool the discriminator, but also to produce outputs closer to the real ones and with as little blur as possible.
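The two points above, the patch-wise decision of the discriminator and the additional L1 term, can be sketched as follows. This is only an illustrative, framework-free sketch and not our actual training code: the function names, the use of a binary cross-entropy for the adversarial term, the factor 0.5 in the discriminator loss, the averaging (rather than summing) of the L1 term and the weight lam = 100 follow the choices reported in ref. (39) and are assumptions here.

```python
import math

def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy for a single logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def generator_loss(patch_logits_fake, generated, target, lam=100.0):
    """Adversarial term (every patch should look real) plus lam times the L1 term.

    patch_logits_fake: 2D grid with one discriminator logit per image patch.
    generated, target: 2D images (lists of rows) with the same shape.
    lam: weight of the L1 term (assumed value, taken from ref. (39)).
    """
    adversarial = mean(bce_with_logits(l, 1.0)
                       for row in patch_logits_fake for l in row)
    l1 = mean(abs(g - t)
              for g_row, t_row in zip(generated, target)
              for g, t in zip(g_row, t_row))
    return adversarial + lam * l1

def discriminator_loss(patch_logits_real, patch_logits_fake):
    """Real patches are labelled 1 and generated patches are labelled 0."""
    real = mean(bce_with_logits(l, 1.0) for row in patch_logits_real for l in row)
    fake = mean(bce_with_logits(l, 0.0) for row in patch_logits_fake for l in row)
    return 0.5 * (real + fake)
```

Because the discriminator returns one logit per patch instead of a single number per image, a local error in the depiction is penalised at the position where it occurs.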

Testing the identification with simulated AFM images
In order to evaluate the accuracy of molecular identification through AFM with the CGAN, we perform a test with 3,015 structures randomly selected from the set of 81K molecules specifically reserved for this purpose from QUAM-AFM (see Methods). The test was not performed on the complete test set because the evaluation was carried out by human visual comparison between the target structure and the one predicted by the model. For each of these structures, we randomised the selection of the simulation parameters among the 24 possible combinations offered by QUAM-AFM (see Methods), resulting in 3,015 stacks of 10 AFM images at different tip-sample distances.
The results of the test shown in fig. 2 demonstrate that our method works with outstanding results: theoretically simulated AFM images contain sufficient information to carry out [...] These strongly electronegative atoms hide their bonds with the sp3 carbons, creating a triangular feature at the position of the ring and hiding also the presence of the N atom attached to it. Nevertheless, the model is able to differentiate sp3 and sp2 carbons and identify the two amino groups, leading to a perfect prediction. Figure 3 provides a quantitative estimate of the accuracy of our identification method using a global assessment and two specific evaluations focused on either structure or composition.
Figure 3: Accuracy of the model in a test where both the 3,015 structures and their simulation parameters have been randomly selected. The bar charts show (from left to right) the overall accuracy (perfect structure and atom prediction), the accuracy of structure discovery, and the accuracy in revealing the atomic species. The set of structures has been divided into four subsets according to their torsion in order to show the dependence of the model accuracy on the height difference in the atoms of the molecule. The horizontal dashed line shows the accuracy over the complete test set. The (total) accuracy has been evaluated considering that the final result is correct only if the prediction is perfect: it shows all the bonds of the molecule, the number of vertices of each structure (chain or rings), and the proper color assigned to each atom, with the exception of the hydrogens and their bonds. The structure accuracy has been calculated as the percentage of fully discovered (perfect) structures out of the total set of structures. The accuracy in the prediction of the atomic species has been evaluated as the percentage of total hits (correct predictions) over the total number of atoms in the set, without considering the hydrogens. See table S1 for details.
The model achieves a remarkable 74% of perfect predictions, which increases to 95% (96%) when considering only structure (composition). Notice that, in the total accuracy and the structure accuracy, a prediction has been considered correct only if there is a perfect match, whereas the accuracy in the prediction of each atomic species has been assessed by considering each individual atom in the molecule as correct or incorrect. This method of evaluation penalizes errors in structure discovery more than in atom determination, since in all the predictions most of the structure is revealed correctly, providing valuable information about the molecule, despite being considered as incorrect in the determination of the accuracy.
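The three figures of merit can be made explicit with a short sketch. The record layout used here (one dictionary per molecule, filled in after the visual comparison) and the function name are hypothetical and serve only to illustrate the definitions: the total and structure accuracies are counted per molecule, whereas the composition accuracy is counted per non-hydrogen atom.

```python
def evaluate(records):
    """Return (total, structure, composition) accuracies.

    Each record is a dict with the hypothetical keys:
      'structure_ok'  : True if every bond and ring/chain vertex is revealed
      'atoms_correct' : number of non-hydrogen atoms with the right color
      'atoms_total'   : number of non-hydrogen atoms in the molecule
    """
    n = len(records)
    # Structure accuracy: fraction of molecules with a perfect structure.
    structure = sum(r["structure_ok"] for r in records) / n
    # Total accuracy: perfect structure AND every atom correctly identified.
    total = sum(r["structure_ok"] and r["atoms_correct"] == r["atoms_total"]
                for r in records) / n
    # Composition accuracy: hits over all non-hydrogen atoms in the set.
    composition = (sum(r["atoms_correct"] for r in records)
                   / sum(r["atoms_total"] for r in records))
    return total, structure, composition

example = [
    {"structure_ok": True,  "atoms_correct": 10, "atoms_total": 10},
    {"structure_ok": True,  "atoms_correct": 9,  "atoms_total": 10},
    {"structure_ok": False, "atoms_correct": 10, "atoms_total": 10},
    {"structure_ok": True,  "atoms_correct": 5,  "atoms_total": 5},
]
total_acc, structure_acc, composition_acc = evaluate(example)
```

In this toy set a single wrong atom or a single missing bond makes the whole molecule count as a failure in the total accuracy, while the composition accuracy is reduced by only one atom out of 35.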
We have explored the influence on the performance of the model of the molecular corrugation, defined as the maximum height difference of the atoms in the molecule (excluding hydrogens), where the height is measured perpendicular to the molecular plane. The force curves associated with certain atomic species in different molecular moieties are quite similar. In fact, in some cases, these curves are almost identical except for a rigid translation, equivalent to a vertical displacement of the atoms. Thus, we could expect the model to mistake some of these atoms in non-planar structures where they are at different heights. The test set was split into four subsets according to the maximum height difference and the accuracy was evaluated independently for each subset. According to fig. 3, both the total and the composition accuracy decrease linearly with the maximum height difference, while the structure accuracy shows this linear behavior in the range [0,1.5] Å but has a stronger decay from 1.5 Å onwards. [...] The left panels in fig. 4 (b, c and d) show two representative AFM images, the prediction and the real structure for three molecules that have a strong torsion in their gas-phase configuration. These images show that the model perfectly identifies chemically and structurally the top part of the molecules, but fails with the bottom, where the CO tip cannot retrieve enough information during constant-height imaging, even at the shortest tip-sample distances, due to the CO lateral relaxation. These results explain the lower accuracy of the model for the molecules with stronger torsion, particularly in the case of the structure accuracy, which requires a perfect identification of the whole molecular structure. At the same time, this seems to confirm that there is a limit beyond which it is not possible to obtain information from an AFM with the current operation setups and with a single adsorption orientation of the molecule (21).
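The corrugation used to split the test set can be computed directly from the atomic coordinates. The sketch below assumes that the molecular plane is the xy plane, so that the height of an atom is its z coordinate in Å; the bin edges are hypothetical, since only the number of subsets (four) is fixed by the analysis above.

```python
from bisect import bisect_right

def corrugation(atoms):
    """Maximum height difference (in Å) among the non-hydrogen atoms.

    atoms: iterable of (element, x, y, z) tuples, with z perpendicular
    to the molecular plane (assumption of this sketch).
    """
    heights = [z for element, _x, _y, z in atoms if element != "H"]
    return max(heights) - min(heights)

def subset_index(value, edges=(0.5, 1.0, 1.5)):
    """Index (0 to 3) of the corrugation subset; the edges are illustrative."""
    return bisect_right(edges, value)

molecule = [("C", 0.0, 0.0, 0.0), ("O", 1.2, 0.0, 0.7), ("H", -1.0, 0.0, 2.0)]
height_difference = corrugation(molecule)  # the hydrogen atom is ignored
subset = subset_index(height_difference)
```

The accuracies defined above are then evaluated independently on the molecules that fall in each subset.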
We do not expect this limitation to be so crucial when dealing with a molecular identification based on experimental images, where the molecules are deposited on a substrate. The final adsorption configurations are significantly flatter than the gas-phase ones, as the attractive molecule-substrate interaction compensates for the steric hindrance effects responsible for the torsion, even in the low-reactivity substrates commonly used for AFM experiments. This idea has been tested with the three molecules in fig. 4 (b, c and d). The left panels of fig. 4 (c) show that, in the gas-phase structure, the model correctly predicts that bromine is a halogen (by bond length and ball size) but does not determine the color of the ball. A similar case is presented in fig. 4 (d), where several atoms are misclassified. We have forced these three molecules to acquire a flat structure. The corresponding AFM images, the new prediction and the structure are shown on the right panels of fig. 4 (b, c and d). The prediction becomes perfect with respect to the structure in all three cases and, composition-wise, fails only in a single atom in the case displayed in fig. 4 (d).
After the analysis presented above, it is sensible to ask if the choice of training the model with the structures in QUAM-AFM, which correspond to gas-phase configurations, is the best option for molecular identification based on experimental images. This choice has been taken in the first place to make the simulation computationally feasible, as it is simply not possible to perform the relaxations needed to determine the adsorption configurations of all the molecules in the data set on a number of different substrates. However, our choice, more than a practical consideration, is actually guided by the fact that the AFM contrast of the different chemical [...]
Beyond the subtleties in the AFM contrast created by the interplay of the chemical nature of the atoms, their chemical environment and their relative height, we have identified some misclassifications that occur with some frequency, even in rather flat configurations. In this case, although chemically they have different properties, the fact that the atoms are very electronegative and have a similar charge distribution is reflected in the similar features they show in the AFM simulations in a perfectly planar configuration (see fig. 4 (g and h)). This fact makes them extremely difficult to identify in the presence of small variations in height. Another pair that is frequently confused in the presence of height variations is O and F when connected to an aromatic ring (see fig. 4 (d)). This case is more surprising since, even though the two atoms are highly electronegative and of similar size, the O double bonded to a C of an aromatic ring should, in principle, show some distinctive feature with respect to a C-F. Although the F and O features are similar, one would expect them to be distinguishable in a planar structure. It is not clear whether this error is due to some unknown effect on the structure or whether, as they have similar sizes in the ball-and-stick representation, the model mistakes them under certain conditions.

Molecular identification based on experimental AFM images
The final goal of our CGAN model is to identify molecules from their experimental AFM images. As discussed above, the range of AFM operational parameters used to simulate the images generated for each of the molecules and the use of gas-phase configurations introduce enough [...]
To test the performance of the model with experimental results, we have selected sets of AFM images originally published in refs. (43,44,45,46,47,48). In general, fewer than ten images corresponding to different tip-sample distances were published in these papers, so we have linearly interpolated the images two by two to extract additional images to complete the input, the stack of 10 images, required for the CGAN model. In some cases the experimental results were so limited that it was necessary to weight each image differently to obtain multiple results from each image pair (see fig. 5 and figs. S1 and S2). We have denoised the generated 10-image stack by applying the medianBlur filter with size 3 from the OpenCV Python package.
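The completion of the input stack can be illustrated with a minimal sketch. It assumes that the available images are ordered by tip-sample distance and are stored as lists of rows of grey levels; the function names and the uniform spacing of the interpolation weights are assumptions of this sketch, and the median filtering step is omitted.

```python
def blend(img_a, img_b, w):
    """Pixel-wise linear interpolation: (1 - w) * img_a + w * img_b."""
    return [[(1.0 - w) * a + w * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(img_a, img_b)]

def complete_stack(images, n_out=10):
    """Resample an ordered list of at least two images to n_out images.

    Each output image is a weighted average of the two neighbouring
    experimental images, so no new information is created.
    """
    if len(images) < 2:
        raise ValueError("at least two images are needed to interpolate")
    stack = []
    for i in range(n_out):
        # Fractional position of output image i along the original stack.
        pos = i * (len(images) - 1) / (n_out - 1)
        lo = min(int(pos), len(images) - 2)  # index of the lower neighbour
        stack.append(blend(images[lo], images[lo + 1], pos - lo))
    return stack

# Two 1x1 "images" are enough to see the weights at work.
far, close = [[0.0]], [[9.0]]
stack = complete_stack([far, close])
```

With only two experimental images, the ten outputs are obtained by weighting the same pair differently, which corresponds to the most limited cases mentioned above.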
It is important to stress that the interpolated images are generated for the sole purpose of completing the input dimensions required by the model, i.e. they do not provide additional information to that supplied by the original images. Therefore, the test with experimental images is really tough: we are not only increasing the complexity by using as inputs experimental images (simply cut and edited from different publications and, in spite of the applied filter, always carrying some noise), but we are also severely reducing the amount of information with which we feed the model.
A drawback that may hinder chemical identification by experimental AFM imaging is that the observed interaction depends on the details of the tip structure, like the attachment of the CO molecule to the metal tip. Figure 5 (a) shows experimental AFM images, taken at constant height and acquired with a CO-terminated tip, for a 1-azahexacyclo[11.7.1.1^{3,19}.0^{2,7}.0^{9,21}.0^{15,20}] [...] surface. These AFM images (and, by inheritance, also their interpolations) show an imperfect threefold symmetry. Although this asymmetry could be related to the adsorption configuration of the molecule, the discussion in ref. (43) proves that it is really caused by the flexibility of the CO-Cu bond coupled with an asymmetric tip. Therefore, the chemical identification of this molecule has two additional complications, besides the lack of input data and the switch to experimental images. First, this structure is not part of the training set, so, in addition to testing the model with an experimental image, this is a perfect example to verify its ability to generalise.
Second, because tip irregularities are not considered in the theoretical simulations, the model has not been trained with images containing the characteristic features induced by these asymmetrical tips in the experimental images. Despite these drawbacks, the CGAN is not only able to reveal the molecular structure but also to predict with perfect accuracy the chemical species that make up the molecule.
Besides being robust against tip asymmetries, the model seems to perform better in the determination of the chemical composition with experimental images. As discussed above (see section 3.1), one of the most common errors in the tests performed with simulated images was to mistake O for F in complex molecules, as they produced a similar AFM contrast. However, in the prediction of this molecule through the experimental AFM images, where the symmetry is affected by the irregularity of the tip, the model identifies the three oxygens with absolute accuracy (see fig. 5 (a)). It is not possible to make a general statement, since the test with oxygens is limited to their presence in this particular structure, but this result seems to indicate that our CGAN is able to clearly differentiate some chemical species, like oxygens and fluorines, in experimental images.
Our CGAN model seems to work also with constant-height images taken using different AFM operation modes. [...] approach but to the same oscillation amplitude. Moreover, the tip-height range covered by the images (64 pm) is significantly smaller than the 100 pm that we consider optimal and has been chosen so that similar contrast features were shown in the amplitude, phase and FM images. Finally, we have included in our analysis the amplitude image at the closest distance, which shows a significantly different contrast.
In spite of these severe limitations in the input, the model fed with the amplitude images fully reveals the molecular structure and the presence of the I atom. In the case of phase and FM images, the model gives a good description of the molecular structure but fails to provide a clear prediction about the halogen, since the color is more like the one associated with bromine [...]
The prediction of the model has not been so accurate in all experimental tests. Figure 5 (d and e) shows the test performed with AFM images of dibenzothiophene and [19]dendriphene respectively. In the dibenzothiophene prediction, the model gets right both the number of rings and the number of vertices in each ring, which is clear in the AFM images taken at shorter tip-sample distances. However, the model is not able to rescale the central ring to show the bonds with their correct size. Furthermore, although the model manages to reveal a slight yellow color at the sulphur apex, the size of the bonds in the prediction is larger than in the target, so the prediction is not conclusive. It has to be noticed that, despite applying a filter, we were not able to remove the experimental noise completely. Furthermore, the central ring appears, for some unknown reason, much more deformed than in the theoretically simulated images. These two features of the experimental images may account for the failure of the prediction. However, our previous work (22) shows that these problems with experimental images can be fixed. We proposed a strategy that significantly improves the accuracy in the classification of a small set of molecules, including dibenzothiophene, from experimental images. We implemented a VAE to incorporate, from just three experimental images, characteristic features into the training set that produce an increase in the accuracy of 0.28 (from 0.62 to 0.90) in the particular case of dibenzothiophene and an increase of 0.2 for the whole set of molecules. This strategy can be extended to our CGAN model to incorporate images containing experimental features during the training in order to improve its accuracy.
The [19]dendriphene prediction is also partly a failure. Although it reveals a large part of the structure, it does not close five of the six peripheral rings. Moreover, while in most cases the prediction of the presence of carbon atoms is correct, the model tints some areas of the structure with bluish tones that do not allow one to determine conclusively whether the chemical species is a carbon or a nitrogen. It has to be noticed that the test has been performed with only three experimental images, that is, less than a third of the information with which the model was trained. At the same time, it is also remarkable that, even for such a complicated test and with very limited input information, the number of vertices of each revealed ring is correct.

Discussion
[...] We attribute the high performance of the model to the consistency and robustness shown by CNNs in the analysis of images with DL, together with the patch analysis performed by the discriminator and the use of a suitable loss function, with an L1 distance, which increases the sharpness of the predictions and makes the mapping between input and output accurate. The reduced accuracy shown for structures that have a very high internal torsion is not a critical issue when facing the identification from experimental AFM images, as real adsorbed structures tend to be flatter than the corresponding gas-phase ones. Moreover, in these high-torsion cases, the model correctly reveals both the structure and the chemical species located on the top areas of the molecule. The presence of atoms in the lower areas is indicated with bonds that are eventually blurred due to the lack of information. Thus, more than a problem of the model, this reduced accuracy represents an intrinsic limitation of the current AFM set-ups, which may be fixed by an alternative operation mode.
The few results presented for molecular identification based on experimental AFM images, in spite of the incomplete information available, are really promising. An experimental collaboration in which AFM images are systematically taken in the conditions in which QUAM-AFM was simulated, both for training the model and for testing, would be necessary to properly assess the potential of the model.

QUAM-AFM Data Set
DL models need large datasets to adjust the weights in each of their layers. In this work, we take advantage of QUAM-AFM (40), an open-access dataset that includes simulations of theoretical AFM images, based on the latest HR-AFM modeling approaches (14,49,40), for a collection of 686,000 molecules that include 10 different atomic species (C, H, N, P, O, S, F, Cl, Br, I).
Here we provide the main characteristics that are relevant for our study and refer the reader to the original publication (40) for details. QUAM-AFM focuses on quasi-planar molecules, that is, molecules which display height variations up to 1.83 Å along the z-axis, in order to include aliphatic chains and sp3 carbon atoms (methyl groups) as possible side groups.
The contrast of AFM images taken in the FM mode with CO-metal tips depends on parameters, such as the cantilever oscillation amplitude or the average tip-sample distance, that can be controlled during operation, and also on the tip nature, in particular, differences in the attachment of the CO molecule to the metal tip that have been consistently observed and characterised in experiments (49,50,51). In order to cover the widest range of variants in the AFM images, QUAM-AFM was simulated with 6 different oscillation amplitudes of the cantilever

CGAN Molecular identification model
The generator for the identification of molecules through AFM images is composed of a series of similar blocks whose main differences are the number of kernels applied in each convolution and the dimensions of each input (see fig. 1(a)). The input consists of a stack of 10 greyscale AFM images (a single channel). This stack is processed by a dropout layer with a rate of 0.5, followed by two 3D convolutional layers. The first 3D convolution includes 64 kernels of size (4,3,3), applied with a stride of (3,1,1) and padding. The second 3D convolution also has 64 kernels but, in this case, the kernels have size (4,4,4) and are applied with a stride of (4,2,2). The output of the second convolutional layer is resized to (128,128,64) and activated with a Leaky ReLU (LReLU) function.
From this point on, the encoder consists of seven blocks, represented by yellow boxes in fig. 1(a). Each block includes a 2D convolution followed by a batch normalisation and a LReLU activation function with α = 0.2. All kernels of the 2D convolutions have size (4,4) and are applied with a stride of (2,2). The 2D convolutional layers have 128, 256, 512, 512, 512, 512, and 512 kernels, listed in processing order from the layer closest to the input to the one closest to the compressed representation space. The outputs of the activations are used both to feed the next block of the encoder and to feed the decoder block of the same size. The generator decoder blocks, represented by green boxes in fig. 1(a), include the following layers: a transposed convolution, a batch normalisation, a dropout layer with rate 0.2 (only in the three layers closest to the compressed representation space, see fig. 1), a concatenation with the output of the corresponding encoder block and, finally, a Rectified Linear Unit (ReLU) activation (except for the last block, the one closest to the output, which is activated with a hyperbolic tangent function).
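The layer sizes quoted above can be checked with a short shape trace. The following is a minimal sketch, not the authors' code: it assumes 256 × 256 input images (the QUAM-AFM resolution) and 'same'-style padding in every convolution, so that each output dimension is ceil(input / stride). The helper names `same_out` and `trace_generator_shapes` are hypothetical.

```python
import math


def same_out(size, stride):
    """Output length of a 'same'-padded convolution: ceil(size / stride)."""
    return math.ceil(size / stride)


def trace_generator_shapes(depth=10, height=256, width=256):
    """Return (name, shape) pairs for the downsampling path of the generator.

    Assumption: every convolution uses 'same' padding; the text only states
    "padding" for the first 3D convolution.
    """
    shapes = [("input", (depth, height, width, 1))]

    # Two 3D convolutions with 64 kernels and strides (3,1,1) and (4,2,2).
    for name, stride in (("conv3d_1", (3, 1, 1)), ("conv3d_2", (4, 2, 2))):
        depth = same_out(depth, stride[0])
        height = same_out(height, stride[1])
        width = same_out(width, stride[2])
        shapes.append((name, (depth, height, width, 64)))

    # The stack axis has collapsed to 1, so the tensor becomes a 2D feature map.
    shapes.append(("reshape", (height, width, 64)))

    # Seven encoder blocks, each halving the spatial size (stride (2,2)).
    for i, filters in enumerate((128, 256, 512, 512, 512, 512, 512), start=1):
        height = same_out(height, 2)
        width = same_out(width, 2)
        shapes.append((f"encoder_{i}", (height, width, filters)))
    return shapes


if __name__ == "__main__":
    for name, shape in trace_generator_shapes():
        print(f"{name:>10}: {shape}")
```

Under these assumptions the two 3D convolutions reduce the 10-image axis to a single slice and give the (128,128,64) map mentioned in the text, and the seven encoder blocks end in a 1 × 1 × 512 compressed representation.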
The discriminator (fig. 1(b)) consists of a sequence of layers, starting with a concatenation of all input images (note that we can consider the 10 AFM images as a single image with 10 channels). It is followed by a 2D convolutional layer with 64 kernels of size (4,4) and a stride of (2,2), activated with LReLU. Then, it has four blocks, each consisting of a 2D convolutional layer, a batch normalisation and a LReLU activation (α = 0.2). The convolutions have 128, 256, 512 and 512 kernels, respectively, all with size (4,4) and stride (2,2). The last layer is a 2D convolution with a single kernel of size (4,4), which is activated with the sigmoid function.
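To illustrate what the "patch analysis" of the discriminator means in terms of image area, the sketch below computes the receptive field of one output unit and the size of the output grid. It is only an estimate under stated assumptions: 256 × 256 inputs, 'same'-style padding, and a stride of 1 for the final convolution, which the text does not specify. The function names are hypothetical.

```python
def receptive_field(layers):
    """Receptive field, in input pixels, of one output unit.

    `layers` is a list of (kernel, stride) pairs ordered from input to output.
    The standard recursion is applied from the output backwards:
    rf_in = (rf_out - 1) * stride + kernel.
    """
    rf = 1
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


def output_grid(size, layers):
    """Side length of the output map, assuming 'same' padding (ceil division)."""
    for _, stride in layers:
        size = -(-size // stride)
    return size


# First convolution plus four blocks, all with kernel 4 and stride 2, followed
# by the final single-kernel convolution (stride 1 is an assumption).
DISCRIMINATOR_LAYERS = [(4, 2)] * 5 + [(4, 1)]

if __name__ == "__main__":
    print("receptive field:", receptive_field(DISCRIMINATOR_LAYERS))
    print("output grid side:", output_grid(256, DISCRIMINATOR_LAYERS))
```

With these assumptions each sigmoid output judges a patch of 190 × 190 input pixels and the discriminator returns an 8 × 8 grid of decisions; a different stride or padding in the last layer would change both numbers.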

CGAN Training
The 686K structures in QUAM-AFM have been split into training, validation and test sets with 581K, 24K and 81K structures, respectively. The test set is chosen to be particularly large for two reasons. Firstly, to perform a quantitative analysis with randomly chosen structures in order to avoid a statistical fluke. Secondly, it is desirable to have sufficient variety of structures to be able to show examples that reflect the most salient strengths and weaknesses of the model.
During training, we randomly choose one of the combinations of AFM simulation parameters available in QUAM-AFM for each input stack. This variability in the input data ensures that the parameters with which the AFM experiment has been carried out do not play a decisive role in the success of the identification, prevents overfitting, and provides the model with the ability to generalize. This variability is further enhanced with the application of an IDG to the training set. This technique, commonly used in DL, applies different deformations (zoom, rotations, shifts, flips and shear) to the input images. Recall that the ball-and-stick depictions included in QUAM-AFM share the same scale as the AFM images. Thus, the IDG has to be applied to both the input AFM images and the ball-and-stick depiction during the training: i.e., if we rotate the input AFM images, then the corresponding ball-and-stick depiction must be rotated by the same angle. Otherwise, the atomic positions of the ball-and-stick representation would not match the corresponding atomic positions of the AFM images, and the CGAN would not be able to learn a local translation (from the pixel environment) between the shape and intensity of the AFM image and the type of atom that caused it. This applies to all the operations in the IDG except for the shear, which is not applied to the output ball-and-stick depiction. This is motivated by the fact that shear represents a deformation that may appear in the experiments due to noise or tip asymmetries, but it should not be present in the prediction.
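The paired augmentation described above can be sketched as follows. This is an illustrative toy, not the IDG used in the work: images are plain nested lists, rotations are restricted to multiples of 90°, zoom and shifts are omitted, and the shear is a crude row shift. All function names and the `max_shear` parameter are hypothetical.

```python
import random


def rot90(img):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def hflip(img):
    """Mirror a 2D image left to right."""
    return [row[::-1] for row in img]


def shear_rows(img, pixels_per_row):
    """Crude horizontal shear: shift row i right by round(i * pixels_per_row).

    Vacated pixels are filled with 0; pixels_per_row must be non-negative.
    """
    out = []
    for i, row in enumerate(img):
        shift = round(i * pixels_per_row)
        out.append(([0] * shift + list(row))[: len(row)])
    return out


def augment_pair(stack, target, rng, max_shear=0.2):
    """Apply one random transform to an AFM stack and its ball-and-stick target.

    Rotation and flip are shared by input and target so that atom positions
    stay aligned; the shear is applied to the input images only.
    """
    turns = rng.randrange(4)
    flip = rng.random() < 0.5
    shear = rng.uniform(0.0, max_shear)

    def geometric(img):
        for _ in range(turns):
            img = rot90(img)
        return hflip(img) if flip else img

    new_stack = [shear_rows(geometric(img), shear) for img in stack]
    new_target = geometric(target)
    return new_stack, new_target


if __name__ == "__main__":
    image = [[0, 0, 0, 0], [0, 0, 7, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
    stack, target = augment_pair([image] * 10, image, random.Random(0))
    for row in target:
        print(row)
```

The design point is that the random parameters are drawn once per training sample and reused for every image in the stack and for the target, rather than being drawn independently per image.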
We have found that the selection of appropriate deformation parameters for the IDG applied to the training set during the fitting considerably increases the accuracy of the model in the test carried out with experimental images (22). A particular example of the application of the IDG and information on the range of values used for the different operations can be found in Fig. S3.
Regarding the loss functions, the generator of the CGAN was compiled with Mean Absolute

S1 Model Accuracy with Simulated AFM Images
The objective of our Conditional Generative Adversarial Network (CGAN) is the identification of molecules through experimental Atomic Force Microscopy (AFM) imaging. However, a test with a set of computationally simulated images sheds light on the actual capability of the model due to the control we have over the molecular features. In particular, we emphasize the strong dependence of the model on the difference in atom heights. Assessing this dependence requires absolute control over the molecular torsion, which is not possible with current experimental techniques. Table S1 shows the details of a test set of 3015 molecules. To assess the effect of the atom heights on the performance of the model, the test set has been divided into four subsets according to the maximum difference in height of the atoms in the molecule (taking the height as the distance between atoms measured perpendicularly to the molecular plane).
We emphasize that the structures used in the simulation are in gas phase, which leads some of them to have a strong torsion. If these molecules were deposited on a surface, the internal torsion would decrease significantly due to the interaction with the substrate, and almost all of them would belong to the range of less than 50 pm, where the result of the simulation would be closer to the images obtained in the experiments, which are necessarily carried out on molecules adsorbed on a substrate.
Table S1: Accuracy of the model in a test where both the 3015 structures and their simulation parameters have been randomly selected. The table shows (from left to right) the range of maximum height differences between the atoms of each molecular structure in the evaluated ensemble (z diff.), the number of structures in the evaluated set (Support), the overall accuracy (perfect structure and atom prediction), the accuracy of structure discovery, and the accuracy in revealing the atomic species. The accuracy of the structures has been evaluated considering that the final result is correct only if the prediction is perfect: it shows all of the bonds of the molecule and the number of vertices in each structure (chain or rings), with the exception of the hydrogens and their bonds. The structure accuracy has been calculated as the percentage of fully discovered (perfect) structures out of the total set of structures. The accuracy in the prediction of the atomic species has been evaluated as the percentage of total hits (correct predictions) over the total number of atoms in the set, without considering the hydrogens.
Table S1 shows the power of the model to resolve molecules with very high accuracy in cases where the molecule is in a roughly flat configuration. Moreover, there seems to be a limiting distance beyond which the model cannot provide information. The reason for this decrease in accuracy lies in the constant-height technique used to obtain the images, where the distance between the tip and the atoms closest to the surface is too long for the tip-sample interaction to provide sufficient information for identification. As discussed in the main text, in spite of these limitations, the training with gas-phase structures that have a high internal torsion, rather than being a limitation, enhances the ability of the model to generalize and to recognise molecules in different adsorption configurations.
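The three accuracy measures reported in Table S1 can be written down compactly. The sketch below is a hypothetical illustration: the record layout (`structure_ok`, `atoms_correct`, `atoms_total`) is an assumption, and the toy numbers are invented for the example and are not results from the paper.

```python
def table_metrics(results):
    """Compute overall, structure and atom accuracy for a set of predictions.

    Each entry of `results` is a dict with:
      structure_ok  : True if every bond and ring/chain vertex is recovered
      atoms_correct : number of correctly identified non-hydrogen atoms
      atoms_total   : number of non-hydrogen atoms in the molecule
    """
    n = len(results)
    # Fraction of molecules whose structure is perfectly recovered.
    structure_acc = sum(r["structure_ok"] for r in results) / n
    # Fraction of molecules with perfect structure and perfect atom labels.
    overall_acc = sum(
        r["structure_ok"] and r["atoms_correct"] == r["atoms_total"]
        for r in results
    ) / n
    # Atom accuracy is pooled over all atoms in the set, not averaged per molecule.
    atom_acc = sum(r["atoms_correct"] for r in results) / sum(
        r["atoms_total"] for r in results
    )
    return {"overall": overall_acc, "structure": structure_acc, "atoms": atom_acc}


if __name__ == "__main__":
    toy = [  # illustrative values only
        {"structure_ok": True, "atoms_correct": 10, "atoms_total": 10},
        {"structure_ok": True, "atoms_correct": 9, "atoms_total": 10},
        {"structure_ok": False, "atoms_correct": 5, "atoms_total": 10},
        {"structure_ok": True, "atoms_correct": 6, "atoms_total": 6},
    ]
    print(table_metrics(toy))
```

Applying the same function to each height-difference subset separately would give one table row per subset.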

S2 Experimental Test Results
The   during training (with the exception of shear, which is not applied to the output ball and stick representation).

Figure 1: Our implementation of the CGAN structure. During the training, the generator model (a) and the discriminator model (b) are pitted against each other in a zero-sum game: firstly, the generator is fed with a stack of AFM images and tries to generate the ball-and-stick representation. Secondly, we feed the discriminator with the AFM image stack (the same used for the generator) and also with the ball-and-stick depiction. With this data, the discriminator has to predict whether the ball-and-stick depiction is the ground truth or the image generated by the generator network. The models include 3D convolutional layers (red boxes), dropout layers (blue), blocks with 2D convolutional layers (yellow) and blocks with 2D transposed convolutional layers (green). For a detailed description of each block and their corresponding layers, including the activation functions, see Methods.

Figure 2 (b and c) shows other remarkable achievements of the model, such as the identification of sp3 carbons, sulphur, oxygen and nitrogen atoms in different chemical environments and the accurate discrimination of three different halogen species (Cl in fig. 2 (b) and I and Br in fig. 2 (c)).

Figure 4 provides some important hints on the origin of the limitations of the model revealed by the statistical analysis presented above. Starting with the role of the maximum height differ- species is strongly influenced by the chemical environment. Training the model with the molecular structures in QUAM-AFM, which, in general, do not correspond to the adsorbed configuration in the experiments, provides the model with the necessary information to learn the local relationships that the different chemical species may have depending on the height. Instead of learning to identify a structure in one particular configuration, the model learns to relate atoms to their surroundings, allowing it to recognise molecules in different configurations.

Figure 4(a) demonstrates this idea. It shows the AFM images calculated for the stable adsorption configuration of mDBPc on two different substrates: a more reactive Ag(111) surface and a rather inert NaCl bilayer. The final structures are quite different and neither of them is flat. This is reflected in the different AFM contrast, which is in excellent agreement with the experiments (42). When the stack of images corresponding to these two configurations is shown to our model, the prediction for the structure and composition of the molecule is perfect in both cases, except for the position of the two internal hydrogens, which are always very difficult to determine from AFM experiments. This example with theoretical images and the experimental cases discussed below show that the training with the highly corrugated gas-phase configurations, although not enough to keep its global accuracy in the tests performed with molecules with strong torsions, is actually an important asset of the model. These structures make the model robust by showing how features associated with atomic species and molecular moieties evolve with the variation of height in different chemical environments. The choice of the molecular adsorption configurations on a particular substrate for training may lead the model Figure 4 (e and f) shows two examples where the model swaps an N-H group in a pentagon for an O atom.
variability during the training to allow the model to identify the molecule, despite the differences introduced by the substrate. We have explicitly tested this point with theoretical AFM images generated for the adsorption configurations of mDBPc on two different substrates with quite different reactivity, an Ag(111) surface and a NaCl bilayer (see fig. 4 (a)). The theoretical AFM images faithfully reproduced the experimental results (42). Now, we want to assess the accuracy of the model with experimental results. This test is limited by the scarce number of published AFM studies that include sets of images as a function of the tip height. Furthermore, most of these few studies neither provide sufficient images (10 images, taken at 0.1 Å intervals) nor cover the range of tip-sample distances (2.80-3.70 Å) that our analysis with simulated images has shown to be necessary to properly sample the variation of the tip-sample interaction and achieve complete chemical identification. Despite these drawbacks, the results presented below are really promising.

Figure 5 (b) shows the prediction performed for 2-iodotriphenylene on Ag(111) with a stack of AFM images taken using the measured oscillation amplitude in a new operation mode, Q-controlled AM-AFM with CO-functionalized tips operated in constant-height mode, proposed in ref. (44). The AFM images resulting from using both phase modulation in Q-control AM-AFM and frequency modulation (FM) modes on the same molecule, as well as the respective predictions performed by the model, are shown in fig. S1. As discussed in section S2, none of these AFM images correspond to the AFM operation mode used to simulate the AFM images employed in the training of the model. This is clear for the amplitude (fig. 5 (b)) and phase images, but it is also the case in the FM images, as the oscillation amplitude is very different (varying from 45 to 525 pm) in each of the experimental images, while the 10-image stacks used in the training correspond to different tip-sample distances of closest than the one corresponding to iodine (see fig. S1). Far from considering these predictions a failure, these results indicate that our CGAN model can provide very useful information regarding the molecular identification when fed with images taken with different AFM operation modes. Nevertheless, more work is needed to reach a final conclusion about the merits and limitations of our model for this particular case, 2-iodotriphenylene on Ag(111), as shown by the analysis of another series of constant-height images taken in the frequency modulation mode for the molecule and for the products of a dehalogenation reaction locally triggered using a voltage: a triphenylene (TP) radical and the cleaved I atom (48) (see section S2). The image features at the halogen position and their evolution with tip height in Fig. S2(a) are quite different from those shown in other experimental examples and from our AFM simulations, and the model predicts a methyl group instead of a halogen. In the case of the dehalogenation products (Fig. S2(b)), our model captures the presence of the cleaved I atom and provides a strongly deformed structure where the dehalogenated ring is not closed, consistent with the lack of information in the AFM images due to the strong bending of the molecule towards the substrate induced by the interaction of the unsaturated C bond in that ring with the metal.

Figure 5 (c) shows another rather successful identification, in this case, a 21,23-dihydroporphyrin molecule. The test has been carried out with interpolations from five experimental images that cover tip-sample distances varying in a range of 1 Å, although the average distance seems to be larger than the one used in the simulations of QUAM-AFM. The model is able to reveal the four pentagonal rings and the position of the nitrogens.
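The text does not state how the interpolated images were produced. As one plausible reading, the sketch below completes a 10-image stack by pixel-wise linear interpolation along the tip-height axis; linear interpolation, evenly spaced output heights and the function name `interpolate_stack` are all assumptions made for illustration.

```python
def interpolate_stack(images, heights, n_out=10):
    """Resample 2D images measured at increasing heights onto n_out heights.

    `images` is a list of 2D images (lists of rows) and `heights` the matching
    tip heights in ascending order. Each output image is a pixel-wise linear
    blend of the two measured images that bracket the requested height.
    """
    lo, hi = heights[0], heights[-1]
    targets = [lo + (hi - lo) * i / (n_out - 1) for i in range(n_out)]

    stack = []
    j = 0  # index of the lower bracketing image
    for z in targets:
        while j < len(heights) - 2 and z > heights[j + 1]:
            j += 1
        t = (z - heights[j]) / (heights[j + 1] - heights[j])
        lower, upper = images[j], images[j + 1]
        stack.append(
            [
                [(1 - t) * a + t * b for a, b in zip(row_lo, row_up)]
                for row_lo, row_up in zip(lower, upper)
            ]
        )
    return targets, stack


if __name__ == "__main__":
    # Five toy 2x2 "images" spanning 1 Å, as in the example discussed above.
    heights = [0.0, 0.25, 0.5, 0.75, 1.0]
    images = [[[h, h], [h, h]] for h in heights]
    targets, stack = interpolate_stack(images, heights)
    print(len(stack), [round(z, 3) for z in targets])
```

Linear blending cannot add information that is absent from the measured images, so a stack obtained this way is only a stand-in for ten genuinely independent scans.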
In summary, our results show the potential for chemical and structural identification of molecules encoded in AFM images. We propose a CGAN to generalise the accurate classification of a small set of molecules achieved in our previous work (22) into a general-purpose tool to completely determine the structure and composition of arbitrary organic molecules. Our model performs a direct translation between a stack of 10 constant-height AFM images and the ball-and-stick depiction of the molecule. We are only limited by the fact that the atoms composing the molecule have to be in the training dataset. Since QUAM-AFM (40) includes the most relevant chemical species in organic chemistry, the model prediction is practically unconstrained. Molecular identification in both theoretical and experimental images is highly accurate with a model trained exclusively with theoretical images. The ability of the model to reveal molecular structures and chemical species is truly remarkable, beyond the capabilities of a human expert in the field. Moreover, these identifications are not conditioned to a single molecular configuration, since the differences in height of the atoms in the gas-phase structures included in the training dataset provide enough information to identify patches of the image according to the chemical environment of each atom. In this way, the model has learned to decipher the distortions produced by each chemical species in relation to its surroundings regardless of the relative height difference in the molecule.

(0.40, 0.60, 0.80, 1.00, 1.20, 1.40 Å), 10 tip-sample distances (2.80, 2.90, 3.00, 3.10, 3.20, 3.30, 3.40, 3.50, 3.60, 3.70 Å), and 4 values of the elastic constant describing the tilting stiffness of the CO-metal bond (0.40, 0.60, 0.80, 1.00 N/m). These 240 combinations are applied to each of the molecular structures, resulting in a total of 165 million grey-scale images with a resolution of 256 × 256 pixels. QUAM-AFM also provides the ball-and-stick depictions of each molecule generated from the atomic coordinates. These depictions share the same scale used in the AFM images: if we superimpose the two images, each ball of the representation is centered on the position occupied by the atom it represents in the AFM images.
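The parameter grid and the image count quoted above can be reproduced in a few lines. This is a small sketch using only the values listed in the text; the helper `random_stack_parameters` is hypothetical and assumes that one input stack holds the 10 tip-sample distances at a single amplitude and a single stiffness.

```python
import random
from itertools import product

AMPLITUDES = (0.40, 0.60, 0.80, 1.00, 1.20, 1.40)  # oscillation amplitude, Å
DISTANCES = (2.80, 2.90, 3.00, 3.10, 3.20, 3.30, 3.40, 3.50, 3.60, 3.70)  # Å
STIFFNESSES = (0.40, 0.60, 0.80, 1.00)  # CO tilting stiffness, N/m
N_MOLECULES = 686_000

# Every (amplitude, distance, stiffness) triple is simulated for each molecule.
COMBINATIONS = list(product(AMPLITUDES, DISTANCES, STIFFNESSES))
N_IMAGES = len(COMBINATIONS) * N_MOLECULES


def random_stack_parameters(rng):
    """Pick the simulation parameters of one 10-image input stack.

    Assumption: amplitude and stiffness are fixed within a stack and only the
    tip-sample distance varies from image to image.
    """
    amplitude = rng.choice(AMPLITUDES)
    stiffness = rng.choice(STIFFNESSES)
    return [(amplitude, z, stiffness) for z in DISTANCES]


if __name__ == "__main__":
    print(len(COMBINATIONS), "combinations,", N_IMAGES, "images")
    print(random_stack_parameters(random.Random(0))[:2])
```

The product gives 6 × 10 × 4 = 240 combinations and 164.64 million images, consistent with the rounded figure of 165 million; under the stated assumption there are 24 distinct stacks per molecule to draw from during training.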
Error (MAE) (using the parameter λ = 100 defined by Isola et al. (39)), while the binary cross entropy was used for the discriminator. The model was optimised with batches of 32 inputs using the Adaptive Moment Estimation (Adam) optimiser, with the learning rate and first moment parameter set to 2 × 10⁻⁴ and 0.5, respectively. The training of the model was carried out for six epochs (109K iterations), displaying 300 predictions of the validation set every 10,000 iterations to estimate the optimal training point.
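As a rough illustration of how these losses combine, the sketch below follows the formulation of Isola et al., in which the generator objective is an adversarial binary cross entropy term plus λ times the L1 (MAE) distance to the ground truth. Whether the adversarial term is weighted exactly this way in the present work is an assumption, and all function names are hypothetical. The last lines check the quoted iteration count from the training-set and batch sizes.

```python
import math

LAMBDA = 100.0  # weight of the L1 term, as quoted in the text


def bce(probabilities, label, eps=1e-7):
    """Mean binary cross entropy of predicted probabilities against one label."""
    total = 0.0
    for p in probabilities:
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= label * math.log(p) + (1.0 - label) * math.log(1.0 - p)
    return total / len(probabilities)


def mae(prediction, target):
    """Mean absolute error between two flattened images."""
    return sum(abs(p - t) for p, t in zip(prediction, target)) / len(prediction)


def generator_loss(disc_on_fake, fake, real):
    """Adversarial term (fool the discriminator) plus LAMBDA times the L1 term."""
    return bce(disc_on_fake, 1.0) + LAMBDA * mae(fake, real)


def discriminator_loss(disc_on_real, disc_on_fake):
    """Real patches should score 1 and generated patches should score 0."""
    return bce(disc_on_real, 1.0) + bce(disc_on_fake, 0.0)


# Six epochs over 581K training structures in batches of 32.
ITERATIONS = 6 * math.ceil(581_000 / 32)

if __name__ == "__main__":
    print(generator_loss([0.5], [0.0, 1.0], [0.0, 0.5]))
    print(ITERATIONS)
```

With λ = 100 the L1 term dominates unless the prediction is already close to the target, which is what pushes the generator towards pixel-accurate ball-and-stick images. The iteration estimate, 108,942, agrees with the quoted 109K.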
Figure S1: (a) Phase modulation AFM images and (b) frequency shift (FM) images for 2-iodotriphenylene on Ag(111). The first row of each panel shows the six experimental AFM images originally published in Figure S4 in ref. (1) and reproduced by courtesy of the American Physical Society (APS). The second row shows the results of the interpolation. The right-hand column shows the ball-and-stick depiction of the structures and the predictions performed by the model. In spite of the different operation modes (FM images are taken with different oscillation amplitudes) and the reduced tip-height range of 64 pm (compared to 100 pm in training), our CGAN is able to fully reveal the ring structure and the presence of a halogen atom, although the predicted color is more like the one associated with bromine than the one corresponding to iodine. Phase images were taken in constant-height scans while operating the AFM in amplitude modulation with Q-control (Q_eff = 2060) using free oscillation amplitudes of 45, 70, 114, 175, 280, and 525 pm. FM images were also taken in constant-height scans using the same oscillation amplitudes. Before each phase image, the STM feedback was activated above the Ag(111) surface with the same tunneling parameters (I = 10 pA, U = 7 mV) and the desired oscillation amplitude in FM mode. The average distances between the tip and the substrate were then chosen with respect to this reference so that similar contrast features were shown in the phase and FM images and also in the amplitude images shown in Fig. 5 in the main text. These average tip-substrate distances are 76, 87, 100, 111, 121, and 130 pm for the different oscillation amplitudes, as marked in the figure.

Figure S1 shows six phase (top of panel a) and frequency modulation (FM) (top of panel b) AFM images published in Figure S4 in the supplementary material of ref. (1), together with the interpolations that we have made to complete the 10-image stack needed as input for our CGAN, and the predictions of our CGAN. Phase images were taken in constant-height scans while operating the AFM in amplitude modulation with Q-control (Q_eff = 2060) using free oscillation amplitudes of 45, 70, 114, 175, 280, and 525 pm. FM images were also taken in constant-height scans using the same oscillation amplitudes. Before each phase image, the STM feedback was activated above the Ag(111) surface with the same tunneling parameters (I = 10 pA, U = 7 mV) and the desired oscillation amplitude in FM mode. The average distances between the tip and the substrate were then chosen with respect to this reference so that similar contrast features were shown in the phase and FM images and also in the amplitude images shown in Fig. 5 in the main text. These average tip-substrate distances are 76, 87, 100, 111, 121, and 130 pm for the different oscillation amplitudes, as marked in the figure. Before looking at the predictions of our CGAN, it has to be stressed that none of these AFM images correspond to the AFM operation mode used to simulate the AFM images employed in the training of the model. This is clear for the phase images, but it is also the case in the FM images, as the oscillation amplitude is different in each of the images, while the 10-image stacks used in the training correspond to different tip-sample distances of closest approach but to the same oscillation amplitude. Moreover, the tip-height range covered by the images (64 pm) is significantly smaller than the 100 pm that we consider optimal. In spite of these strong limitations in the input, our CGAN is able to fully reveal the ring structure and the presence of

Figure S2 shows the predictions of our CGAN made with the images published in the supplementary material of ref. (3). Figure S2(a) corresponds to a series of frequency shift images of 2-iodotriphenylene in constant-height mode (Vgap = -0.57 mV, A = 52 pm) taken from Fig. S1(g) in ref. (3) and the corresponding prediction. The distance values (z values) are given with respect to a tunneling current of I = 10 pA and a gap voltage of Vgap = 4 mV above the Ag(111) surface (see ref. (3) for details). In this case, the model fails to predict the presence of the halogen, which is replaced by a methyl group. A priori, this should be a simple case, since there are 9 AFM images and the tip-to-sample distance increments are similar to those used to simulate the QUAM-AFM image stacks. However, for some unknown reason, the image features at the halogen position and their evolution with tip height are quite different from those shown in other experimental examples and from our AFM simulations. We speculate that this is the reason behind the failure of the model in the identification of the iodine. More work on both the experimental and theoretical side would be required in the future to properly understand this case. Lastly, fig. S2(b) shows a series of frequency shift images in constant-height mode (Vgap = -0.57 mV, A = 52 pm) of the products of a dehalogenation reaction locally triggered using a voltage: a triphenylene (TP) radical and the cleaved I atom (Fig. S2(g) in ref. (3)). The distance