Recreation of the periodic table with an unsupervised machine learning algorithm

In 1869, the first draft of the periodic table was published by the Russian chemist Dmitri Mendeleev. In terms of data science, his achievement can be viewed as a successful example of feature embedding based on human cognition: the chemical properties of all elements known at the time were compressed onto a two-dimensional grid for tabular display. In this study, we seek to answer the question of whether machine learning can reproduce or recreate the periodic table from the observed physicochemical properties of the elements. To achieve this goal, we developed a periodic table generator (PTG). The PTG is an unsupervised machine learning algorithm based on generative topographic mapping, which can automate the translation of high-dimensional data into a tabular form with varying layouts on demand. The PTG autonomously produced various arrangements of chemical symbols, organized as a two-dimensional array, such as Mendeleev's periodic table, or as a three-dimensional spiral table, according to the underlying periodicity in the given data. We further show what the PTG learned from the element data and how element features, such as melting point and electronegativity, are compressed into the lower-dimensional latent spaces.

[Figure] A heatmap of the element data used in this study. A detailed description of the element-level properties is given in Fig. S2.

Detailed Method of GTM-LDLV
Our learning method can be considered an extension of the generative topographic mapping (GTM) proposed by Bishop et al. [1]. GTM is a latent variable model that represents the probability density of data using a nonlinear function of lower-dimensional latent variables. It can be regarded as a stochastic formulation of the self-organizing map (SOM) [2].
In GTM, grid points (called "nodes" hereafter) $\mathbf{z}_1, \dots, \mathbf{z}_K$, regularly arranged in the $L$-dimensional latent space, are prepared for data visualization, and we consider a nonlinear function $\mathbf{y}(\mathbf{z}_k; W)$ that maps the nodes $\mathbf{z}_k$ to points $\mathbf{y}_k = \mathbf{y}(\mathbf{z}_k; W)$ in the $D$-dimensional feature space. The dimension $L$ of the latent space is set to three or less for visualization. $W$ is a parameter set that determines $\mathbf{y}(\mathbf{z}_k; W)$. It is assumed that each $D$-dimensional feature vector $\mathbf{x}_i$ is generated independently from a restricted mixture of $K$ Gaussian distributions, where all mixing coefficients are $1/K$, the mean of the $k$-th Gaussian is $\mathbf{y}_k = \mathbf{y}(\mathbf{z}_k; W)$, and the covariance matrices are all $\beta^{-1} I_D$. The distribution is then given by

$$p(\mathbf{x}_i \mid W, \beta) = \frac{1}{K} \sum_{k=1}^{K} \mathcal{N}\big(\mathbf{x}_i \mid \mathbf{y}(\mathbf{z}_k; W),\, \beta^{-1} I_D\big). \qquad (1)$$

If $\mathbf{y}(\mathbf{z}_k; W)$ is a smooth nonlinear function, the nodes $\mathbf{z}_k$ are mapped onto $\mathbf{y}_k$ while maintaining their topological relationships in the latent space. GTM can thus be seen as a mixture of Gaussian distributions whose means are restricted to a lower-dimensional manifold.
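As a concrete illustration, the following Python sketch evaluates the per-node Gaussian densities implied by equation (1) and the posterior probability that each sample was generated by node $k$. The node images $\mathbf{y}(\mathbf{z}_k; W)$ are stubbed with a fixed random matrix rather than a learned mapping, and all function names here are illustrative, not part of the published method.

```python
import numpy as np

def log_gaussian(x, mean, beta, D):
    """Log density of N(x | mean, beta^{-1} I_D)."""
    return 0.5 * D * np.log(beta / (2 * np.pi)) - 0.5 * beta * np.sum((x - mean) ** 2)

def responsibilities(X, Y, beta):
    """Posterior probability that x_i was generated by node k.

    X: (N, D) data; Y: (K, D) node images y(z_k; W); beta: precision.
    The mixing coefficients 1/K cancel in the normalization.
    """
    N, D = X.shape
    K = Y.shape[0]
    log_r = np.empty((N, K))
    for i in range(N):
        for k in range(K):
            log_r[i, k] = log_gaussian(X[i], Y[k], beta, D)
    log_r -= log_r.max(axis=1, keepdims=True)  # stabilize before exponentiating
    R = np.exp(log_r)
    return R / R.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))    # 5 toy feature vectors, D = 4
Y = rng.normal(size=(25, 4))   # stand-in images of 5x5 nodes, K = 25
R = responsibilities(X, Y, beta=1.0)
print(R.shape, R.sum(axis=1))  # (5, 25); each row sums to 1
```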
In GTM, the function $\mathbf{y}(\mathbf{z}_k; W)$ is constructed by a Gaussian process (GP) [3]. The nature of the GP is determined by the choice of a covariance function. The conventional GTM model uses a covariance function with a constant length-scale throughout the latent space, and therefore cannot locally adapt the smoothness of the nonlinear function representing the distribution of the observed data to the value of the latent variable. The underlying patterns of the element data are considered nonlinear and highly complex, so we require a GTM model that can represent more flexible functions. We therefore focused on GTM-LDLV [4], a recently proposed GTM model that can control the smoothness of the nonlinear function locally, according to the value of the latent variable.
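To make the distinction concrete, the sketch below compares a stationary RBF covariance (one global length-scale) with a Gibbs-type nonstationary covariance whose length-scale varies with the latent variable. The Gibbs kernel is a standard construction for input-dependent smoothness and is used here purely as an illustration; it is not necessarily the exact covariance adopted in GTM-LDLV.

```python
import numpy as np

def rbf_kernel(Z, variance=1.0, length_scale=0.3):
    """Stationary RBF covariance: a single global length-scale."""
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-d2 / (2 * length_scale ** 2))

def gibbs_kernel(Z, ell):
    """Nonstationary (Gibbs) covariance: ell(z) gives a local length-scale.

    ell: callable mapping latent points (K, L) -> positive scales (K,).
    """
    l = ell(Z)                     # local length-scales l(z_k)
    l2 = l[:, None] ** 2 + l[None, :] ** 2
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    L_dim = Z.shape[1]
    pref = (2 * np.outer(l, l) / l2) ** (L_dim / 2)
    return pref * np.exp(-d2 / l2)

# 5x5 grid of nodes in a 2-D latent space, as in the square layout.
Z = np.stack(np.meshgrid(np.linspace(0, 1, 5), np.linspace(0, 1, 5)), -1).reshape(-1, 2)
K_const = rbf_kernel(Z)
K_local = gibbs_kernel(Z, lambda z: 0.1 + 0.5 * z[:, 0])  # smoother as z1 grows
print(K_const.shape, K_local.shape)  # (25, 25) each
```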
In GTM-LDLV, it is assumed that the $D$-dimensional feature vector $\mathbf{x}_i$ is generated independently from the restricted mixture of $K$ Gaussian distributions defined in equation (1), and the nonlinear function $\mathbf{y}(\mathbf{z})$ is modelled as the product of two functions, a $D$-dimensional vector-valued function $\mathbf{h}(\mathbf{z})$ and a positive scalar function $g(\mathbf{z})$:

$$\mathbf{y}(\mathbf{z}) = g(\mathbf{z})\, \mathbf{h}(\mathbf{z}).$$

Their joint distribution is then defined over the data and the parameters, where $\mathbf{g}$ is the vector $(g(\mathbf{z}_k))_{k=1,\dots,K}$ and $H$ is the matrix $(\mathbf{h}(\mathbf{z}_k))_{k=1,\dots,K}$. The prior distribution of $g(\mathbf{z})$ is a truncated GP with mean 0 and covariance function $k_g(\mathbf{z}_i, \mathbf{z}_j; \theta_g)$, which handles positive-bounded random functions. The prior distribution of the $d$-th entry $h_d(\mathbf{z})$ of $\mathbf{h}(\mathbf{z})$ is a GP with mean 0 and covariance function $k_h(\mathbf{z}_i, \mathbf{z}_j)$. The prior distributions of the parameters $\mathbf{g}$ and $H$ are thus

$$p(\mathbf{g} \mid \theta_g) = \mathcal{N}_+(\mathbf{g} \mid \mathbf{0}, K_g), \qquad p(\mathbf{h}^{(d)}) = \mathcal{N}(\mathbf{h}^{(d)} \mid \mathbf{0}, K_h),$$

where $\mathcal{N}_+$ is a truncated normal distribution that handles positive-bounded random vectors, $\mathbf{h}^{(d)}$ is the vector of the $d$-th entries of the matrix $H$, and $K_h$ is the matrix whose $(i, j)$ element is the covariance function $k_h(\mathbf{z}_i, \mathbf{z}_j)$. Specifically, the covariance function $k_g$ takes the form

$$k_g(\mathbf{z}_i, \mathbf{z}_j; \theta_g) = \nu_g \exp\!\left(-\frac{\lVert \mathbf{z}_i - \mathbf{z}_j \rVert^2}{2 \lambda_g^2}\right), \qquad (4)$$

and $k_h$ is a covariance function whose local smoothness is governed by $g(\mathbf{z})$. In equation (4), the hyperparameter $\theta_g$ consists of $\nu_g$ and $\lambda_g$, referred to as the variance and the length-scale, respectively; each is given a gamma hyperprior, where $\mathrm{Gam}(\cdot \mid a, b)$ denotes the gamma distribution with density

$$\mathrm{Gam}(x \mid a, b) = \frac{b^a}{\Gamma(a)}\, x^{a-1} e^{-bx},$$

and $\Gamma$ is the gamma function, $\Gamma(a) = \int_0^\infty t^{a-1} e^{-t}\, dt$. From equation (7) and Bayes' theorem, the posterior distribution of the latent variable is

$$p(\mathbf{z}_k \mid \mathbf{x}_i) = r_k(\mathbf{x}_i),$$

where $r_k(\mathbf{x}_i)$ is the probability that $\mathbf{x}_i$ is generated by the $k$-th mixture component given $W$ and $\beta^{-1}$.
$r_k(\mathbf{x}_i)$ is given by

$$r_k(\mathbf{x}_i) = \frac{\mathcal{N}(\mathbf{x}_i \mid \mathbf{y}_k, \beta^{-1} I_D)}{\sum_{k'=1}^{K} \mathcal{N}(\mathbf{x}_i \mid \mathbf{y}_{k'}, \beta^{-1} I_D)}.$$

Next, from equation (10) and Bayes' theorem, the conditional posterior distributions of the parameters $\mathbf{g}$, $H$, $\beta$, and $\theta_g$ are derived. Since $K_h$ is a matrix that depends on $\mathbf{g}$, the sampling of $\mathbf{g}$ is performed with the Metropolis-Hastings method [5]: when the current point is $\mathbf{g}^{(t-1)}$, a candidate point $\mathbf{g}^*$ is accepted with probability

$$\alpha = \min\!\left\{1,\; \frac{p(\mathbf{g}^* \mid \cdot)\, q(\mathbf{g}^{(t-1)} \mid \mathbf{g}^*)}{p(\mathbf{g}^{(t-1)} \mid \cdot)\, q(\mathbf{g}^* \mid \mathbf{g}^{(t-1)})}\right\},$$

where $p(\cdot \mid \cdot)$ is the conditional posterior of $\mathbf{g}$ and $q$ is the proposal distribution.
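A minimal sketch of this update for a symmetric random-walk proposal, where the Hastings correction $q(\mathbf{g}^{(t-1)} \mid \mathbf{g}^*)/q(\mathbf{g}^* \mid \mathbf{g}^{(t-1)})$ cancels, is given below. The target density is a placeholder standing in for the conditional posterior of $\mathbf{g}$; the actual GTM-LDLV target involves $K_h(\mathbf{g})$ and is more expensive to evaluate.

```python
import numpy as np

rng = np.random.default_rng(2)

def log_target(g):
    """Placeholder log target density restricted to g > 0; a stand-in
    for the conditional posterior of g in GTM-LDLV."""
    if np.any(g <= 0):
        return -np.inf
    return -0.5 * np.sum(np.log(g) ** 2)  # e.g., independent log-normals

def mh_step(g_curr, step=0.1):
    """One random-walk Metropolis-Hastings update.

    The Gaussian random-walk proposal is symmetric, so the acceptance
    probability reduces to the ratio of target densities.
    """
    g_star = g_curr + step * rng.normal(size=g_curr.shape)
    log_alpha = log_target(g_star) - log_target(g_curr)
    if np.log(rng.uniform()) < log_alpha:
        return g_star, True
    return g_curr, False

g = np.ones(25)  # initial value of g at the 25 nodes
accepted = 0
for _ in range(1000):
    g, ok = mh_step(g)
    accepted += ok
print(f"acceptance rate: {accepted / 1000:.2f}")
```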

Notes on the PTG Algorithm
It should be noted that PTG may produce different visualization results for each trial, even under the same hyperparameter settings. Indeed, PTG with the element data produced a different table in each trial of the algorithm, which implies that PTG reached a different local maximum of the likelihood surface in each trial. PTG tries to fit a lower-dimensional manifold to the shape of the data cloud, and there can be multiple solutions to this task; we therefore expect many local maxima, separated from one another, on the likelihood surface of PTG. This is not counterintuitive, as there need not be a unique optimal arrangement of the elements in a new periodic table. One way to deal with this problem is to run the algorithm multiple times under the same hyperparameter settings and enumerate the resulting visualizations; the final result is then selected from the list of obtained tables based on some selection criterion, as sketched below. In Step 1 of PTG with the element data, we observed that learning became unstable and terminated when a non-informative prior distribution was used for the precision $\beta$. To address this problem, we used a prior distribution for $\beta$ with a small shape and a sufficiently large rate. This prior keeps the variance $\beta^{-1}$ estimated from the posterior distribution above a certain value, which stabilized the learning.
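The restart-and-select strategy described above can be sketched as follows. Here `run_ptg` and `select_best` are hypothetical stand-ins: a real run would fit GTM-LDLV by MCMC, and the selection criterion could be the final likelihood or a downstream prediction error rather than the toy score used here.

```python
import numpy as np

rng = np.random.default_rng(3)

def run_ptg(X, seed):
    """Hypothetical single PTG run: returns a learned table (arrangement)
    and its final log-likelihood. A real run would fit GTM-LDLV by MCMC."""
    local_rng = np.random.default_rng(seed)
    table = local_rng.permutation(X.shape[0])  # toy 'arrangement' of elements
    log_lik = -local_rng.exponential()         # toy local-maximum likelihood
    return table, log_lik

def select_best(X, n_restarts=10):
    """Run PTG several times under identical hyperparameters and keep the
    table that maximizes the selection criterion (here: log-likelihood)."""
    results = [run_ptg(X, seed) for seed in range(n_restarts)]
    return max(results, key=lambda r: r[1])

X = rng.normal(size=(54, 39))  # toy element-by-feature data matrix
best_table, best_ll = select_best(X)
print(best_ll)
```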
In the next section, we describe the details of the analysis procedure and the hyperparameter settings used in this study.

Details of Analysis Procedure
We performed PTG on two different node layouts, namely a square layout and a three-dimensional conical layout.
In the square layout with $L = 2$, we set $K = 25$ in the first step of PTG, in which the $5 \times 5$ nodes were regularly arranged in the latent space; the number of MCMC iterations was set to $10{,}000$ with a burn-in of $5{,}000$ steps; the number of iterations in the third (fine-tuning) step was set to $10$; and PTG was run 10 times under the same hyperparameter settings described above.
To quantitatively evaluate the quality of the periodic tables obtained by PTG under the same hyperparameter settings but in different trials, we used each table as an element descriptor in machine learning tasks. The modelling procedure and the data set are the same as those described in the section 'Quantitative comparison of periodic tables'. We performed five-fold cross-validation on the 12,373 samples for each of the 10 obtained periodic tables. The prediction errors for the 10 periodic tables are shown in Fig. S6.
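A minimal sketch of this evaluation follows: each sample is encoded by the table coordinates of its element, and a five-fold cross-validated prediction error is recorded per table. The regressor (scikit-learn's RandomForestRegressor), the descriptor construction, and the random data are assumptions for illustration only; the study's actual task and 12,373-sample dataset are those described in 'Quantitative comparison of periodic tables'.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)

def table_descriptor(samples, table_coords):
    """Encode each sample by the table coordinates of its element.

    samples: (N,) integer element indices (a real compound would aggregate
    several elements); table_coords: (n_elements, 2) positions on a table.
    """
    return table_coords[samples]

n_elements, N = 54, 500
tables = [rng.uniform(size=(n_elements, 2)) for _ in range(10)]  # 10 PTG trials
samples = rng.integers(0, n_elements, size=N)
y = rng.normal(size=N)  # placeholder target property

for i, coords in enumerate(tables):
    X = table_descriptor(samples, coords)
    scores = cross_val_score(
        RandomForestRegressor(n_estimators=50, random_state=0),
        X, y, cv=5, scoring="neg_root_mean_squared_error",
    )
    print(f"table {i}: RMSE = {-scores.mean():.3f}")
```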

Other Examples
Additionally, we performed PTG on three other node layouts: rectangle, three-