Disentangling Jenny’s equation by machine learning

The so-called soil-landscape model is the central paradigm that relates soil types to their forming factors through Jenny's equation. This is a formal mathematical expression that would, in principle, allow one to infer which soil should be found at a specific geographical location, provided the underlying relationship were sufficiently well known. Unfortunately, Jenny's equation is only a conceptual expression whose intervening variables are qualitative in nature, so it cannot be worked out with standard mathematical tools. In this work, we take a first step towards unlocking this expression, showing how machine learning can be used to predictively relate soil types and environmental factors. Our method outperforms conventional statistical analyses carried out on the same forming factors, defined by measurable environmental variables.


Appendix A SOM Learning Specifics

In Fig. S1 we illustrate the training process of a SOM with a one-dimensional dataset consisting of colors. Basically, a SOM specializes areas of the map to the data (shown as colors in the example) through an adaptive process. The result is a clustered map in which each cluster in the SOM represents a set of related observations. The distance d(x_j, ω_k) between an observation x_j and the unit weights ω_k (also known as the codes or codebooks of each unit k in the SOM) can be defined in several ways; for the classification of numerical data, the Euclidean distance is usually used. The parameters defining a SOM are the unit network topology, the decreasing learning function f_α, the BMU neighbourhood radius, and the number of learning epochs (the number of times the whole dataset is presented to the network).

With the scheme shown in Fig. S1 the SOM is normally used for unsupervised classification, since there is, in principle, no way to label the resulting clusters. However, it is also possible to use SOMs in a supervised way by including an additional layer of neurons. This multi-layer SOM (MLSOM or superSOM) allows classes to be found when the observations (soil profiles in our case) include a label (diagnostic horizons). In this case, a weighted distance over all layers is calculated to determine the winning units during training [1],

D(x_j, ω_k) = λ d(x_j, ω_k) + η h(x_j, ω_k),   (A1)

where λ ∈ R and η ∈ R are two weights determining how the numerical and categorical layers contribute to the distance. Usually, the Euclidean metric is used for d(x_j, ω_k) and the Tanimoto distance for h(x_j, ω_k), as is done in the present work.

Fig. S1 Self-Organized Map training phase (schematic): 1) first, the map is initialized by assigning a random weight ω_k (a color in the example) to each unit k; 2) then a randomly selected observation is presented to the map, and the best matching unit (BMU) is chosen; 3) every unit in the BMU neighborhood minimizes the difference between its weight and the BMU's weight through a distance function; 4) the process is then iterated, and both the learning function and the neighborhood decrease over time. A detailed description is found in Alg. 1.
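To make the training scheme above concrete, the following Python sketch implements a minimal two-layer (supervised) SOM trainer along the lines of Fig. S1 and the weighted distance (A1). It is only a sketch, not the implementation used in this work: the names (train_supersom, n_units, lam, eta, alpha0, radius0) are illustrative, a one-dimensional chain of units stands in for the actual map topology, and the Tanimoto distance is applied directly to the (continuous) codebook weights.

import numpy as np

def tanimoto_distance(a, b):
    # Tanimoto (Jaccard) distance between two non-negative vectors.
    inter = np.sum(np.minimum(a, b))
    union = np.sum(np.maximum(a, b))
    return 1.0 - inter / union if union > 0 else 0.0

def train_supersom(numeric_X, label_X, n_units=25, epochs=100,
                   lam=0.5, eta=0.5, alpha0=0.5, radius0=3.0, seed=0):
    rng = np.random.default_rng(seed)
    n, p = numeric_X.shape
    q = label_X.shape[1]
    # 1) random initialisation of the codebooks (one numeric layer, one label layer)
    w_num = rng.random((n_units, p))
    w_lab = rng.random((n_units, q))
    pos = np.arange(n_units, dtype=float)   # units on a 1-D chain for simplicity
    for t in range(epochs):
        alpha = alpha0 * (1.0 - t / epochs)                 # decreasing learning rate
        radius = max(radius0 * (1.0 - t / epochs), 1e-9)    # shrinking neighbourhood
        for j in rng.permutation(n):
            # 2) weighted distance (A1): lam * Euclidean + eta * Tanimoto
            d_num = np.linalg.norm(w_num - numeric_X[j], axis=1)
            d_lab = np.array([tanimoto_distance(label_X[j], w) for w in w_lab])
            bmu = int(np.argmin(lam * d_num + eta * d_lab))
            # 3) move the BMU and its neighbours towards the observation
            h = np.exp(-((pos - pos[bmu]) ** 2) / (2.0 * radius ** 2))
            w_num += (alpha * h)[:, None] * (numeric_X[j] - w_num)
            w_lab += (alpha * h)[:, None] * (label_X[j] - w_lab)
    return w_num, w_lab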
Appendix B Model Tuning Schematic

The model-tuning workflow is summarized in Fig. S2. For each combination of SOM parameters (1), bootstrap samples are drawn from the training set. Each sample is used to train a SOM with 3/4 of the data of that sample, using the generalised distance (A1) (2). The remaining 1/4 of the data in the sample is mapped to the resulting SOM, yielding the accuracy of the prediction. This is repeated (3) for each sample, and the average performance across all hold-out predictions is calculated (details in Fig. S3). Using a grid-search technique, the optimal set of parameters is found. Finally, the unlabelled test data (4) is used to find the prediction accuracy of the winning model (5).
In the training phase of this workflow (steps 2 and 3) we implement data resampling through the bootstrap method described in Fig. S3.
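The sketch below illustrates this tuning loop, reusing the illustrative train_supersom function from the Appendix A sketch; the parameter grid shown at the end is a placeholder and does not reproduce the ranges actually explored in this work. Hold-out accuracy is computed by mapping each held-out profile to its best matching unit on the numeric layer and reading the predicted class off that unit's categorical codebook.

import itertools
import numpy as np

def tune_som(numeric_X, label_X, grid, n_boot=100, seed=0):
    rng = np.random.default_rng(seed)
    n = numeric_X.shape[0]
    scores = {}
    for values in itertools.product(*grid.values()):        # grid search
        params = dict(zip(grid.keys(), values))
        accs = []
        for _ in range(n_boot):
            sample = rng.choice(n, size=n, replace=True)     # bootstrap resample
            cut = int(0.75 * len(sample))
            fit, hold = sample[:cut], sample[cut:]           # 3/4 train, 1/4 hold-out
            w_num, w_lab = train_supersom(numeric_X[fit], label_X[fit], **params)
            correct = 0
            for x, y in zip(numeric_X[hold], label_X[hold]):
                bmu = int(np.argmin(np.linalg.norm(w_num - x, axis=1)))
                correct += int(np.argmax(w_lab[bmu]) == np.argmax(y))
            accs.append(correct / len(hold))                 # hold-out accuracy
        scores[values] = float(np.mean(accs))                # average over samples
    best = max(scores, key=scores.get)                       # winning combination
    return dict(zip(grid.keys(), best)), scores

# Illustrative grid only; the actual ranges used in the paper are not reproduced here.
grid = {"n_units": [16, 25, 36], "epochs": [50, 100], "lam": [0.3, 0.5, 0.7]}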
There are no significant partial correlations between the group of epip variables and the group of subsup variables (black rectangle marked in the figure). However, our network analysis (see Fig. 2 in the main text) was able to establish a relationship between epipedon and subsoil factors.
To find the associations between the original variables we do the following (see Fig. S5). For each group (e.g. litos) we take the maximum absolute value of the partial correlation pcor outside its off-diagonal group and identify the corresponding original variable. As a result of this method, we obtain the partial correlation values shown in Fig. S6. With the exception of the relationship between slope and profile, the correlations shown there are all contained in the network analysis (Fig. 2 in the main text).
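As one possible reading of this reduction, the sketch below collapses a partial correlation matrix computed on dummy-coded variables to the original variable groups by keeping, for each pair of groups, the signed entry with the largest absolute value outside the diagonal (within-group) block. The inputs pcor and groups are illustrative, and the exact rule through which the asymmetry of Fig. S6 arises may differ from this sketch.

import numpy as np

def collapse_pcor(pcor, groups):
    # pcor   : (m, m) partial correlation matrix over the m dummy variables
    # groups : list of length m mapping each dummy column to its original variable
    names = sorted(set(groups), key=groups.index)
    idx = {g: [i for i, gi in enumerate(groups) if gi == g] for g in names}
    out = np.zeros((len(names), len(names)))
    for a, ga in enumerate(names):
        for b, gb in enumerate(names):
            if ga == gb:
                continue                     # skip the within-group block
            block = pcor[np.ix_(idx[ga], idx[gb])].ravel()
            out[a, b] = block[np.argmax(np.abs(block))]   # signed max-|pcor| entry
    return names, out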

Fig. S2 Best model tuning and classification. For each combination θ of the SOM parameter vector (1), 100 data sets are extracted at random, with replacement, from the training set. Each sample is used to train a SOM with 3/4 of the data of that sample, using the generalised distance (A1) (2). The remaining 1/4 of the data in the sample is mapped to the resulting SOM, finding the accuracy of the prediction. This is repeated (3) for each sample, and the average performance across all hold-out predictions is calculated (details in Fig. S3). Using a grid-search technique, the optimal set of parameters is found. Finally, the unlabelled test data (4) is used to find the prediction accuracy of the winning model (5).

Fig. S3 Data re-sampling in the training phase. First, the data is divided into training and testing chunks (1). Then, for each model parameter combination, n samples are obtained by shuffling the training piece (2). For each parameter combination, the mean accuracy is computed over the corresponding models (acc_1, ..., acc_n) and the winning model is selected (3).

Appendix C Algorithms

In this section we present the two algorithms used for training and classification of the SOM.
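Since the algorithm listings themselves are not reproduced in this extract, the sketch below indicates what the classification step amounts to, under the same illustrative conventions as the Appendix A sketch (numeric codebooks w_num and categorical codebooks w_lab): each unlabelled observation is mapped to its best matching unit on the numeric layer, and its class is read off that unit's categorical codebook. This is a sketch, not the paper's Alg. 2.

import numpy as np

def classify(w_num, w_lab, numeric_X):
    units, labels = [], []
    for x in numeric_X:
        bmu = int(np.argmin(np.linalg.norm(w_num - x, axis=1)))  # winning unit
        units.append(bmu)                                        # cluster assignment
        labels.append(int(np.argmax(w_lab[bmu])))                # predicted class index
    return np.array(units), np.array(labels)

Applied to the held-out test profiles, such a routine yields both the unit each profile maps to and a predicted class, which can then be compared against the true labels to obtain the test accuracy of step (5) in Fig. S2.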

Fig. S5 Illustration of how we obtain the partial correlations for the original variables from the values obtained using the dummy-coded variables. For each group (e.g. litos) we take the maximum absolute value of the partial correlation pcor outside its off-diagonal group and identify the corresponding original variable.

Fig. S6 Partial correlation values among the forming factors. Note that the matrix is not symmetric as a result of the method we have used to derive the pcor values (see Fig. S5). Correlations between epip and subsup are not shown as they are not significant. With the exception of the relationship between slope and profile, the correlations shown here are all contained in the network analysis (Fig. 2 in the main text).