Unsupervised vector-based classification of single-molecule charge transport data

The stochastic nature of single-molecule charge transport measurements requires the collection of large data sets to capture the full complexity of a molecular system. Data analysis is then guided by certain expectations, for example a plateau feature in the tunnelling current-distance trace, with the molecular conductance extracted from suitable histogram analysis. However, differences in molecular conformation or electrode contact geometry, the number of molecules in the junction, or dynamic effects may lead to very different molecular signatures. Since their manifestation is a priori unknown, an unsupervised classification algorithm that makes no prior assumptions about the data is clearly desirable. Here we present such an approach based on multivariate pattern analysis and apply it to simulated and experimental single-molecule charge transport data. We demonstrate how different event shapes are clearly separated using this algorithm and how statistics about different event classes can be extracted, even when conventional methods of analysis fail.


Supplementary Figure 5 - Cost function versus the number of clusters for OPE:
Clustering algorithms aim to minimize a cost function associated with the total residual distance that all observations have to their cluster centres. There is usually no exact solution for the number of clusters with unlabelled data. Plotting the cost versus the number of clusters can give an indication of the optimal number of clusters ("elbow" in the graph).
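A cost-versus-cluster-number curve of this kind can be sketched as follows. This is a minimal illustration only: it uses plain k-means with a farthest-first initialisation as a stand-in for the clustering algorithm, and the synthetic three-cluster data, the function names and all parameter values are assumptions made for the example.

```python
import numpy as np


def farthest_first_init(data, k, rng):
    """Pick k starting centres: one random point, then the farthest remaining points."""
    centres = [data[rng.integers(len(data))]]
    for _ in range(1, k):
        d2 = ((data[:, None, :] - np.array(centres)[None, :, :]) ** 2).sum(-1)
        centres.append(data[np.argmax(d2.min(axis=1))])
    return np.array(centres)


def kmeans_cost(data, k, rng, n_iter=100):
    """Run Lloyd's k-means and return the summed squared distance to the centres."""
    centres = farthest_first_init(data, k, rng)
    for _ in range(n_iter):
        d2 = ((data[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # Keep the old centre if a cluster happens to lose all its points.
        new_centres = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    d2 = ((data[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).sum()


rng = np.random.default_rng(0)
# Hypothetical data: three well-separated two-dimensional clusters.
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
                  for c in ((0.0, 0.0), (6.0, 0.0), (0.0, 6.0))])
costs = [kmeans_cost(data, k, rng) for k in range(1, 7)]
for k, cost in enumerate(costs, start=1):
    print(k, round(cost, 1))
```

For data like these, the cost drops steeply up to three clusters and only slowly afterwards; the kink between the two regimes is the "elbow".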

Multiple Molecules in Junction
As discussed in the main text, it is well established that more than one molecule can bridge a junction. In this case the measured plateau current is assumed to be an integer multiple of the single-molecule plateau current, k·I_p, where k is the number of molecules bridging the gap. We investigated the effect on MPVC by creating I(s) traces with conductance plateaus at integer multiples of I_p. To account for the junction formation probability (JFP), the probability of forming a junction with k molecules (i.e. higher conductance) was assumed to be (JFP)^k. In the low-noise data set (panel A), separation with MPVC is straightforward using ΔX: both the plain exponential cluster and the clusters corresponding to multiple-molecule junctions are easily separated. Upon introduction of a STDEV to I_p, the clusters elongate, but separation remains straightforward. When a high noise STDEV is introduced (panel C), the clusters spread out. The 1·I_p cluster and the plain exponential cluster start to overlap, but can still be separated using h_r, as demonstrated in the main text. In all three cases, multiple molecules in the junction can clearly be isolated using the MPVC approach.
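The trace generation described above can be sketched in a few lines. This is an illustrative reconstruction, not the simulation code used for the figures: the exponential background, the fixed break-off distance, the maximum of three molecules, and every numerical value (`jfp`, `i_0`, `beta`, `s_break`, noise levels) are assumptions chosen for the example. The probability of no junction is taken as the remainder, 1 - Σ (JFP)^k.

```python
import numpy as np


def molecule_count_probabilities(jfp, k_max):
    """Return [P(0), P(1), ..., P(k_max)] with P(k) = jfp**k for k >= 1."""
    p_k = jfp ** np.arange(1, k_max + 1)
    p_none = 1.0 - p_k.sum()          # remaining probability: no junction formed
    if p_none < 0.0:
        raise ValueError("jfp is too large for the probabilities to sum to 1")
    return np.concatenate(([p_none], p_k))


def simulate_trace(s, k, i_p, rng, i_p_std=0.0, noise_std=0.0,
                   i_0=100.0, beta=10.0, s_break=1.5):
    """Exponential decay i_0*exp(-beta*s) with a plateau at k*i_p for s < s_break."""
    current = i_0 * np.exp(-beta * s)
    if k > 0:
        plateau = k * rng.normal(i_p, i_p_std)   # trace-to-trace spread in I_p
        intact = s < s_break                      # junction still bridged
        current[intact] = np.maximum(current[intact], plateau)
    return current + rng.normal(0.0, noise_std, size=s.shape)


rng = np.random.default_rng(0)
s = np.linspace(0.0, 3.0, 300)                    # electrode separation (nm)
probs = molecule_count_probabilities(jfp=0.2, k_max=3)
counts = rng.choice(len(probs), size=1000, p=probs)   # molecules per trace
traces = np.array([
    simulate_trace(s, k, i_p=1.0, rng=rng, i_p_std=0.05, noise_std=0.01)
    for k in counts
])
print(probs)
```

Setting `i_p_std` and `noise_std` to zero, small or large values mimics the three noise regimes discussed for the panels.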

MPVC with Density Clustering
Clusters with highly irregular shapes can be challenging to isolate with histogram-based methods. In those cases it can be beneficial to use density-based clustering. The data in Supplementary Figure 2B were analysed with the density-based clustering approach described by Rodriguez. Using this density-based clustering algorithm, all data points are correctly assigned to their cluster.
The disadvantage of the density-based clustering algorithm is that the cluster centres have to be selected manually in the decision graph. The advantage is that the number of clusters is obtained from the decision graph.
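The sketch below shows the two quantities plotted in a decision graph, assuming the density-peak formulation: a local density ρ (number of neighbours within a cutoff) and the distance δ to the nearest point of higher density. It is not the code used for the figure. The cutoff `d_c`, the synthetic two-cluster data and the function names are assumptions, and the manual selection in the decision graph is replaced here by simply taking the two points with the largest ρ·δ.

```python
import numpy as np


def decision_graph(points, d_c):
    """Local density rho and distance delta to the nearest denser point."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    rho = (dist < d_c).sum(axis=1) - 1          # neighbours within d_c, self excluded
    order = np.argsort(-rho, kind="stable")     # indices sorted densest first
    delta = np.empty(len(points))
    parent = np.full(len(points), -1)           # nearest point of higher density
    delta[order[0]] = dist[order[0]].max()      # convention for the densest point
    for rank in range(1, len(order)):
        i, denser = order[rank], order[:rank]
        j = denser[np.argmin(dist[i, denser])]
        delta[i], parent[i] = dist[i, j], j
    return rho, delta, parent, order


def assign_labels(centres, parent, order):
    """Give every non-centre point the label of its nearest denser neighbour."""
    if order[0] not in centres:
        raise ValueError("the densest point must be one of the centres")
    labels = np.full(len(order), -1)
    labels[centres] = np.arange(len(centres))
    for i in order:                             # densest first, so parents are labelled
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels


rng = np.random.default_rng(0)
# Hypothetical data: two compact, well-separated clusters.
points = np.vstack([rng.normal((0.0, 0.0), 0.3, size=(100, 2)),
                    rng.normal((5.0, 5.0), 0.3, size=(100, 2))])
rho, delta, parent, order = decision_graph(points, d_c=0.5)
centres = np.argsort(-(rho * delta))[:2]        # stand-in for the manual selection
labels = assign_labels(centres, parent, order)
```

Cluster centres appear in the decision graph as points with both large ρ and large δ, so the number of such outliers fixes the number of clusters.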

Cluster number
In the manuscript, the Gustafson-Kessel fuzzy c-means (GK FCM) algorithm was used to cluster data.
The GK FCM is an adaptation of the fuzzy c-means algorithm that includes a covariance matrix for each cluster, which allows clusters of more complex shape to be identified. It iteratively optimizes a given cost function to assign observations to the different clusters. 2,3,4,5 One of the downsides of the GK FCM, however, is that the number of clusters must be selected manually. With unlabelled data, the number of clusters is often ambiguous and cannot be calculated exactly, so one needs to rely on visual confirmation or on heuristic methods such as the "elbow method", where the cost is plotted as a function of the number of clusters (Supplementary Figure 5).
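For orientation, the cost function of a fuzzy c-means-type algorithm is commonly written as below. The notation is an assumption introduced here and is not taken from the manuscript: u_ij is the membership of observation x_j in cluster i, v_i the cluster centre, m > 1 the fuzziness exponent, c the number of clusters, N the number of observations and n the dimension of the data.

```latex
J = \sum_{i=1}^{c} \sum_{j=1}^{N} u_{ij}^{m} \, d_{ij}^{2},
\qquad
d_{ij}^{2} = (x_j - v_i)^{\mathsf{T}} A_i \, (x_j - v_i),
\qquad
\sum_{i=1}^{c} u_{ij} = 1 \;\; \text{for all } j

% Standard fuzzy c-means: A_i = I (Euclidean distance).
% Gustafson-Kessel: A_i is built from the fuzzy covariance matrix F_i of cluster i,
A_i = \left[ \rho_i \det(F_i) \right]^{1/n} F_i^{-1},
\qquad
F_i = \frac{\sum_{j} u_{ij}^{m} (x_j - v_i)(x_j - v_i)^{\mathsf{T}}}{\sum_{j} u_{ij}^{m}}
```

Here ρ_i is a fixed cluster volume, usually set to 1. Because A_i adapts to each cluster, elongated or tilted clusters are measured along their own axes, while c still has to be supplied by the user.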

Supplementary Note 4 Reference Vector and Limitations
The choice of the reference vector influences the position and, to some extent, the shape of the data in the polar representation. To illustrate this point, a simulated data set was generated with 1000 graphs, 20 % of which contained plateaus (Supplementary Figure 1, green data). The break-off distance of the plateaus was varied with a STDEV of 1 nm, leading to a long, narrow cluster in the polar plot. The data set was subsequently analysed using three different reference vectors with exponential decay coefficients of 0.5 Å⁻¹, 1 Å⁻¹ and 5 Å⁻¹ (dark to light green in Supplementary Figure 1). The nature of the distribution does not change significantly: in each case the plain exponential cluster (blue) remains at the base of the distribution.
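The dependence on the reference vector can be explored with a short sketch. It assumes, as a simplification, that each trace is treated as a vector and described by its Euclidean distance ΔX to the reference vector and by the angle θ between the two; the exact definitions are given in the main text and may differ. The distance grid, the plateau trace and the function name are hypothetical.

```python
import numpy as np


def polar_coordinates(traces, reference):
    """Distance to, and angle with, the reference vector for each trace (one per row)."""
    delta_x = np.linalg.norm(traces - reference, axis=1)
    cos_theta = (traces @ reference) / (
        np.linalg.norm(traces, axis=1) * np.linalg.norm(reference))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))   # clip guards against rounding
    return delta_x, theta


s = np.linspace(0.0, 20.0, 200)                   # distance grid (Å), assumed
plain = np.exp(-1.0 * s)                          # plain exponential trace, 1 Å^-1
plateau = np.maximum(plain, 0.05 * (s < 10.0))    # same trace with a plateau added
traces = np.vstack([plain, plateau])

results = {}
for beta in (0.5, 1.0, 5.0):                      # decay coefficients (Å^-1)
    reference = np.exp(-beta * s)
    results[beta] = polar_coordinates(traces, reference)
    print(beta, results[beta])
```

With the 1 Å⁻¹ reference the plain exponential trace sits at the origin of the polar plot; the other two references shift both traces, while the plateau trace stays separated from the plain one.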
To point out limitations of the method, a data set was generated consisting of 3000 graphs, 1000 each containing plateaus at 0.5 nA, 2.5 nA and 5 nA, with an s_b STDEV of 1 nm (Supplementary Figure 12). Plateaus with a wide range of currents and break-off distances lead to long, narrow, curved clusters. Although fairly obvious to the human eye, these clusters push histogram-based and k-means clustering methods to their limits. In these cases density-based methods might be advantageous.