Universal image segmentation for optical identification of 2D materials

Machine learning methods are changing the way data is analyzed. One of the most powerful and widespread applications of these techniques is in image segmentation wherein disparate objects of a digital image are partitioned and classified. Here we present an image segmentation program incorporating a series of unsupervised clustering algorithms for the automatic thickness identification of two-dimensional materials from digital optical microscopy images. The program identifies mono- and few-layer flakes of a variety of materials on both opaque and transparent substrates with a pixel accuracy of roughly 95%. Contrasting with previous attempts, application generality is achieved through preservation and analysis of all three digital color channels and Gaussian mixture model fits to arbitrarily shaped data clusters. Our results provide a facile implementation of data clustering for the universal, automatic identification of two-dimensional materials exfoliated onto any substrate.


Introduction
Image segmentation has become an essential technique in fields from medical imaging [1][2][3] and autonomous driving 4 to robotic perception 5 and image compression [6][7][8] .Through unsupervised segmentation of large data sets, trained algorithms can recognize and predict elements of new images.An appealing application of image segmentation is in the thickness identification of two-dimensional (2D) materials from their digital optical microscopy images.Current flake detection methods rely heavily on identification by trained researchers, a human-learning process, in which flakes are identified by their contrast difference on a substrate after significant trial and error.Automatic thickness identification would relieve this tedious, time-consuming screening process and possibly improve identification accuracy.
The easiest implementation of image segmentation for 2D materials is by thresholding.This is performed by analyzing image contrast, from reflectance or transmittance for example, and partitioning regions of an image based on contrast level difference.This technique has been widely and successfully employed in the identification and characterization of exfoliated 2D materials [9][10][11][12][13][14][15][16][17][18][19] .Thresholding techniques, while easily implemented, suffer from inaccuracy when contrast differences become relatively small, for example a single layer of graphene on a silicon/silicon dioxide (Si/SiO 2 ) substrate, and can be highly dependent on precise experimental conditions hindering universal application.
Recently, a variety of more advanced machine learning techniques have emerged to automate and improve the process of identifying exfoliated 2D materials [20][21][22][23][24][25][26][27][28][29][30] .These include techniques based on neural networks and data clustering but have been primarily applied to opaque substrates and almost entirely to standard Si/SiO 2 .Transparent substrates are commonly used for exfoliation or experiments on 2D materials 11,[31][32][33][34][35] .A method that can be applied to identify the thickness of any material on any substrate is highly desirable.
Here we present an open source program 36 written in Python 3 to automatically identify the thickness of exfoliated 2D flakes which can be universally applied to different materials and substrates.We combine three well-established clustering techniques to form a training script to segment layers of a flake, manually label the layers, and then use that training to test thicknesses of other flakes.The program presents roughly a 95% pixel accuracy for graphene and transition metal dichalcogenides on silicon/silicon dioxide and polydimethylsiloxane (PDMS) substrates.Importantly, no change to the program's adjustable parameters are needed to identify different materials on different substrates, allowing simple and universal application to any material/substrate combination.

Overview of the program
An overview of the program applied to graphene on Si/SiO 2 is shown in Figure 1.The training stage begins with a set of optical microscopy images cropped to few-layer flakes whose thicknesses are determined using optical contrast methods 9 .Figure 1(a) shows a cropped optical image of a few-layer graphene flake on a Si/SiO 2 (300 nm SiO 2 ) substrate with each layer thickness labeled (1-4 layers).A scatter plot of the red, green, and blue (RGB) channel values (normalized to a range 0-1) for each pixel (e-f) A color plot of pixel-cluster association (e) and corresponding scatter plot (f).Each pixel is colored based on its most probable data cluster identity.(g-h) Raw optical image of few-layer graphene flake with unknown thickness (g) used for testing and its corresponding RGB scatter plot (h).(i-j) Crop of panel (g) around the flake of interest (i) and corresponding scatter plot after the testing stage (j). in the image is shown in Figure 1(b).The data points have been colored according to their RGB values.At this stage, the scatter plot shows only broad features that can be generally associated with the substrate (pink) and few-layer graphene (purple) but no clear correspondence to individual thicknesses can be made.The raw image is preprocessed using a bilateral filter to reduce noise and a background normalization using a planar fit.The result, after compression to roughly 10,000 total pixels, is shown in Figure 1(c).Preprocessing reveals the individual clusters of data in the scatter plot (Figure 1(d)) associated with the substrate and flake layers.
The location and distribution of these clusters in RGB space are found using a series of unsupervised clustering techniques summarized here and detailed further below.The centers are first located using mean shift and density-based spatial clustering.Once the centers are identified, fit characteristics such as the weight, mean position, and distribution of each cluster are found using a Gaussian mixture model.An image of the result of the fitting algorithm is shown in Figure 1(e).The pixels in the color plot have been colored according to the fit results and the scatter plot (Figure 1(f)) shows these fit assignments in more detail.Once the cluster characteristics are extrapolated for several training images, a master catalogue is created that ties the fit clusters to the predetermined flake thicknesses in our training images.
To determined accuracy of the training, we test the master catalogue on a set of images with identified thicknesses.An example of the testing results for graphene on Si/SiO 2 is shown in Figures 1(g-j).In the testing stage, untrained optical images (Figure 1(g-h)) are first preprocessed using the same procedure as in training but without cropping.The preprocessed images are then checked against the master catalogue for flake thickness assignment given the pixel location in RGB space.Figure 1(i) shows the result (cropped to show detail of flake of interest) with layer thicknesses identified by the color bar.The associated scatter plot in Figure 1(j) shows the corresponding clusters and layer assignments.In the following section we detail the implemenation of the clustering algorithms before presenting results of our program performed on other materials and substrates.

Unsupervised clustering algorithms
Our training script incorporates three clustering algorithms to identify the center of the data clusters and fit their distributions.Without explicitly knowing the number of clusters (layers) in the image, the script begins with an unsupervised method of determining the seed number.We use mean shift [37][38][39] and density-based 40,41 algorithms to first find these cluster centers which are then fed to a Gaussian mixture model for fitting arbitrary ellipsoidal distributions.
Mean shift is an unsupervised machine learning algorithm that locates centers of high data density.The algorithm begins by populating color space with an array of points, referred to as "mean points" ( ρ k ). Figure 2  The next step groups all data points within a defined radius (ε) of each mean point together.We define ε to be just large enough to overlap with its nearest neighbors.The average location of the data pixels within ε of a given mean point becomes the new position of that mean point ( ρ k ) after one iteration of the algorithm.This is calculated by: where x i is the position of each data point and M is the total number of data points within ε of ρ k .In this way, each mean point gradually shifts towards higher densities of data.Figure 2(b) shows one iteration of the algorithm.Several points have moved to their new mean positions according to the data within ε of each ρ k .Mean shift is computationally slow, having to calculate the distance between every data point (≈10,000 pixels) and every mean point (initially 512).To increase efficiency, mean points that have no data within ε after the first cycle, and thus make no contribution towards a data cluster, are deleted.Additionally, mean points may approach their local maximum at different rates.Per mean point, as soon as the number of data points within their radius starts to decrease, they are turned off and no longer involved in future calculations.Figure 2(c) shows the final state of the algorithm where all mean points have converged to their local density maxima.After this, outliers are removed before moving to the next algorithm.
Once mean shift is complete, several mean points will themselves be clustered in color space and some mean points will have converged to outliers.Due to the ellipsoidal shape of the clusters in RGB space after preprocessing, the mean points will tend to lie along lines.An efficient algorithm for grouping these lines is Density-Based Spatial Clustering of Applications with Noise (DBSCAN) 40,41 .DBSCAN groups data together by following the trajectory of nearby points.The algorithm starts by "visiting" a random mean point.A radius (we find ε/2 works well) around it is checked for other mean points.If none are found, the starting point is labeled an outlier.If there are neighbors, they are grouped together.One of the other points in this group is visited next, checking the same radius around itself to find new points to add to the group.This repeats until no new points are added to the group and every point within the group has been visited.Once the group is finished, a new group starts at a randomly chosen mean point and the process repeats.The centers of each group are found by averaging their respective mean points.Figure 2(d) shows the result after running the DBSCAN algorithm on the mean points in Figure 2(c).
The combination of mean shift and DBSCAN presents an unsupervised method of determining how many clusters are in a given image and their centers.Following this step, this information can be used to seed a more powerful clustering technique for data with ellipsoidal distributions.The popular K-means clustering technique [42][43][44] for example is undesirable here as it assumes spherical clusters.Instead, we use a multivariate Gaussian Mixture Model (GMM) 2, 45-49 that allows fitting of data with arbitrary normal distributions.This expands application of the program by automatically handling new materials and substrate combinations that may have different cluster distributions in RGB space.
In the GMM, each fitting ellipsoid has three characteristics developing throughout the process: the weight (φ k , defining the number of data points near ellipsoid k), the centroid ( µ k , defining the mean of the data points belonging to ellipsoid k), and the covariance matrix (Σ k , defining the shape and orientation of ellipsoid k in RGB space).These characteristics are used to calculate the probability γ ik of a data point x i belonging to ellipsoid k.This probability is given by: where K is the total number of clusters and N is the three-variable (for 3-dimensional RGB space) Gaussian distribution given by: The weights, means, and covariance matrices used in these relations are calculated through: First we initialize each of the fitting ellipsoids by setting all initial weights to 1/K.The centroids are taken directly from the results of DBSCAN µ = ρ.The covariance matrices are initialized from the centroids using Equation 6with γ ik = 1. Figure 3(a) shows the initialization of the fitting ellipsoids for our example few-layer graphene data set from Figure 1(d) and Figure 2. The ellipsoids have been scaled to a 95% confidence level.
An unsupervised machine learning algorithm, referred to as expectation-maximization (EM), is used to further optimize the ellipsoid parameters and fit the data.The expectation step determines γ ik based on the initialized weights, centroids, and covariance matrices calculated above.The maximization step uses these probabilities to re-calculate each cluster's weight, centroid, and covariance matrix.These two steps iterate and gradually the ellipsoid parameters converge.Figure 3(b) shows the algorithm results after 2 cycles and Figure 3(c) shows the results after 30 cycles.After 30 cycles, the ellipsoids resemble the distributions of the data with several small tight ellipsoids corresponding to the substrate and 1-4 layers of graphene, and two larger ellipsoids (purple and blue) accounting for noise.The max change of all cluster's weights between maximization steps (∆φ k < 0.0001) is used to define convergence and end the algorithm.Figure 3(d) shows the results of the algorithm after convergence (total 61 cycles) for this data set.Note that the large purple and blue ellipsoids are a product of over-fitting the data (fitting 7 ellipsoids to 5 data clusters).These ellipsoids do not contribute to the master catalogue, but are important for fitting data points associated with thicker layers (> 4) and outliers.The over fitting also allows the primary ellipsoids to confine themselves to the core of their data clusters.Once convergence has been reached, only ellipsoids that fit well to known layer thicknesses are added to a catalogue.
The training process is repeated for multiple flakes of the same material and substrate (we trained around 5 for each material/substrate combination), saving their ellipsoid characteristics into the same catalogue.A master catalogue is then created by averaging together the characteristics of ellipsoids with like-thickness.This master catalogue is the tool with which we can test other images to determine their flake layer thicknesses (Figure 1(i-j)).

General application to other materials and substrates
Our script can be universally applied to the identification of other 2D material thicknesses on opaque and transparent substrates.This generality is achieved by analyzing all three dimensions of the color-space data and fitting the resulting clusters of arbitrary shape with our GMM-EM algorithm.Importantly, no change in the adjustable parameters (ε or GMM convergence) are required for the following results.
Figure 4 displays the power of this generality by identifying the layer thickness of two additional materials, molybdenum disulfide (MoS 2 ) and molybdenum diselenide (MoSe 2 ), on opaque (Si/SiO 2 ) and transparent (polydimethylsiloxane (PDMS)) substrates.MoS 2 on Si/SiO 2 (Figure 4(a-e)) presents clusters very similar to those of graphene on Si/SiO 2 but they are separated further in RGB space (Figure 4(b)).Further training for this material/substrate combination would improve our testing results which only identify layer thicknesses of 1 and 2 (Figure 4(d-e)).From the covariance matrices we note that while all the data clusters are technically triaxial ellipsoids (none of the semi-axes are equal), the clusters for materials on Si/SiO 2 are roughly prolate spheroids with one axis (blue) an order of magnitude larger than the other two semi-axes (red and green).
MoS 2 on PDMS (Figure 4(f-j)) presents clusters again extending along the blue axis, though not as strongly as materials on Si/SiO 2 .The clusters are similarly well-separated in RGB space as they are for MoS 2 on Si/SiO 2 .Testing for this set identifies monolayer, bilayer and trilayer thicknesses (Figure 4(j)).Finally, MoSe 2 on PDMS presents the most spherical ellipsoids of our investigation still slightly extending along blue, (Figure 4(k-o)) and mono-through trilayer thicknesses are easily identified (Figure 4(o)).

Discussion
Our investigation focuses on the development of a program that can be universally applied to different 2D materials and substrates.This requirement invariably introduces computation time when compared with other recent segmentation methods 22,50 .
The training time, for example, reported in Ref. 50 for the entire program is roughly 31 hours.Computation times for the training stage of our program depend on the image composition.A single layer image can take about ten minutes but images with multilayer flakes (more clusters) take as long as 5 hours.Our program results here are from training sets of roughly 10 images corresponding to about 10 hour computation time.However, this is a single event time cost because once the master catalogue is trained for a particular material and substrate combination, it can then be used repeatedly in the testing step, which is more efficient.
Image testing requires roughly one minute to identify layer thicknesses of new images.Computation time is sufficiently short for testing because image pixels are simply compared with the master catalogue.This time would allow in-situ identification of flakes from images taken by human inspection of a substrate.The time spent scanning between images can take several minutes.The time may also be sufficient for an automated scanning system such as that presented in ref. 51 .Improvements in computation time may be sought through further image compression or possibly reducing the testing step to two dimensions of the three-dimensional RGB space, possibly blue and either green or red, similar to algorithms presented in ref. 23 .Although, the dropped color dimension would have to be identified for a particular material/substrate combination.
For each material/substrate combination investigated in this study, the pixel accuracy was determined by creating a ground truth image and comparing it, pixel-by-pixel, with the testing images (see Figure S1 in the supplemental materials for details).Pixel accuracy was slightly better for materials on PDMS but overall the program achieves an average accuracy of 95% for the materials and substrates investigated in this study.This pixel accuracy is comparable to that achieved in studies based on much larger training sets.Ref. 50 reports pixel accuracy of 97% from a training set of 917 images.Based on these results, normalized confusion matrices for each combination were calculated showing the individual layer accuracy as well.Finally, we note that a clear advantage to our approach is the simplicity of our program which relies on well-known and proven clustering techniques with relatively high pixel accuracy from small training sets.

Conclusion
Summarizing, we have presented a code for the automatic identification of flake thicknesses that can be universally applied to a variety of 2D materials and substrates.The algorithm analyzes data clusters in RGB space of preprocessed optical microscopy images.It can accurately identify mono-and few-layer thicknesses with a pixel accuracy of 95%.We anticipate the program will be of use for a wide variety of materials and substrates for the continued interest and investigation into the properties and characteristics of 2D materials.

Figure 1 .
Figure 1.Overview of the program composed of training and testing stages.(a-b) Raw optical image of a few-layer graphene flake on 300 nm of SiO 2 (a) and its corresponding scatter plot of each pixel in RGB color space (b).The scatter plot data points are colored to match their RGB value.(c-d) The result after preprocessing the image in panel (a).The preprocessing reveals well-defined clusters in the associated scatter plot (d).(e-f)A color plot of pixel-cluster association (e) and corresponding scatter plot (f).Each pixel is colored based on its most probable data cluster identity.(g-h) Raw optical image of few-layer graphene flake with unknown thickness (g) used for testing and its corresponding RGB scatter plot (h).(i-j) Crop of panel (g) around the flake of interest (i) and corresponding scatter plot after the testing stage (j).
(a) shows the same data as in Figure1(d) but with red closed circles indicating the initial positions of the equally spaced mean points (an 8 × 8 × 8 array).

Figure 2 .
Figure 2. Mean shift and DBSCAN clustering for identification of cluster centers.(a) The same data as in Figure 1(d) with red closed circles showing the initial positions of the mean points in the mean shift algorithm.(b) Scatter plot after one cycle of mean shift; outlier mean points have been deleted and others have moved towards their local density maxima.(c) The final state of the mean points after they have converged to their maxima.(d) The RGB pixels plotted with the identified cluster centers (colored closed circles) after the DBSCAN algorithm.

Figure 3 .
Figure 3. Cluster fitting with a Gaussian mixture model (GMM).The scatter plot from Figure 1(d) is superimposed with 95% confidence ellipsoids based on the fit characteristics of the GMM-EM algorithm.(a) The initialized ellipsoids show little correspondence to the underlying data.(b) After two cycles of expectation-maximization, the ellipsoids better resemble the data clusters.(c) After 30 cycles, some ellipsoids have nearly converged on their data clusters.(d) The convergence condition is reached after 61 cycles.

Figure 4 .
Figure 4. General thickness identification of 2D materials on opaque and transparent substrates.(a-c) An example training process for MoS 2 flakes exfoliated onto Si/SiO 2 substrates.(a) Raw optical image before preprocessing.(b) RGB scatter plot of the identified clusters with pixels colored according to the cluster they belong to.(c) Reconstructed image of the MoS 2 flake in (a) after the training step.(d-e) Testing process for MoS 2 on Si/SiO 2 .(d) Raw optical image of an MoS 2 flake on Si/SiO 2 .(e) Layer identification after testing the image in (d).(f-h) An example training process for MoS 2 /PDMS.(i-j) An example testing process for MoS 2 /PDMS.(k-m) An example training process for MoSe 2 /PDMS.(n-o) An example testing process for MoSe 2 /PDMS.