Unsupervised discovery of thin-film photovoltaic materials from unlabeled data

Quaternary chalcogenide semiconductors (I2-II-IV-X4) are key materials for thin-film photovoltaics (PVs) to alleviate the energy crisis. Scaling up of PVs requires the discovery of I2-II-IV-X4 with good photoelectric properties; however, the structure search space is significantly large to explore exhaustively. The scarcity of available data impedes even many machine learning (ML) methods. Here, we employ the unsupervised learning (UL) method to discover I2-II-IV-X4 that alleviates the challenge of data scarcity. We screen all the I2-II-IV-X4 from the periodic table as the initial data and finally select eight candidates through UL. As predicted by ab initio calculations, they exhibit good optical conversion efficiency, strong optical responses, and good thermal stabilities at room temperatures. This typical case demonstrates the potential of UL in material discovery, which overcomes the limitation of data scarcity, and shortens the computational screening cycle of I2-II-IV-X4 by ~12.1 years, providing a research avenue for rapid material discovery.


INTRODUCTION
Solar energy is the most important basic energy among all types of renewable energy 1,2 . The technologies that convert solar energy to electrical power (such as photovoltaic (PV) generation and photoelectrochemical generation, will receive more attention in multi-functional clean energy sources [3][4][5] . Practical thin-film PV cells are based on quaternary chalcogenides (I 2 -II-IV-X 4 ) of sphalerite crystals such as CdTe and Cu(In, Ga)(S, Se) 2 (CIGSSe), which is cheaper to process and exhibit competitive performance levels compared to conventional crystal silicon-based PVs. Their battery power conversion efficiency (PCE) exceeds 20% at present 6,7 . However, these materials require expensive or rare elements (In, Te), or even toxic (Cd), severely limiting their largescale development. Kesterite Cu 2 ZnSn(S, Se) 4 (CZTSSe), is a potential thin-film PV material, in which the In and Ga in CIGSSe are replaced with Zn and Sn, and its record PCE of 12.6% is significantly lower than that of CdTe/CIGSSe 8 . One possible reason for this is the antisite disorder in the kesterite structure, which significantly affects the open-circuit voltage and device performance. CZTSSe, where the smaller Zn of is replaced by Ba with a larger ionic radius (Cu 2 BaSnS 4-x Se x (CBTSSe)) to ease the antisite disorder, has demonstrated better performance in PV in comparison with CZTSSe 9-12 . Continuing this process, replacing Cu with Ag in CBTSSe to form Ag 2 BaSnSe 4 (ABTSe), and replacing Ba with Sr to form Cu 2 SrSnSe 4 (CSTSSe), yields materials that have recently shown great promise with respect to thin-film PV applications 13,14 .
These cases enlighten us that it is worth exploring the wider space of I 2 -II-IV-X 4 chalcogenide semiconductors (where I-, II-, and IV-sites are occupied by the different oxidation states of the cations and X-site is a chalcogenide anion) [15][16][17] , to identify earthabundant, environmentally friendly thin-film PV materials inspired by existing compounds. A possible path toward the discovery of I 2 -II-IV-X 4 materials is to synthesize or theoretically calculate the properties of massive sets of potential structures in terms of element substitution (e.g., I = Li + , Cu + , or Ag + ; II = Ba 2+ or Sr 2+ ; IV = Sn 4+ or Ge 4+ ; X = O 2− , S 2− , or Se 2− ), and then screen for materials with good electro-optical properties. For example, semiconductors with band gaps (E g ) (particularly with direct band gaps) in the visible wavelength region (0.9-1.6 eV, the range of optimum optical conversion efficiency), and strong optical responses in the visible spectrum, are considered promising thinfilm PVs 18,19 . Unfortunately, such an approach is impractical because of the high costs and long cycles of the necessary experimentation and is also not amenable to high-throughput computing due to excessive computational costs. Finding a way to quickly discover I 2 -II-IV-X 4 chalcogenide semiconductors is an important challenge for current research, which is of great significance for identifying thin-film PVs and further improving PCE.
With the rise of machine learning (ML) applications, data-driven approaches to material design and selection have promoted the development of materials science. ML methods extend far beyond the limitations of other current electronic structure analysis methods, to investigate novel, emergent phenomena originating from the complexity of the physical systems [20][21][22][23] . ML technologies, such as deep neural networks (DNNs) 24,25 , support vector machines (SVMs) 26,27 , and random forest (RF) 28 31 . Moreover, there are some remarkable reports on the application of ML methods in PV materials 23,[32][33][34] . However, the applications of ML in these systems use supervised learning, the biggest imperfection of which is that it still requires an adequate data set to ensure the accuracy of predictions. According to our current knowledge base, the data set of I 2 -II-IV-X 4 chalcogenides is still quite scarce, and existing supervised learning models are unable to predict properties based on structures, and a relevant ML model has not been reported with such a small data set. Consequently, a more sensible strategy to overcome the limited data available is urgently needed.
In this study, we proposed an unsupervised learning (UL) model with unlabeled data and apply it to a representative case of exploring the I 2 -II-IV-X 4 chalcogenides for thin-film PV materials. Based on the structure of I 2 -II-IV-X 4 chalcogenides and recombination of elements from the periodic table, a total of 2700 structures (containing 27 identified materials) with four different space groups were selected as the initial data set, and 1520 candidates were screened out based on the tolerance factor. We used an agglomerative hierarchical clustering (AHC) algorithm 35 to accomplish UL, and proposed a descriptor representing the sums and differences of elemental properties to cluster I 2 -II-IV-X 4 chalcogenides. Our unsupervised model clusters I 2 -II-IV-X 4 chalcogenides into one group with suitable E g , while the other groups of materials had larger E g values. Based on the high-precision Heyd-Scuseria-Ernzerhof calculations (HSE06) method, we quantitatively calculated the E g and optical absorption coefficient 36,37 of the selected compounds, and successfully discovered eight I 2 -II-IV-X 4 chalcogenides with good electro-optical properties (Ag 2 Ba-TiS 4 , Ag 2 BaTiSe 4 , Ag 2 BaCrS 4 , Ag 2 BaSiSe 4 , Ag 2 BaZrS 4 , Ag 2 BaZrSe 4 , Ag 2 BaHfSe 4 , and Cu 2 BaMnSe 4 ). We further demonstrated that these chalcogenides have good thermal stabilities at room temperature using ab initio molecular dynamic (AIMD) simulations. The proposed AHC-UL model bypasses the challenge of data scarcity in traditional ML methods and effectively avoids extremely long computational and experimental cycles. Based on the recombination of all elements in the periodic table, the proposed model reduces the screening period of I 2 -II-IV-X 4 chalcogenides by~12.1 years. We hope that the eight I 2 -II-IV-X 4 chalcogenides proposed from 2700 unknown compounds will be served as promising thin-film PVs to significantly improve PCE.

Workflow of material discovery
The workflow for unsupervised discovery of I 2 -II-IV-X 4 chalcogenides for thin-film PVs is illustrated in Fig. 1, including four modules: determination of crystal structures (Fig. 1a-d) 38 , element selection from the periodic table (Fig. 1e), the establishment of the ML model (Fig. 1f), and ab initio calculation (Fig. 1g). In this procedure, considering that the I 2 -II-IV-X 4 chalcogenide has four different space groups, I222, P3 1 , Ama2, and I42m, we selected one structure from each space group as initial structures, as shown in Fig. 1a (see elemental sites and structures in Supplementary Fig.  1). Then, the proper site elements were selected according to the oxidation state and coordination number from the periodic table, where I = Li + , Na + , K + , Cu + , or Ag + ; II = Ca 2+ , Sr 2+ , Ba 2+ , Eu 2+ , or Step 1 Step 2 Step 3 Step 4   Fig. 1e. Thus, 2700 different structures (675 compounds without considering four space groups) were generated as the initial data set for the ML module. We used the tolerance factor (T f ), to make preliminary judgments regarding the structural stabilities 16 , leaving 1520 I 2 -II-IV-X 4 chalcogenides to be further studied (380 compounds without considering space groups). Next, strong relationships between compounds were established by feature engineering, and the 380 structures were clustered based on the AHC-UL algorithm. Finally, 26 candidates covering two space groups (I222 and P3 1 ) were selected from one of the ten groups, corresponding to Fig.  1f. As shown in Fig. 1g, ab initio calculations were performed to predict the electro-optical properties and evaluate the thermal stabilities of the 26 candidates, and eight I 2 -II-IV-X 4 chalcogenides were identified as promising thin-film PV materials. The step-bystep screening process is discussed in Supplementary Fig. 2 and Supplementary Note 1.

Unsupervised learning of I 2 -II-IV-X 4 chalcogenides
The process used here for UL is shown in Fig. 1f, consisting of three parts: data set, feature engineering, and algorithm 39,40 . First, there are five, five, nine, and three elements to choose for the I-, II-, IV-, and X-sites, respectively, which can form 675 different compounds, and 2700 different structures with four space groups.
Then, T f (T I and T IV ) were applied to make a preliminary judgment about the structural stabilities of these compounds. As shown in Fig. 2a, the T f plot of 675 compounds distinguished by X-site is presented, where 380 compounds with 0.94 < T I < 1.22 and 0.84 < T IV < 1.11 (cyan area in Fig. 2a, see Supplementary Fig. 3) are potentially stable (see more details of T f in Methods). Thus, a data set of 380 I 2 -II-IV-X 4 compounds without considering space groups were selected for the next UL clustering. Feature engineering, which can transform raw random data into model training data to be closely related to the output attributes, and determines the upper limit of the ML model. The goal of this work is to determine I 2 -II-IV-X 4 chalcogenides with good electrooptical properties. The band gap E g is a basic parameter of electronic properties, which needs to be considered first. Therefore, we need to build a feature set to create a strong relationship between the compounds and electronic properties. The factors affecting the E g are complex, but to train the model, the feature set must be limited. From previous works 21,41-43 , we found that the elemental properties of materials have good mapping relationships with their band gaps. Therefore, nine elemental properties were selected, including the atomic number (Z), group number (g), covalent radius (R cov ), and first ionization energy (E ie ). The list of all elemental properties is provided in Supplementary Table 2. To construct the feature vectors, we proposed a 10 Table 1). The color bar shows the scale of band gap. descriptor using the sums and differences of elemental properties (SDEPs for short) for I 2 -II-IV-X 4 compounds, and constructed a 72dimensional SDEP in this way for each I 2 -II-IV-X 4 compound (see Methods and Supplementary Fig. 4). The feature plots of the six selected compounds (Cu 2 EuSiS 4 , Li 2 BaGeS 4 , Ag 2 BaGeS 4 , Cu 2 BaSnSe 4 , Cu 2 BaSnS 4 , and Ag 2 BaSnSe 4 ) are shown in Fig. 2b. As the input of clustering, these feature curves look very similar and play an important role in clustering results. From Fig. 2b, the feature curves of the selected six structures are very similar, but there are still significant differences (see dotted boxes), such as the tenth and 20th features, which are also the key to clustering.
Based on SDEPs, the AHC-UL algorithm 35,44 was used to cluster the 380 I 2 -II-IV-X 4 compounds. The bottom-up tree diagram (dendrogram) generated by the AHC is presented in Fig. 2c, where an appropriate partition line is selected and the 380 compounds are classified into ten groups (from G 1 , G 2 , …, to G 10 ). The different colors in Fig. 2c correspond to different groups. More details about the position of the partition line are discussed in Supplementary Note 2 and Supplementary Fig. 5. The grouping shows a good quality of clustering as different groups are well differentiated, and the SDEPs share similar characteristics within the same groups ( Supplementary Fig. 4). Details of the 72dimensional feature vector for each compound are provided in Supplementary Table 3. Therefore, a visible clustering of I 2 -II-IV-X 4 compounds can be found using AHC. From Fig. 2c, d, most of the known I 2 -II-IV-X 4 compounds with E g < 2.0 eV are clustered into G 3 in the dendrogram (Fig. 2d), including ten structures of Ag 2 BaGeSe 4 (E g = 0.85 eV), ABTSe (E g = 1.42 eV), Cu 2 BaGeSe 4 (E g = 1.88 eV), and CBTSSe (E g = 1.96 eV), etc. In addition, G 1 , including Li 2 BaGeSe 4 (E g = 2.4 eV) and Li 2 BaSnS 4 (E g = 3.07 eV), are all known compounds with E g > 2.0 eV. For G 5 and G 8 , each contains only one known compound with E g > 2.0 eV, Li 2 EuGeSe 4 (E g = 2.54 eV), and Li 2 PbGeS 4 (E g = 2.41 eV), respectively. Thus, as confirmed by the above correlations between the groups and band gaps, our proposed SDEPs and UL model can capture the chemical and physical relations of the electronic properties of I 2 -II-IV-X 4 chalcogenides. Besides, we also performed the K-means method to cluster I 2 -II-IV-X 4 compounds, the comparison of AHC and K-means are provided in Supplementary Fig. 6, Supplementary Table 4, and Supplementary Note 3.

Physical insights from unsupervised learning
The clustering of I 2 -II-IV-X 4 chalcogenides by SDEPs provides physical insights into the understanding of compounds exhibiting useful stabilities, proper electronic properties, and suitable crystal structures. We counted the number of compounds in each group, including both unknown compounds and 27 known compounds (Fig. 3a). G 1 and G 10 had the most compounds at 52 each, while G 2 and G 6 contained the fewest compounds at 16 each, indicating  that a targeted study of these groups would significantly narrow down the initial scope, which contained 380 compounds with four space groups. In addition, we generalize and summarize the ratios of known compounds in these groups ( Fig. 3a and Supplementary  Table 4). Remarkably, G 3 contains 36 compounds (Supplementary  Table 4), of which ten are known compounds, accounting for 27.8%, which is much higher than the ratios of known compounds in other groups. In terms of stability, the structures in G 3 are likely to be more stable, as the 27 known compounds have been experimentally synthesized. This is an obvious implication that the unknown 26 compounds in G 3 deserve further targeted study. As shown in Fig. 3b, the violin plot of the ten known compounds in G 3 showed a significantly lower E g , with an average of 1.75 eV and a median of 1.80 eV, and the majority of the 17 known compounds outside of G 3 have significantly higher E g , with an average of 2.27 eV and a median of 2.18 eV. This shows great potential for the discovery of I 2 -II-IV-X 4 compounds with lower E g among the 26 unknown compounds in G 3 with good electrooptical properties. Moreover, from Supplementary Table 5 and Supplementary Fig. 7, we found that the 36 compounds in G 3 showed the same I-site of Ag + or Cu + , II-site of Ba 2+ , and X-site of S 2− or Se 2− , revealing the dependence of their stabilities and electronic properties on their elemental properties. T hese discoveries from UL have important guiding significance for the design of I 2 -II-IV-X 4 compounds with stable and good electronic properties.
In addition to the discovery of elemental and electronic properties, we also found patterns in the crystal structures of the four space groups. The crystal structures of five known compounds with I = Ag + and II = Ba 2+ are shown in Fig. 3c. These structures contain two kinds of similar space groups: I42m (40%) and I222 (60%), where the I-X 4 tetrahedra are flattened and share edges with the II-X 8 dodecahedra, do not bear any resemblance to the square antiprisms observed in the P3 1 and Ama2 structures ( Fig. 1a and Supplementary Fig. 1). In Fig. 3d, where the other five known compounds with I = Cu + and II = Ba 2+ are presented, they happen to contain two other types of space groups, Ama2 (20%) and P3 1 (80%). These phenomena indicate that the ions at the I-site have an effect on the crystal structures, and the stable structures can be hopefully obtained when different chemical formulas are combined with suitable space groups (e.g., when I = Ag + , the space group of I222 is better, while for I = Cu + , the space group of P3 1 is better). To further study the 26 unknown compounds in G 3 and narrow down the screening, we focused on two main space groups; when Ag + is at the I-site, we selected I222, while when Cu + is at the I-site, we selected P3 1 . As a result, the scope of exploration narrowed from 1520 structures to 26 structures (see the list in Supplementary Table 5).
From Supplementary Fig. 4, the dashed boxes show the significant different features in G 3 , which are 10th, 19th, 56th features, indicating that the difference of dipole polarizability between I, II, and IV-site elements and the difference of atomic number between I, or II, or IV and X-site elements are important to electronic property of the I 2 -II-IV-X 4 compounds (Supplementary Table 3). In addition to speeding up the screening process, UL can estimate important features for guiding the designing of highperformance I 2 -II-IV-X 4 compounds. Therefore, the supervised learning method may be further developed to accurately analyze the importance of features.

Electro-optical properties of eight I 2 -II-IV-X 4 chalcogenides
We performed ab initio calculations to predict the electro-optical properties of the 26 I 2 -II-IV-X 4 structures in G 3 . First, geometric structure optimizations were performed, and the crystal structures remained in good geometric arrangements. To achieve high precision prediction, we used the high-level HSE06 calculation, which is considered to be close to the experimental results 37,[45][46][47][48] , to calculate the electronic and optical properties of the screened 26 candidates. From Table 1 and Fig. 4, 20 structures were determined to be semiconductors (E g > 0), with direct ( D ) or indirect ( I ) band gaps. The band structures and density of states   Fig. 4. We found that when I = Ag + , CBMs and VBMs tend to be symmetric and have smaller E g , which may lead to more information for applications such as photocatalysis and bipolar tubes, in addition to thin-film PVs. When I = Cu + , most CBMs and VBMs are asymmetric, and CBMs are generally farther from Fermi level than VBMs. For IV-site, taking Ti, Zr, and Hf as an example, we found that with the increase of atomic number, the positions of the CBMs and VBMs deviate from the Fermi level, leading to the wider band gaps. The corresponding DOSs are also presented in Fig. 5 and Supplementary Figs. 8-23, where CBMs are dominated by I-and X-site atoms, while VBMs are mainly influenced by IV-site atoms. All the CBMs and VBMs show that II-site atoms (Ba 2+ ) do not dominate the scene, which is consistent with the fact that IIsite atoms are indistinctive in the 26 unknown structures (which are all Ba 2+ ) from G 3 . The above results indicate the importance of elements in regulating CBMs and VBMs to impact the E g of I 2 -II-IV-X 4 structures, and UL can effectively cluster similar elemental and conductivity characteristics into one group.
It is worth noting that there are four structures (1.42 eV for Ag 2 BaTiS 4 , 1.18 eV for Ag 2 BaTiSe 4 , 1.33 eV for Ag 2 BaSiSe 4 , and 1.60 eV for Ag 2 BaZrSe 4 , as shown in Fig. 4) having E g between 0.9 and 1.6 eV, which is the range of optimal optical conversion efficiency 18 . In particular, Ag 2 BaTiS 4 , Ag 2 BaTiSe 4 , and Ag 2 BaSiSe 4 have direct E g , this means that the electron transitions do not require phonon release or absorption; As a result, electrons and holes are more likely to recombine. They may be used as potential thin-film PVs. Moreover, Ag 2 BaCrS 4 , Ag 2 BaZrS 4 , Ag 2 BaHfSe 4 , and Cu 2 BaMnSe 4 have band gaps around 0.9 or 1.6 eV, are also likely to have high optical conversion efficiency, we also took them into account. Since PV suitability is the primary motivation for examining these properties, more analysis of the optical properties of these eight structures is necessary. The calculated absorption coefficients (α) of these eight structures based on the HSE06 functional are presented in Fig. 6 and Supplementary  Fig. 24. They all show strong optical responses (α > 10 5 cm −1 ) in the visible spectrum (1.65-3.26 eV, the colorful background in Fig.  6), and the absorption coefficients are largely isotropic in this range, showing only minor variations among || a, || b, and || c directions, potentially indicating that there is only a small performance dependence on film orientation for thin-film PVs. The optimal band gaps and desired optical absorptions of the eight I 2 -II-IV-X 4 chalcogenides show that they have great promise as thin-film PVs and exhibit good performance.
In addition to the eight outstanding candidates for thin-film PVs, the proposed method screens out 12 other I 2 -II-IV-X 4 chalcogenide semiconductors that are far away from the range of 0.9-1.6 eV (the corresponding band structures and DOSs are provided in Supplementary Figs. 8-23), which are also likely to play important roles in PV devices, even in other fields (photocatalysis, sensors, detectors, etc.). There are many suggested improvements to properly adjust their band gaps or optical properties to achieve superior performance.
Our UL model not only overcomes the problem of data scarcity, but also greatly shortens the cycle of I 2 -II-IV-X 4 chalcogenide discovery, positively differing high-throughput calculations by previous works 49,50 . In this work, 27 known structures were excluded from the initial 1520 I 2 -II-IV-X 4 chalcogenides, and 26 candidates were finally screened out. In terms of computational screening, each structure required 260,464 s on a 24-CPU supercomputer ( Supplementary Fig. 25) through the high-precision HSE06 method, meaning that the present work saves~12.1-year computational cycles for 1467 structures. This will provide research export and method for the next generation of thin-film PVs. In addition, structural features also have an important impact on the band gap prediction 51 , but we did not consider them in the clustering due to the lack of data. With the further development of high-throughput computing, more and more I 2 -II-IV-X 4 materials will be available, at which point we will be able to establish supervised learning models to accurately predict band gaps using classification or regression methods.

Thermodynamic stabilities of eight I 2 -II-IV-X 4 chalcogenides
For the eight I 2 -II-IV-X 4 chalcogenides with good electro-optical properties, their thermodynamic stability should be evaluated in addition to the preliminary stability determined by UL and geometric structure optimization in order to facilitate their practical application. Therefore, we performed AIMD to evaluate the thermodynamic stabilities of the screened I 2 -II-IV-X 4 candidates. As shown in Fig. 7 and Supplementary Fig. 26, the total energies of all the systems fluctuate within a very small range without a clear drop or rise during the simulations at 300 K. The crystal structure snapshots were extracted by an interval of 1.0 ps, without obvious expansion or contraction. Moreover, formation enthalpy is an important criterion to test crystal stability 34,52 , thus we also calculated the formation enthalpies for eight I 2 -II-IV-X 4 candidates with two host space groups of I222 and P3 1 . As shown in Supplementary Fig. 27, all structures with a space group of I222 show the lower formation enthalpies than the structures with space group of P3 1 , especially seven structures with I = Ag + show the lower formation enthalpies than experimental Ag 2 BaSnS 4 , indicating their good stability. It is noting that the selected Cu 2 BaMnSe 4 with a space group of P3 1 has a slightly higher formation enthalpy than Cu 2 BaMnSe 4 with a space group of I222 and experimental Cu 2 BaSnS 4 , which may be metastable since the small difference in formation enthalpy and its smooth energy fluctuations during AIMD. The individual energy values (isolated atoms and bulk structures) are provided in Supplementary Table 6.
The above results indicate that the eight I 2 -II-IV-X 4 chalcogenides can maintain the integrity of their crystal structures and good thermal stabilities at room temperature. This indicates that the eight structures selected in this work are stable. We expect that they can be further widely applied in thin-film PVs with good performance.
DISCUSSION I 2 -II-IV-X 4 chalcogenides have become important materials for thin-film PVs. The discovered I 2 -II-IV-X 4 chalcogenides meet the criteria for earth-abundance and environmental friendliness, and demonstrate great potential for improving PV performance. However, traditional approaches, such as experimental synthesis and high-throughput computing, are limited due to the long-time cycle, and even the reliable ML model is impeded by the scarcity of material property data.
In summary, our achievements are as follows: (1) We propose an accessible descriptor of SDEPs based on the isolated elemental properties to obtain the feature vector of I 2 -II-IV-X 4 compounds, which can be expanded to other material systems. (2) eight I 2 -II-IV-X 4 compounds (Ag 2 BaTiS 4 , Ag 2 BaTiSe 4 , Ag 2 BaCrS 4 , Ag 2 BaSiSe 4 , Ag 2 BaZrS 4 , Ag 2 BaZrSe 4 , Ag 2 BaHfSe 4 , and Cu 2 BaMnSe 4 ) with optimal band gaps, desired optical absorptions, and practical thermal stabilities at room temperatures were selected out of 2700 original structures by UL. They demonstrate great potential as thin-film PVs. (3) Because each structure requires an average of 260,464 s with a 24-CPU supercomputer, our method significantly reduced the scope for screening and calculation (from 2700 structures to 1520 structures, to 26 structures), dramatically shortening the computational cycle of material discovery by~12.1 years. (4) This study demonstrates the potential of UL in material discovery, thus surmounting the obstacle of data scarcity, which may lead to important ideas and methods for the future discovery of materials.
Furthermore, we hope that the eight candidates revealed in this work will be synthesized experimentally for the preparation and application of thin-film PVs. For the other 16 I 2 -II-IV-X 4 semiconductors identified, they are also of high research value in different fields according to their band gaps. In our subsequent work, we will focus on finding better descriptors to explore more precise quantitative laws of I 2 -II-IV-X 4 structures, such as the expanded Shannon radii, cell volume 33 . Meanwhile, this work is a typical case of UL in material discovery, we look forward to its vigorous development in materials science.

Tolerance factor
Tolerance factors serve as descriptors for phase stability with quaternary I 2 -II-IV-X 4 semiconductors, which can be used for structure prediction in an empirically driven learning model 16 . For I 2 -II-IV-X 4 materials, two dimensionless tolerance factors (T I and T IV ) describing the geometric relations have been derived recently. The formula is as follows, where r I , r II , r IV , and r X are the ionic radii of the I, II, IV, and X elements, respectively, and ideally T I =  Fig.  2a), and T IV ranges from 0.84 to 1.11 (the blue dashed line in Fig. 2a).

Descriptor of SDEPs
According to our previous work and relevant literature reports 21,41-43 , we built descriptors from properties of isolated atoms at the I-, II-, IV-, and X-sites. The nine properties of atomic number (Z), group number (g), covalent radius (R cov ), Van der Waals radius (R vdw ), valence-electron number (N v ), electron affinity (E ea ), dipole polarizability (D p ), first ionization energy (E ie ), and Pauling electronegativity (X) are considered in this work. For each elemental property (φ), we calculated the minima and maxima of the absolute values of the sums and differences of elemental properties (SDEPs). Succinctly, we introduce the following notations for an elemental property φ: For each of these equations, we can calculate an 18-dimensional vector based on nine properties. Therefore, a 72-dimensional feature vector can be obtained for each I 2 -II-IV-X 4 compound (as shown in Fig. 2b, Supplementary Fig. 4, and Supplementary Table 3). Compared with the 36-dimensional feature vectors obtained from the previous nine independent elemental properties (four site elements), the 72-dimensional feature vectors contain more abundant information, especially the information of differences between elemental properties.

Unsupervised algorithm
UL is accomplished by performing AHC, and the dendrogram function used in AHC is from the SciPy package 35 . In AHC, the similarity between samples is calculated by a similarity measure, and each sample is reconnected step by step in order to form nodes. The nodes are organized into a bottom-up tree diagram hierarchy (dendrogram), where the leaf nodes of the tree represent a single sample, and non-leaf nodes are generally obtained by merging similar or close sample sets. The Euclidean distance (L2) between two I 2 -II-IV-X 4 compounds was used as the similarity metric, and Ward linkage was used to measure group dissimilarity.
The advantage of AHC is that the partition can be stopped at any time, which means that the number of groups (K) can be adjusted dynamically and directly. In this study, K = 10 performs well. More details regarding K can be found in Supplementary Fig. 5 and Supplementary Note 2.

First-principles calculation
All first-principles calculations were conducted using the Vienna Ab initio Simulation Package (VASP) 53 . The Perdew-Burke-Ernzarhof (PBE) generalized gradient approximation (GGA) functionals 54 and project-augmented wave (PAW) atom potentials are employed to perform geometric structure optimizations 55,56 . In this work, for the structures with a space group of I222, the experimental Ag 2 BaSnS 4 (materials id: mp-555166) served as the starting structure for the cell optimizations; for the structures with a space group of P3 1 , the experimental Cu 2 BaSnS 4 (materials id: mp-17954) served as the starting structure, both of which were obtained from the Materials Project 38 . The cutoff energy for the plane-wave basis was set as 500 eV. The structure optimization process was ended when an energy convergence lower than 10 −5 eV and atomic force less than 0.05 eV/Å. Further, the HSE06 functional was performed for electronic structure calculations and optical properties 36,37 , and the high symmetry points of electrons were obtained from the online tool of the seek-path. More details on the calculations are provided in Supplementary Note 4. Further, to see the effect of lattice constant on the band gap 57 , we also optimized eight promising I 2 -II-IV-X 4 materials by using the high-level meta-GGA functional (strongly constrained and appropriately normed semilocal density functional, SCAN) 58,59 , then calculated the band gaps with HSE06 functional, and found that the differences in structures and band gaps are small (see Supplementary Table 7). This indicates that the convergence criterions (energy and atomic forces) we calculated are reasonable.
The AIMD process was used to evaluate the thermal stability, for structure with a space group of I222, a 128-atom 2 × 2 × 2 supercells was built, and for space group of P3 1 , a 196-atom 2 × 2 × 2 supercell was built. There are enough atoms for phase transition simulations 43,60,61 . With a time step of 1.0 fs, a total of 5 ps of kinetic processes were performed for the structures. It is noted that a longer time scale or larger system size would help to build confidence in stability conclusion but that for now, these cell dimensions will work. During this process, the temperature was controlled at 300 K using the Nosé-Hoover thermostat 62,63 .

DATA AVAILABILITY
The starting structures for DFT optimization are available from the Materials Project: https://www.materialsproject.org. The 72-dimensional feature vectors of the 380 I 2 -II-IV-X 4 compounds are available from Supplementary Materials.