Data driven discovery of conjugated polyelectrolytes for optoelectronic and photocatalytic applications

Conjugated polyelectrolytes (CPEs), comprised of conjugated backbones and pendant ionic functionalities, are versatile organic materials with diverse applications. However, the myriad possible molecular structures of CPEs render the traditional trial-and-error strategy for materials discovery impractical. Here, we tackle this problem using a data-centric approach that combines machine learning with high-throughput first-principles calculations. We systematically examine how key materials properties depend on the individual structural components of CPEs, from which structure–property relationships are established. By means of machine learning, we uncover structural features crucial to the CPE properties, and these features are then used as descriptors to predict the properties of unknown CPEs. Lastly, we discover promising CPEs as hole transport materials in halide perovskite-based optoelectronic devices and as photocatalysts for water splitting. Our work could accelerate the discovery of CPEs for optoelectronic and photocatalytic applications.

Recently, narrow bandgap CPEs (NBCPEs) have attracted particular attention as their backbones comprise electron-rich and electron-poor fragments, leading to intramolecular charge-transfer states and facile oxidation and reduction 15. Remarkably, when combined with anionic sulfonate side groups, the NBCPEs could be spontaneously self-doped in aqueous media 12,16-18. In contrast, self-doping is not present in cationic CPEs. This suggests that the interplay between the electrostatic force and the redox chemistry in the semiconducting backbone is key to the optoelectronic properties of NBCPEs, and particularly to stabilizing polaron states. Unfortunately, despite the tantalizing potential of and significant interest in CPEs, systematic studies of CPEs remain scarce, especially from a theoretical and/or first-principles perspective. There is a critical lack of fundamental understanding of how the electrostatic force may influence the optoelectronic properties of CPEs. It is unclear how the structural components (backbone, side chain, ionic group, and counter ion) of CPEs may be individually and collectively tuned to optimize the optoelectronic properties. More importantly, it remains highly challenging to design CPEs for target applications.
In this work, we aim to tackle these challenges using a data-centric approach that combines machine learning with high-throughput first-principles calculations. Although the same strategy has been used for other types of materials 19-24, no such effort has been attempted on CPEs. Our work is further motivated by the need to establish structure-property relationships in CPEs.
Because all structural components, including the donor and acceptor units on the backbone, the ionic identity of the pendant groups, the length of the side chains and the choice of the counter ions, can be individually tuned, a large number of structural variations could emerge, rendering the development of structure-property relationships challenging. On the other hand, the structural modularity of CPEs also makes them amenable to machine learning. Thus, it is of interest to explore whether machine learning can help navigate the high-dimensional space of structural variations and uncover much needed structure-property relationships for designing/screening CPEs. Essential to machine learning is the availability of pertinent data, and in this work we construct a first-principles database of CPEs. In the following, we first provide a brief description of the database and then examine how the frontier orbital energies of CPEs, namely the highest occupied molecular orbital (HOMO) level, the lowest unoccupied molecular orbital (LUMO) level, and the HOMO-LUMO gap (E g ), depend on each structural component. Subsequently, by means of machine learning, we uncover structural features of CPEs underlying the energetics of the frontier orbitals. These features are then used to predict HOMO, LUMO, and E g of "unknown" CPEs beyond the database. Finally, we discover a number of promising CPEs as hole transport materials in halide perovskite-based optoelectronic devices and as photocatalysts for water splitting.
ionic group and an oppositely charged counter ion. The ionic group is responsible for the CPE's solubility in polar solvents, including aqueous media. For a given valence state, the size of the counter ion can also be adjusted 26. Finally, the distance between the ionic functionality and the electronically delocalized backbone can be modulated by the length of the alkyl side chain. The electrostatic interaction between the ionic functionality and charge carriers (polarons) on the conjugated backbone is known to play a crucial role in various applications of CPEs. Therefore, a complete characterization of CPEs requires spanning a five-dimensional space of structural variations (donor, acceptor, alkyl chain length, ionic group, and counter ion), as illustrated in Fig. 1b.
The structural modularity of CPEs enables us to construct a database by exploring the five-dimensional space via high-throughput computation. At present, the database contains over 2000 CPEs whose properties (e.g., structural, electronic, optical, and dielectric properties) have been obtained via first-principles calculations (Supplementary Notes and Methods). Here, we focus on anionic CPEs whose anionic groups are COO, SO 3 , and PO 3 H, and whose counter cations are Li, Na, K, and Cs. Three donor units (CPDT, P, PDT) and nine acceptor units (BT, PT, BBT, FBT, Ph, PhF, Py, PhCN, Tp) are combined to form 27 distinct conjugated backbones for CPEs, as shown in Fig. 1b. The HOMO and LUMO levels of these donor and acceptor units are summarized in Supplementary Table 4. Four different lengths of the alkyl chain (C4, C6, C8, C10) are also included. In sum, 1296 (= 3 × 4 × 3 × 9 × 4) anionic CPEs are considered in the present work.
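The size of this design space can be checked by enumerating the building blocks directly. The following is a minimal sketch in Python; the unit names are taken from the text, while the tuple layout and the function name `enumerate_cpes` are illustrative assumptions and not part of the database itself.

```python
from itertools import product

# Building blocks named in the text; the data layout is an illustrative assumption.
IONIC_GROUPS = ("COO", "SO3", "PO3H")
COUNTER_IONS = ("Li", "Na", "K", "Cs")
DONORS = ("CPDT", "P", "PDT")
ACCEPTORS = ("BT", "PT", "BBT", "FBT", "Ph", "PhF", "Py", "PhCN", "Tp")
CHAIN_LENGTHS = (4, 6, 8, 10)  # number of carbon atoms in the alkyl side chain


def enumerate_cpes():
    """Return every (donor, acceptor, chain length, ionic group, counter ion) tuple."""
    return list(product(DONORS, ACCEPTORS, CHAIN_LENGTHS, IONIC_GROUPS, COUNTER_IONS))


cpes = enumerate_cpes()
# A backbone is identified by its donor/acceptor pair.
backbones = {(donor, acceptor) for donor, acceptor, *_ in cpes}
variants_per_backbone = len(cpes) // len(backbones)

print(len(backbones), variants_per_backbone, len(cpes))
```

Each backbone therefore carries the same number of ionic-functionality variants (ionic group × counter ion × chain length), which is what makes per-backbone comparisons of property spreads meaningful.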

Establishing structure-property relationships
The HOMO/LUMO levels and E g are the three most important properties of CPEs in photovoltaic and photocatalytic applications, and are thus the focus of our attention. HOMO and LUMO levels are often used as proxies for the ionization potential and electron affinity, which can be measured experimentally. Note that E g is the HOMO-LUMO gap (or transport gap), shown schematically in Fig. 2a. We have systematically examined the HOMO/LUMO levels and E g for all 1296 anionic CPEs, and their values are summarized in Fig. 2b. The HOMO/LUMO levels are shifted relative to the vacuum level so that they can be compared among the CPEs. The vacuum level is defined as the potential energy in the vacuum region of the simulation box (Supplementary Fig. 3). The PBE functional was used in these calculations to extract the general trends. For each backbone, the data span the combinations of ionic group (3), counter ion (4) and length of alkyl chain (4). The first observation drawn from Fig. 2b is that for each backbone, the variation (~0.5 eV) in the HOMO/LUMO level is much larger than that (~0.1 eV) in E g . This suggests that, in contrast to HOMO and LUMO, E g is primarily determined by the backbone. However, the electrostatic force due to the ionic functionality can significantly modulate HOMO and LUMO for a given backbone. The second observation is that while many CPEs here have narrow bandgaps (E g < 2 eV), some exhibit bandgaps wider than 2.5 eV. These wider gap CPEs all possess P as their donor unit (and Ph, PhF, and Py as their acceptor units) on the backbones. The lowest transport gap of these CPEs is 1.32 eV. Thus, in order to design NBCPEs, we need to identify suitable donor/acceptor combinations in the backbone; tuning the ionic functionality alone would not reach the goal. However, one can span a wide range of HOMO (−5.77 to −4.17 eV) and LUMO (−3.71 to −1.77 eV) levels by tuning the ionic functionality of the CPEs.
Next, we will examine in detail how the various structural components can be tuned to modulate HOMO, LUMO, and E g .
To examine the effect of the alkyl chain length on HOMO/LUMO and E g , we consider a neutral CPE (termed CPE*), which is a CPE stripped of its ionic functionality. Two sets of CPEs* are studied here, each comprised of 27 CPEs*. One set has C4 chain length while the other has C6 chain length. We compare the values of HOMO/LUMO levels and E g between the two sets, with each data point in Fig. 3a corresponding to a CPE* with the same backbone but different alkyl chain lengths (C6 vs. C4). Interestingly, there is a negligible energy difference between the two sets of CPEs*. The same results are also found between CPEs* with C8 and C10 alkyl chain lengths, suggesting that the chain length has a negligible effect on the energetics of the neutral CPEs.
We next examine the dependence of HOMO, LUMO, and E g on the donor and acceptor units of the backbone. In Fig. 3b, the HOMO and LUMO charge densities of CPE-K* are displayed with VESTA 27. It is observed that while the LUMO of CPE-K* resides primarily on the acceptor, the HOMO spreads over both the donor and the acceptor. Bader analysis was used to compute the charge distribution on the donor and acceptor units upon removing an electron from or adding an electron to CPE-K*. With the removal of the electron, the positive charges on the donor and the acceptor are 0.63e and 0.37e, respectively, and when adding an electron, the negative charges on the donor and the acceptor are −0.31e and −0.69e, respectively. Thus, it is reasonable (but not precise) to correlate the HOMO level of CPE* with the HOMO level of the donor (HOMO D ) and the LUMO level of CPE* with the LUMO level of the acceptor (LUMO A ). In Fig. 3c, we plot the HOMO level of CPE* vs. HOMO D for the three donors CPDT, P, and PDT, and approximately linear relationships are found for all nine acceptor units. Thus, there exists a strong positive correlation between the HOMO level of a neutral CPE and the HOMO level of its donor unit. Similarly, as shown in Fig. 3d for each donor, there is a positive correlation between the LUMO of CPE* and LUMO A . As E g = LUMO − HOMO, a positive but weaker correlation is also identified between E g and LUMO A − HOMO D in Fig. 3e. While not further pursued in this paper, these correlations can provide valuable guidance in designing donor/acceptor combinations for targeted applications. For example, since E g tracks LUMO A − HOMO D , donors with high HOMO levels combined with acceptors with low LUMO levels would likely form CPEs with low bandgaps.
Next, we explore how the electrostatic force may affect the HOMO/LUMO levels of CPEs. To this end, we focus on CPEs with a fixed backbone (CPDT-BT) and anionic group (COO), but with different alkyl chain lengths (C4, C6, C8, C10) and counter ions (Li, Na, K, Cs). The varied electronegativities of the counter ions and the Coulomb interaction distances set by the chain lengths modulate the electrostatic force that the polaron in the conjugated backbone may experience. Three trends emerge: (1) For a given chain length, the HOMO level decreases with increasing electronegativity of the counter ion; the smaller the electronegativity, the higher the HOMO.
(2) For a given counter ion or electronegativity, the HOMO level decreases monotonically with the chain length; the longer the alkyl chain, the lower the HOMO level. (3) The HOMO level of the neutral CPE* (HOMO*) represents an asymptotic limit of the HOMO level for all anionic CPEs; in the limit of an infinite alkyl chain length, HOMO = HOMO*. In this case, the electrostatic force vanishes. All these trends can be rationalized from an electrostatic point of view. The electrostatic interaction between a charged ion (or ionic group) and a π-orbital on the backbone is illustrated in Fig. 4b. The π-orbital of the neutral CPE* is represented by the dashed contour. When a CPE is surrounded by charged ions, the π-electrons on its backbone are attracted by the positive ions, which lowers their energies, while the negative ions repel the π-electrons and increase their energies 28. This effect is purely electrostatic and becomes more pronounced as the ion-backbone distance decreases. While both the anionic group (COO − ) and the counter ion (Li, Na, K, Cs) could polarize the π-orbital, the anionic group tends to have a stronger effect than the counter ion since it is closer to the backbone. The smaller the electronegativity of the counter ion, the greater the charge transfer from the counter ion to the anionic group; thus, the anionic group is more negatively charged, leading to a higher HOMO level. In the same vein, the shorter the chain length, the stronger the polarization, and the higher the HOMO level. The polarization disappears in a neutral CPE, thus HOMO* < HOMO.
The same general trends also hold in CPEs with other backbones and ionic groups, as shown in Supplementary Fig. 7. But there are also notable exceptions. For example, with the smallest counter ion (Li) and the SO 3 group, the HOMO level is actually lower than HOMO*. Additionally, CPEs with the shortest chain length (C4) and smaller counter ions (Li and Na) exhibit lower HOMO levels, as shown in Supplementary Fig. 8. These anomalies arise because the counter ions are situated closer to the backbone than the ionic group. Since the smaller counter ions (Li and Na) are more mobile, it is easier for them to circumvent the anionic group and to come near the backbone. This scenario is more likely to occur when the chain length is short, as the counter ion can bond with two adjacent anionic groups simultaneously, allowing the counter ion to draw closer to the backbone. The LUMO level exhibits trends similar to those of the HOMO (Supplementary Fig. 9), which is expected since E g shows a weak dependence on the electrostatic force.
The preceding analysis has demonstrated that the frontier orbitals and E g of CPEs can be modulated by the HOMO/LUMO levels of the donor and acceptor units in the backbone, the alkyl chain length, and the electronegativity of the ionic functionality. However, owing to complex correlations among these quantities, concrete structure-property relationships, necessary for the rational design of CPEs, are difficult to establish. Thus, in the following, we resort to machine learning to derive appropriate structure-property relationships for predicting CPE properties. Since each ionic group has an internal structure, the electronegativity alone would not be sufficient to represent the characteristics of the ionic group. Hence, we also include its coordination number (CN) in the machine learning modeling. Specifically, the CN of each of the three ionic groups (COO, SO 3 […] Tables S4 and S6. The Pearson correlation coefficient (p) is commonly employed to represent a linear relationship between a given feature and a property in machine learning analysis, and it is defined as follows 29:
$$p = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$
where [X i ] and [Y i ] are the corresponding datasets for the feature and the property, respectively, and n is the number of CPEs in the dataset. X̄ (Ȳ) is the averaged value of the feature (property). A greater positive (or negative) p value indicates a stronger correlation (or anti-correlation) between the property Y and the feature X. While the Pearson correlation coefficient analysis is simple and transparent, it is not always accurate, especially when nonlinear relationships between features and properties are important. To remedy this deficiency, we also perform Random Forest Regression (RFR) in our analysis to complement the Pearson coefficient. RFR is a supervised learning algorithm that uses an ensemble learning method for regression and can capture nonlinear dependences, and thus is more reliable than the linear analysis.
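To make the limitation of the linear analysis concrete, the Pearson coefficient can be computed directly from its standard definition. The sketch below is a plain-Python illustration; the function name `pearson` and the toy data are assumptions made for this example and are not taken from the CPE database.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between a feature list and a property list."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 entries")
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    x_spread = sum((x - x_mean) ** 2 for x in xs)
    y_spread = sum((y - y_mean) ** 2 for y in ys)
    if x_spread == 0 or y_spread == 0:
        raise ValueError("a constant sequence has no defined correlation")
    return covariance / sqrt(x_spread * y_spread)


# Toy data (hypothetical): a perfectly linear and a purely quadratic dependence.
feature = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear_property = [3.0 * x + 1.0 for x in feature]
quadratic_property = [x * x for x in feature]

p_linear = pearson(feature, linear_property)
p_quadratic = pearson(feature, quadratic_property)
print(p_linear, p_quadratic)
```

In the quadratic case the property is fully determined by the feature, yet the linear coefficient vanishes. This is the kind of dependence that a nonlinear ranking such as the RFR feature importance is meant to pick up.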
With RFR, we can rank the relative importance of each feature 30, represented by the feature importance variable Q (see Supplementary Methods). In Fig. 5, we summarize the Pearson correlation coefficients (p) and feature importance (Q) of the eight features for the HOMO/LUMO levels and E g . On the one hand, the results in Fig. 5a confirm what we already know, i.e., E g depends primarily on the backbone, with HOMO D and LUMO A playing the most important roles. On the other hand, the machine learning analysis can also reveal more concrete and sometimes hidden correlations. For instance, LUMO D and to a lesser extent HOMO A are found to be important to E g . While LUMO D , HOMO A and LUMO A are correlated with E g , HOMO D is anti-correlated with E g . As discussed earlier, the HOMO of a CPE is expected to depend on HOMO D . Interestingly, however, we also find that the HOMO of a CPE depends strongly on LUMO D and HOMO A , as shown in Fig. 5b. In fact, from the RFR analysis, HOMO A is revealed to be as important as HOMO D in modulating the HOMO of the CPE. This finding is consistent with Fig. 3b, which shows that the HOMO of CPE-K* spreads over both the donor and acceptor units. In addition, as displayed in Fig. 5c, the linear Pearson analysis grossly overestimates the importance of HOMO A in determining the LUMO of a CPE, as compared to the RFR analysis. To our surprise, the donor unit appears to have negligible influence on the LUMO level of the CPE. The RFR analysis indicates that the LUMO level depends almost exclusively on LUMO A , while the HOMO has a roughly equal dependence (but with different signs) on HOMO D , LUMO D and HOMO A . The latter reflects the significant interaction/charge transfer between the donor and the acceptor on the backbone.
Although the ionic group, chain length and counter ion individually play a minor role in determining the HOMO/LUMO levels, collectively they can change the HOMO/LUMO levels by as much as 0.5 eV, and thus cannot be discounted in the machine learning analysis. Among them, the alkyl chain length (L) and the electronegativity of the ionic group (χ−) are less important to the HOMO/LUMO levels. In particular, χ− has negligible influence on them, and thus is removed from the further analysis. In contrast, the coordination number (CN) of the ionic group outranks L, χ− and χ+ in importance, suggesting that the design of the ionic group should not be taken lightly.

Predicting CPE properties via machine learning
Our goal in machine learning is to provide guidance for materials design, i.e., to discover CPEs for target applications. To accomplish this goal, we need to uncover material features that can be directly linked to the molecular structures of CPEs. In addition, we aim to predict materials properties (HOMO, LUMO and E g ) for a given CPE structure. Progress is made on both tasks here. Since the HOMO/LUMO of the donor and acceptor units are not directly linked to the molecular structures of CPEs, they are removed from the feature list. For the first task, we include the following revised features to describe the donor and acceptor units (Supplementary Table 6) in the machine learning analysis: the degree of unsaturation (defined as the number of rings and π bonds) in the donor and acceptor units (DU D , DU A ), the differential electronegativity between a non-carbon atom and a carbon atom on the ring of the donor and acceptor unit (ΔD 1 , ΔA 1 ) (i.e., ΔD 1 and ΔA 1 are zero if all atoms on the ring are carbon), and the differential electronegativity between a substituent group and a hydrogen atom in the donor and acceptor unit (ΔD 2 , ΔA 2 ) (i.e., ΔD 2 and ΔA 2 are zero if there is no substituent group). Since ΔD 2 is the same for the three donors in the database, this feature is removed in the following analysis, but should be included in general. In addition to the five features, the same ones used to represent the ionic functionality (CN, L, χ− and χ+) are included in the analysis. More information on the degree of unsaturation and differential electronegativity can be found in Supplementary Methods. In Fig. 6a-c, we present the Pearson correlation coefficient p and the feature importance Q for E g and HOMO/LUMO levels using the revised features. As expected, the five features characterizing the donor and acceptor units are all important to E g . 
While DU D is positively correlated with E g , the other features (DU A , ΔD 1 , ΔA 1 and ΔA 2 ) are anti-correlated with E g . Thus, in order to lower E g , one needs to reduce the degree of unsaturation on the donor. As shown in Supplementary Table 4, with the degree of unsaturation (DU D ) decreasing from P to PDT and to CPDT, the corresponding HOMO D increases. According to Fig. 5b, c, the HOMO of CPEs is positively correlated with HOMO D while the LUMO of CPEs has negligible dependence on HOMO D . Thus, a decrease of DU D would result in a reduction of the bandgap. Other ways to reduce E g are increasing the degree of unsaturation on the acceptor, increasing the electronegativity of the non-carbon atoms on the ring, and increasing the electronegativity of the substituent group. The three most important features for HOMO are DU D , ΔD 1 and ΔA 2 . Thus, one can increase the HOMO level by lowering the degree of unsaturation on the donor and the electronegativity of the substituent and by increasing the electronegativity of non-carbon atoms on the ring. The most important features for the LUMO level are all related to the acceptor; to increase LUMO, one can reduce the degree of unsaturation, the electronegativity of non-carbon atoms on the ring and the electronegativity of the substituent group of the acceptor. These structure-property relations can provide valuable guidance for the rational design of CPEs.
For the second task, we have used machine learning to predict HOMO/LUMO levels for CPEs beyond the database. Here, the Support Vector Regression (SVR) method as implemented in "scikit-learn" 31 was used for the regression of the HOMO and LUMO levels (see Supplementary Methods). Specifically, a 10-fold cross-validated SVR approach with grid search was used, and its performance was evaluated by the mean absolute percentage error (MAPE), the root-mean-square error (RMSE) and the correlation coefficient (R 2 ) for the prediction of HOMO/LUMO levels, as shown in Supplementary Fig. 10. We find that the performance of the SVR regression model is stable and robust against different randomizations. In Fig. 6d, e, we compare the HOMO/LUMO levels from the machine learning predictions with the first-principles results, with 1046 CPEs in the training set and 250 CPEs in the test set. It is found that the machine learning model performs well, with MAPE less than 2.1%, RMSE less than 0.07 eV and R 2 larger than 0.97 (Table 1). To further test the machine learning model in predicting HOMO/LUMO levels beyond the database, we also constructed a verification dataset (Supplementary Tables 6 and 7, Supplementary Fig. 11), which has not been previously used in the training and test sets. With all known CPEs (1296) in the database as the training set, we can predict the HOMO/LUMO levels of CPEs in the verification set, and the results are summarized in Fig. 6d, e and Table 1. In general, we find that the machine learning model can provide reasonable estimates for the HOMO/LUMO levels, and the revised structural features can be used as descriptors in predicting the properties of unknown CPEs.
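The three error metrics quoted above can be sketched in a few lines of plain Python. The exact definitions used in this work are given in the Supplementary Methods; the forms below are the common textbook ones and are assumptions here, in particular R 2 is taken as the coefficient of determination. The function names and the two-point toy data are hypothetical.

```python
from math import sqrt


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (reference values must be nonzero)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root-mean-square error, in the units of the inputs (eV for orbital levels)."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical first-principles (reference) and predicted HOMO levels, in eV.
reference = [-5.0, -4.0]
predicted = [-5.1, -4.0]

print(mape(reference, predicted), rmse(reference, predicted), r_squared(reference, predicted))
```

Because HOMO/LUMO levels referenced to the vacuum level are several eV in magnitude, a MAPE of about 2% and an RMSE below 0.1 eV describe errors of comparable size, so reporting both is not redundant but mutually consistent.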

Screening CPEs for optoelectronic and photocatalytic applications
CPEs have found a wide range of applications from optoelectronics to photocatalysis. However, the rational design of CPEs has yet to benefit from data-centric research. In this work, we use machine learning to screen/discover CPEs for two selected applications. For the first application, we focus on the design of hole transport materials (HTMs) in metal-halide perovskite based optoelectronic devices, such as photovoltaics and light-emitting diodes 5,7,10,32-34.
For the second application, we aim to design photocatalysts for overall water splitting 13,14,35 . Both applications require the design of CPEs with appropriate HOMO, LUMO, and E g values, along with other relevant properties.
For the first application, the ideal HTMs should have appropriate energy levels so that the holes can be transferred from the active layer to the electrode efficiently while the electrons are blocked 32,36. The solution-processed pH-neutral CPEs have excellent compatibility and wetting properties, which enable the formation of a stable, defect-suppressed and uniform interface with the perovskite active layer 37. Here, three design criteria are considered for the screening of CPEs: stability, energy level alignment and hole mobility. Firstly, the CPEs should have a higher energetic stability than CPE-K, which has been used as an HTM in conjunction with perovskites. We therefore require the candidate to have an atomization energy lower than that (−0.239 eV) of CPE-K. Note that this is not a critical criterion, but it helps to reduce the number of candidates that we have to consider. Secondly, we require that the HOMO level of the candidate be slightly higher than the valence band maximum (VBM) of the perovskite, but lower than the VBM of the ITO (indium tin oxide) electrode (Fig. 7). Thus, photogenerated holes in the active layer can be efficiently extracted. At the same time, in order to block the electron transfer from the perovskite to the HTM, the LUMO of the candidate should be lower than that of the perovskite (~3.7 eV). Thirdly, we require that the candidate have a lower reorganization energy than CPE-K (0.186 eV). This requirement derives from the consideration of charge-transfer rates (K) in HTMs, which can be described by the Marcus expression:
$$K = \frac{V^2}{\hbar}\sqrt{\frac{\pi}{\lambda k_B T}}\,\exp\left(-\frac{\lambda}{4 k_B T}\right)$$
Here ℏ, T and k B are the reduced Planck constant, temperature and Boltzmann constant, respectively; λ is the reorganization energy and V is the electronic coupling. For polymers including CPEs, the reorganization energy is often the dominant factor in the charge-transfer rate, and the smaller the reorganization energy, the higher the charge-transfer rate 40.
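The sensitivity of the charge-transfer rate to the reorganization energy can be illustrated numerically. The sketch below assumes the standard Marcus hopping-rate form K = (V²/ℏ)·sqrt(π/(λ k_B T))·exp(−λ/(4 k_B T)); the function name `marcus_rate`, the coupling of 0.01 eV and the comparison value of 0.150 eV are hypothetical choices made only for illustration, while 0.186 eV is the CPE-K value quoted in the text.

```python
from math import exp, pi, sqrt

HBAR_EV_S = 6.582119569e-16     # reduced Planck constant in eV*s
K_B_EV_PER_K = 8.617333262e-5   # Boltzmann constant in eV/K


def marcus_rate(coupling_ev, reorg_ev, temperature_k=300.0):
    """Marcus hopping rate in 1/s for electronic coupling V and reorganization energy lambda."""
    if reorg_ev <= 0 or temperature_k <= 0:
        raise ValueError("reorganization energy and temperature must be positive")
    thermal_ev = K_B_EV_PER_K * temperature_k
    prefactor = coupling_ev ** 2 / HBAR_EV_S * sqrt(pi / (reorg_ev * thermal_ev))
    return prefactor * exp(-reorg_ev / (4.0 * thermal_ev))


# Same (hypothetical) coupling, two reorganization energies:
# 0.186 eV is the CPE-K reference; 0.150 eV stands in for a screened candidate.
rate_reference = marcus_rate(coupling_ev=0.01, reorg_ev=0.186)
rate_candidate = marcus_rate(coupling_ev=0.01, reorg_ev=0.150)
print(rate_candidate / rate_reference)
```

At fixed coupling, both the prefactor and the exponential decrease as λ grows, so the rate falls monotonically with the reorganization energy. That is why λ alone can serve as a screening proxy for hole mobility when V is not computed for every candidate.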
The reorganization energy for the holes in CPEs can be estimated as follows 40:
$$\lambda = \left(E_0^{+} - E_{+}^{+}\right) + \left(E_{+}^{0} - E_0^{0}\right)$$
where $E_0^{+}$ is the energy of the cation in the optimized neutral state structure; $E_{+}^{+}$ is the energy of the cation in the optimized cationic structure; $E_{+}^{0}$ is the energy of the neutral molecule in the optimized cationic structure, and $E_0^{0}$ is the energy of the neutral molecule in the optimized neutral state structure. Thus, the candidate should have λ less than that (0.186 eV) of CPE-K. Finally, the candidate should have polar functionality, which is a condition satisfied by all CPEs in the database. Among ~1300 anionic CPEs in the database, 72 candidates met the above criteria. Among them, the 5 candidates with the lowest reorganization energies are included in Table 2. It is interesting to note that a CPE with a low reorganization energy generally has a longer alkyl chain than CPE-K (C4). As shown in Supplementary Fig. 12, for a given backbone (CPDT-BT) and functionality (SO 3 K), the hole reorganization energy decreases as the alkyl chain length increases. Therefore, we propose increasing the alkyl chain length relative to CPE-K to synthesize more efficient HTMs.
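The four energies entering the hole reorganization energy combine as two relaxation terms, one on the cationic and one on the neutral potential energy surface. The following minimal sketch assumes the usual four-point scheme, λ = (cation at neutral geometry − cation at cation geometry) + (neutral at cation geometry − neutral at neutral geometry); the function name and the total energies are hypothetical placeholders, not values from the database.

```python
def hole_reorganization_energy(
    cation_at_neutral_geom,
    cation_at_cation_geom,
    neutral_at_cation_geom,
    neutral_at_neutral_geom,
):
    """Four-point hole reorganization energy (same energy unit as the inputs)."""
    # Relaxation of the cation after vertical ionization of the neutral molecule.
    cation_relaxation = cation_at_neutral_geom - cation_at_cation_geom
    # Relaxation of the neutral molecule after the hole has left the cation.
    neutral_relaxation = neutral_at_cation_geom - neutral_at_neutral_geom
    return cation_relaxation + neutral_relaxation


# Hypothetical total energies in eV, chosen only to exercise the formula.
lam = hole_reorganization_energy(
    cation_at_neutral_geom=-99.90,
    cation_at_cation_geom=-100.00,
    neutral_at_cation_geom=-104.92,
    neutral_at_neutral_geom=-105.00,
)
CPE_K_REFERENCE_EV = 0.186  # reference value quoted in the text
print(lam, lam < CPE_K_REFERENCE_EV)
```

Both terms are non-negative when each geometry is a true minimum of its own charge state, which gives a quick sanity check on the underlying structure optimizations.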
CPEs have also received recent attention as photocatalysts for water splitting. Some of the advantages that CPEs offer as photocatalysts include tunability, low cost with earth-abundant and non-toxic materials, and water solubility 13,14,35,41. In addition, the polar side chains of CPEs can interact strongly with metal electrodes and/or metal co-catalysts, yielding efficient charge extraction/transfer to the electrodes and the co-catalysts 42. However, so far the photocatalytic activities of CPEs remain low.
To serve as photocatalysts for overall water splitting, the HOMO and LUMO levels of CPEs should straddle the redox potentials of water 43. As the overpotential for an efficient hydrogen evolution reaction (HER) and oxygen evolution reaction (OER) catalyst is about 0.2 eV 44,45, we require the candidate to have its HOMO level lower than −5.87 eV and its LUMO level higher than −4.24 eV, to ensure that the photo-induced hole and electron can overcome the overpotential 46. In addition, the bandgap of the candidate should be less than 2.4 eV for efficient light absorption. Finally, we require the candidate to have an atomization energy lower than that of CPE-K. Based on these criteria, 17 candidates are identified for photocatalytic water splitting, and the 5 of them with the lowest E g are included in Table 2. We find that all promising candidates have the "P" molecule as the donor unit. This observation is consistent with the fact that all CPE photocatalysts for water splitting reported so far contain "P" in their backbones. In addition, compared to the CPEs that can catalyze only HER (half-water splitting) 35, our predicted candidates have deeper HOMO levels in the acceptors (e.g., FBT and PhCN), which enable overall water splitting and lower bandgaps. Thus, to improve the performance of CPEs as water splitting photocatalysts, one should focus on engineering the acceptor units, such as by tuning their ΔA 1 and ΔA 2 .
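The screening criteria for water splitting amount to a conjunction of four threshold tests followed by a ranking by E g . A minimal sketch is given below; the thresholds and the CPE-K atomization energy (−0.239 eV) are taken from the text, whereas the `Candidate` record, the function names and the four example entries are hypothetical.

```python
from dataclasses import dataclass

HOMO_MAX_EV = -5.87            # HOMO must lie below this level
LUMO_MIN_EV = -4.24            # LUMO must lie above this level
GAP_MAX_EV = 2.4               # upper bound on E_g for light absorption
CPE_K_ATOMIZATION_EV = -0.239  # stability reference quoted in the text


@dataclass(frozen=True)
class Candidate:
    name: str
    homo_ev: float
    lumo_ev: float
    gap_ev: float
    atomization_ev: float


def passes_water_splitting_screen(c):
    """True if the candidate meets all four criteria listed in the text."""
    return (
        c.homo_ev < HOMO_MAX_EV
        and c.lumo_ev > LUMO_MIN_EV
        and c.gap_ev < GAP_MAX_EV
        and c.atomization_ev < CPE_K_ATOMIZATION_EV
    )


def shortlist(candidates, top_n=5):
    """Keep passing candidates and return the top_n with the lowest gap."""
    passing = [c for c in candidates if passes_water_splitting_screen(c)]
    return sorted(passing, key=lambda c: c.gap_ev)[:top_n]


# Hypothetical entries used only to exercise the filter.
pool = [
    Candidate("hypothetical-A", -6.00, -3.90, 2.10, -0.250),
    Candidate("hypothetical-B", -5.60, -3.80, 1.80, -0.260),  # HOMO too high
    Candidate("hypothetical-C", -6.10, -3.60, 2.50, -0.270),  # gap too wide
    Candidate("hypothetical-D", -5.95, -4.00, 1.95, -0.300),
]
selected = [c.name for c in shortlist(pool)]
print(selected)
```

The gap is kept as a separate field rather than computed as LUMO minus HOMO, so that values obtained at different levels of theory can be combined in one record.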
Although the machine learning and data driven approach presented here are powerful tools in discovering CPEs for target applications, one has to bear in mind their limitations too. More data (in particular, experimental data) are critically needed to improve the fidelity of the predictions. More sophisticated theoretical models have to be developed to capture crucial physical/chemical processes, including charge transport, the role of solvent and defects, interfacial wettability, lifetime of charge carriers, etc.
In conclusion, we have constructed a first-principles database of CPEs, which allows us to systematically examine how HOMO/LUMO levels and E g depend on the structural components of CPEs, including the donor/acceptor units on the backbone, the alkyl chain length, the ionic group and the counter ion. General trends correlating HOMO/LUMO and E g with the structural components are established and can be rationalized in terms of the electrostatic interaction between the ionic functionalities and polarons on the CPE backbone. By means of machine learning, we have uncovered material features important to the frontier orbitals and E g , as well as structure-property relations for the rational design of CPEs. These features are then used as descriptors in the machine learning to predict HOMO/LUMO levels and E g of "unknown" CPEs beyond the database. Finally, we propose relevant design criteria, based on which promising CPEs are identified as hole transport materials in halide perovskite optoelectronic devices and as photocatalysts for water splitting. Our work represents a data-centric research effort dedicated to CPEs and could accelerate the discovery and development of CPEs for optoelectronic and photocatalytic applications.

Computational models and details
Our first-principles calculations are based on density functional theory (DFT) with planewave bases as implemented in Vienna Ab initio Simulation Package (VASP) [47][48][49][50] . The projector-augmented-wave method was used to describe the pseudopotentials. Both Perdew-Burke-Ernzerhof (PBE) and HSE06 hybrid functionals [51][52][53] were considered, with the former used on the entire CPE database and the latter on a subset of the database (Supplementary Table 1). As conjugated polymers, CPEs could adopt various conjugation lengths. Obviously, we cannot include all possible conjugation lengths for all CPEs (~1300) in the database. Instead, we focus on CPE oligomers with a single repeating conjugation unit. However, we have also considered a subset of CPEs with infinite conjugation lengths. In particular, we have examined the variation of HOMO/LUMO/E g energies as a function of the conjugation length for CPE-K ( Supplementary Fig. 4). To model CPEs with a single unit of conjugation length, we included a vacuum of 15 Å in each direction of the computational cells and found it sufficient to isolate periodic images of CPE molecules. Due to the steric hindrance, a fully extended and trans-conformational alkyl chain is energetically more stable than other conformations, thus is adopted as the initial structure for each CPE molecule 54 . We then relax the initial molecular structure to obtain its equilibrium conformation at 0 K. For a few select CPEs, we have also performed ab initio molecular dynamics at 300 K to examine whether the relaxed equilibrium structures remain stable (i.e., whether there are drastic conformational changes to the equilibrium structures). All equilibrium structures examined are found to be thermodynamically stable and can be used as bases to understand the structure-property relationships. Although local minima exist in the molecular structures of CPEs, selective analysis finds them to have minor influence on the results. 
The energy cutoff was set to 500 eV, and the convergence thresholds for the energy and force were 10^−5 eV and 0.02 eV/Å, respectively. The inclusion of the van der Waals dispersion correction at the level of PBE+D3 55 is found to have a negligible effect on the energetics and the structures of CPEs (Supplementary Fig. 5, Supplementary Table 2). It is known that the PBE functional generally underestimates E g compared to the HSE06 functional and experimental values. Interestingly, as shown in Supplementary Table 1, the ratios between the corresponding HSE06 and PBE values for CPEs are more or less constant; in particular, the ratio for E g is ~1.4 and that for the HOMO level is ~1.1. Since the HSE06 calculation is much more expensive than PBE, one could save tremendous computational time by scaling the PBE values (of E g and HOMO) by the constant ratios to approximate the corresponding HSE06 values without actually performing the HSE06 calculations on the entire set of CPEs in the database. On the other hand, it is also known and confirmed by our calculations that E g for CPEs with infinite conjugation lengths is always lower than that for CPEs with a single conjugation unit as well as the experimental value. In fact, modeling CPEs with a single conjugation unit using the PBE functional turns out to yield good agreement with the experimental values of E g , as shown in Supplementary Table 3. As for the HOMO level, we find that the HSE06 functional with a single unit of conjugation length yields results consistent with the experiments. Therefore, when screening CPEs for target applications, the PBE functional is used for E g while the HSE06 functional is used for the HOMO level, both with a single unit of conjugation length.
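The constant-ratio shortcut described above can be written as a one-line rescaling. The sketch below uses the approximate ratios quoted in the text (~1.4 for E g and ~1.1 for the HOMO level); the function name is hypothetical, and the two input values are simply the lowest PBE gap and the highest PBE HOMO level mentioned earlier, used here as illustrative inputs rather than as a paired result for one molecule.

```python
# Approximate HSE06/PBE ratios quoted in the text.
GAP_RATIO = 1.4
HOMO_RATIO = 1.1


def approximate_hse06(pbe_gap_ev, pbe_homo_ev):
    """Estimate HSE06-level E_g and HOMO (eV, vacuum-referenced) from PBE values."""
    return pbe_gap_ev * GAP_RATIO, pbe_homo_ev * HOMO_RATIO


# Illustrative inputs: lowest PBE gap (1.32 eV) and highest PBE HOMO (-4.17 eV).
hse_gap, hse_homo = approximate_hse06(pbe_gap_ev=1.32, pbe_homo_ev=-4.17)
print(round(hse_gap, 3), round(hse_homo, 3))
```

Because the HOMO level is negative when referenced to the vacuum level, multiplying by a ratio larger than one shifts it downward, in the same direction as the correction a hybrid functional typically introduces.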

DATA AVAILABILITY
The data that support this work are available in the article and Supplementary Information file. Further raw data can be found at the CPE database: https://cpegenome.com and are available from the corresponding author (G.L.) upon request.