Machine learning for accelerating the discovery of high-performance donor/acceptor pairs in non-fullerene organic solar cells

Integrating artificial intelligence (AI) and computer science together with current approaches in material synthesis and optimization will act as an effective approach for speeding up the discovery of high-performance photoactive materials in organic solar cells (OSCs). Yet, like model selection in statistics, the choice of appropriate machine learning (ML) algorithms plays a vital role in the process of new material discovery in databases. In this study, we constructed five common algorithms, and introduced 565 donor/acceptor (D/A) combinations as training data sets to evaluate the practicalities of these ML algorithms and their application potential when guiding material design and D/A pairs screening. Thus, the best predictive capabilities are provided by using the random forest (RF) and boosted regression trees (BRT) approaches beyond other ML algorithms in the data set. Furthermore, >32 million D/A pairs were screened and calculated by RF and BRT models, respectively. Among them, six photovoltaic D/A pairs are selected and synthesized to compare their predicted and experimental power conversion efficiencies. The outcome of ML and experiment verification demonstrates that the RF approach can be effectively applied to high-throughput virtual screening for opening new perspectives to design of materials and D/A pairs, thereby accelerating the development of OSCs.


INTRODUCTION
In the latest decade, the discovery of novel photoactive donor (D) and acceptor (A) materials has greatly promoted the development of bulk heterojunction (BHJ) organic solar cells (OSCs) [1][2][3][4][5][6][7] . In this process, tens of thousands of photovoltaic materials have been designed and synthesized by chemists and materials scientists. More than 50,000 publications focusing on OSCs have been tracked by Web of Science (see Supplementary Fig. 1) in the latest decade. More than 60% of these publications are based on material science, mainly involved in material design for photoactive layers in OSCs. Apart from materials reported in these publications, there are imponderable materials, which have been synthesized and characterized while they have not been reported ever owing to low performance in OSC devices. It mainly results from failed design strategies of D/A materials and/or unsuitable synergistic effects of D/A combinations. Clearly, the consumption on trial and error experiments is huge. As shown in Fig. 1, power conversion efficiency (PCE) can only be recognized after manual designing, material engineering, and devices engineering. In other words, development of photovoltaic materials is an experiencebased screening process, whose essence is to include a lifecycle of "Discovery-Synthesis and Research-Optimization-Evaluation". In addition to material design and synthesis, which involves new route, structure characterization, material purification, etc., the process of material optimization and evaluation is also uncertain. On one hand, the efficient device performance is achieved under the synergistic effects of donor/acceptor (D/A) combinations, including complementary absorption 8 , matched energy levels 9 , favorable blend morphology with nanoscale phase separation 10 , high and balanced charge transport property [11][12][13] , etc. On the other hand, adjusting processing parameters such as D/A ratio, processing solvent and solvent additive, pre-and post-treatments, selection of film forming technology and so on [14][15][16][17][18] , and constructing device architectures via interface engineering and device engineering [19][20][21] , are also tedious and laborious work. Obviously, the input-output ratio as discussed above generally is unequal in the field of photovoltaic material research. To reduce the consumption of trial and error experiments and accelerate the discovery of high-performance materials, developing strategies based on shortening the lifecycle of material development will become extremely important in the future research of OSCs.
Thanks to the contributions of many computer scientists in terms of the combination of artificial intelligence (AI) and big data 22 , machine learning (ML) technique, which is a computational method, spread quickly into many research fields and commercial projects, such as drug discovery 23 , genome sequencing 24 , metalorganic frameworks materials 25 , and organic light-emitting diodes 26 , etc. In material science, ML is an ideal tool that can effectively learn from past massive data sets and mechanisms, automatically generate structures, assess their electronic features and other properties, determine the underlying rules among these data sets and build scientific models to make predictions with reasonable accuracy [27][28][29][30][31][32] . As a result, compared with hundreds of hours or more it takes to evaluate a compound by experiment, the models based on appropriate algorithms can predict properties in a few minutes. It is foreseeable that applying ML methods to the design of photovoltaic materials or systems will greatly accelerate the discovery of high-efficient materials, reduce the research lifecycle, and promote the development of OSCs.
Some ML approaches have been introduced into the OSCs for the screening of complex molecules and their PCE prediction of relevant photovoltaic systems [33][34][35][36][37][38][39][40] . For instance, Sun et al. 33 developed a deep learning model based on a convolutional neural network (CNN) that enables recognition of chemical structures and automatic classification for predicting the PCE of D materials. Shinji et al. reported a screening of conjugated molecules for polymer-fullerene OSC applications with artificial neural network (ANN) and random forest (RF) algorithms by importing parameters such as PCE, molecular weight, energy levels, and electronic properties with digitized chemical structures 36 . However, these examples of the application of ML algorithms were mostly used in the fullerene-based OSCs, whereas the reports about the application of ML algorithms in the non-fullerene-based OSCs are limited [41][42][43] . As non-fullerene acceptors (NFAs) draw researchers' attention and have become research hotspots [44][45][46] , and most state of art OSCs with efficiency up to 13-17% were achieved by NFA OSCs in these few years 4,47,48 , we shoud pay more attention to the applications of ML approaches that would tackle broader and more complex non-fullerene OSCs. Notably, the main challenges associated with screening the ideal photovoltaic materials from diverse non-fullerene systems are not only the selection of optimal algorithms but also the construction of training data sets. D/A pairs rather than single D or A materials as parameters generally have been considered to identify the device performance, and give the effective instructions on materials designing, morphology optimization and device engineering. Although ML methods have great potential in discovering materials for non-fullerene OSCs, they have not been fully explored, compared and verified systematically.
In this work, we performed ML methods including linear regression (LR), multinomial logistic regression (MLR), RF, ANN, and boosted regression trees (BRT) to investigate polymer-NFAbased OSCs devices based on the D-A copolymer Ds and ladder fused non-fullerene As with A-D-A configuration. Detailed description of these algorithms was provided in Machine Computation section of Supplementary information and illustrated at Supplementary Fig. 2. Our models are constructed on the basis of 565 D/A combinations as training data sets-relevant chemical structures of NFA systems as well as their photovoltaic parameters obtained from the literature. The correlation between D/A pairs and the PCE predication for the polymer-NFA OSC devices is validated via the above-mentioned ML algorithms. After ML, RF and BRT algorithms showed the higher maximum Pearson's correlation coefficient (r) values as compared with the other algorithms. Thus, >32 million D/A combinations were further calculated by these two models for predicting the potential PCEs of these D/A combinations. Among them, six D/A pairs as photoactive systems were selected and synthesized to compare their predictive and experimental PCEs. It was found that the developed RF model can be applied effectively to highthroughput virtual screening for guiding the compatible design and characterization of new D/A combinations in nonfullerene OSCs.

Research Workflow
The ML-assisted materials screening and experimental validation of new D/A pairs are depicted in Fig. 2. Frequently reported new D/A pairs in recent years serve as reliable suppliers of training data. These photovoltaic materials have definite structures while   Here, data of 477 D/A pairs were inputted for training to generate predictive models. Then, data of 88 D/A pairs were served as testing set to evaluate the predictive capability of models based on the five algorithms, which are introduced from Microsoft Azure ML studio for predicting PCEs. (4) The models computed 32,076,000 potential D/A pairs, from the results of which, six new D/A pairs are selected by experiments to further confirm the performance of ML methods in screening counterpart for active layer materials. (5) The selected D/A pairs will be synthesized and characterized to evaluate the performance of ML models measured using root mean squared error (RMSE) metrics.

Data
In all, 565 D/A pairs were collected from the literature (274 papers). The details of these D/A pairs together with their photovoltaic parameters and corresponding references are provided in the Supplementary Information (see Supplementary  Table 11). Transferring chemical structure into input data critically to mitigate personal impact on the accuracy of predictive models has a vital role in determining the accuracy of models. In this study, the chemical structure was divided into several fragments with a series of binary type numbers by merging highly similar structures and numbering them, and further transformed into machine language. D-A copolymer Ds usually have donor units (e.g., benzodithiophene (BDT)) and As (e.g., thiazolo [5,4-d]thiazole (TTz)) flanked with or without π bridges (such as thiophene, thiazole), whereas non-fullerene As have structure with donor core (D), side chains, electron-deficient end group (A) and, in some cases, π bridges. For example, as shown in Fig. 3, polymer D J71 was divided into four fragments as part 1 (BDT D unit), part 2 (thiophene π bridge on the left side), part 3 (TTz A unit), and part 4 (thiophene π bridge on the right side). On the total, there are 31 part 1 fragments, 14 part 2 fragments, 27 part 3 fragments, and 14 part four fragments summarized for the collected D materials. Of note that random copolymers as well as regioregularity of common polymer backbone were not collected. As for nonfullerene As, single molecular structure was divided into five parts.
Normally, during the synthesis of D backbone of A-D-A NFAs, part 1 unit was coupling linked with two part three units. Accomplished with the formation of part 2, the D backbone is obtained with a ring-cycling reaction. The π bridges and end groups were defined as part 4 and part 5, respectively. To simplify computation, noncentrosymmetric were not included. On the total, there were 30 part one fragments, 18 part two fragments, 6 part three fragments, 22 part four fragments, and 35 part five fragments summarized for A materials. The chemical structures of these fragments are listed in Supplementary Fig. 3.
In addition to structure reasons, these materials were divided into several parts for their different roles in determining properties such as energy levels and aggregation behaviors, and further determining PCE, which means different parts have different weightings in determining predictive results. For example, the end groups (part 5 of As) play a vital role in determining lowest unoccupied molecular orbit levels, whereas highest occupied molecular orbit (HOMO) levels depend more on D core (part 1, part 2, and part 3 of As). Moreover, lamellar stacking is usually adjusted by side chains (part 2 of As). For some polymer Ds and NFAs, they are not constituted by all kinds of parts. For example, many NFAs do not have π bridges (part 4 of As). In this case, missing fragment is also numbered with individual binary number. As a result, the D/A pairs based on two series of structural fragments can be numbered and transformed successfully to digitized chemical structures.
Predicted results As shown in Fig. 4a, 84.4% of data (477 in 565 D/A pairs) were randomly inputted for training and models generation, and the rest of data were randomly selected as validation set and run with five models to test their accuracies in predicting PCE. Figure 4b-f show the training results of the five algorithms, where the horizontal and vertical axes represent the experimental PCE and predicted PCE, respectively, and r is the correlation coefficient. After training and testing the datasets, both RF and BRT models exhibit satisfactory performance in predicting PCE with higher r values of 0.70 for RF and 0.71 for BRT, respectively, as summarized in Supplementary Table 1. The experimental PCE and predicted PCE obtained from LR, MLR, and ANN methods have relatively lower correlation, with r values of 0.54, 0.59, and 0.60, respectively, which are far below the perfect positive correlation and insufficient for practical use. The RF and BRT models repeatedly fit many decision trees to improve the accuracy of the model. They are known to be robust even in the presence of several  Results of classification PCEs in these data were divided into three groups, high level (A, PCE > 11%), moderate level (B, 7% ≤ PCE < 11%) and low level (C, 0 ≤ PCE < 7%). The row and column represent the experimental and predicted classifications of the PCE, respectively, such that the diagonal line corresponds to the correct answer. As shown in the columns of Fig. 4d Supplementary Fig. 5). It implies that if more data were involved, the accuracy of our models can be improved 33 . Unlike the evaluation of B level, the C level for predicting PCE shows the huge uncertainty, demonstrated by their accuracies of predicting PCE, which are 5.9%, 20.0%, 23.5%, 76.5%, and 58.8% for LR, MLR, RF, BRT, and ANN, respectively. From this point of view, we should also pay close attention to the value of undesired experimental results, most of which are not published. Typically, relevant D/A pairs and their photovoltaic performance also work as important references not only for manual designing of materials but also for ML. As RF and BRT exhibit favorable performance in predicting PCE and give correct answers with great accuracy for the D/A pairs, these two methods are applied to predict the PCEs of automatically generated 32,076,000 D/A combinations, and further evaluate the practicability of these two ML algorithms.
The data set consisting of structural fragments of D/A pairs can be used as a new database that automatically generates new D/A combinations for potential PCE prediction. Based on fragments listed in Supplementary Fig. 3, resulting from 565 D/A pairs, there are over 32,076,000 new D/A combinations generated. The distributions of predictive PCE based on RF and BRT methods are shown in Fig. 5a, b, respectively. According to results outputted by RF model (in Fig. 5a), 12.27% of D/A combinations can possibly achieved PCE exceeding 11%, and it is 14.15% for BRT model (in Fig. 5b) Here, we further extracted results of PM6 containing D/A combinations to investigate the statistical characteristics of high-performance active layer materials. As shown in Supplementary Fig. 6, there are 17,820 results for PM6 in each model, of which 13.47% is in A level for RF models, and 33.63% in A level for BRT model. As with our expectation, the PCE distributions of PM6 tend to move toward district of higher PCE compared with the overall distribution of 32,076,000 results, which is more obvious in BRT model.

Experimental validation
In these D/A combinations with PM6 as D material, three NFAs including Y-ThCN, Y-ThCH3, and Y-PhCl were selected, synthesized, and characterized to demonstrate how ML methods help to screen counterpart for active layer materials. These D/A combinations were selected for the following reasons: (i) they were predicted to have high performance with PCE over 10%; (ii) the new materials are easy to synthesized by one-step reaction with all raw materials commercially available. It is worth mentioning that not all materials in the predicted D/A combinations can be synthesized in laboratory. Some materials exist only in theory. Besides, to enrich the experiment sample size, the D/A combinations consisting of PBDB-T and Y-ThCN, Y-ThCH3, and Y-PhCl were also characterized. These combinations were further applied to verify the PCE predication, and evaluate the predictive ability of RF and BRT models. The chemical structures and the performance of selected D/A combinations was listed in Fig. 6 with corresponding photovoltaic parameters and predictive PCEs summarized in Table 1. The details of the synthesis of NFAs are described in Experimental section of Supplementary Information and their 1H nuclear magnetic resonance ( 1 HNMR) spectra were provided at Supplementary Figs. 7-9.
As the predicted results from BRT method, PM6:Y-ThCN, PM6:Y-ThCH 3 , and PM6:Y-PhCl binary systems can achieve PCEs of 11.56, 11.14, and 13.30%, respectively. In addition, if the RF method is used as the ML model, these three systems can exhibit the predicted PCEs of 10.52% for PM6:Y-ThCN, 10.41% for PM6:Y-ThCH 3 , and 13.33% for PM6:Y-PhCl, respectively. After the systematic optimization of photoactive layers, as presented in Supplementary Table 3-10, devices based on PM6:Y-ThCN and PM6:Y-PhCl achieved high PCEs of 13.18% and 15.71% (Fig. 6b), respectively, which are not very far off the predicted PCEs of both BRT model and RF models (Table 1). However, the predicted and experimental PCEs of PM6:Y-ThCH 3 are not at the same level. For PBDB-T based system, PBDB-T:Y-ThCN, PBDB-T:Y-ThCH 3 , and PBDB-T:Y-PhCl are predicted to achieve PCEs of 12.73, 12.49, and 12.55% by BRT method, and 11.49%, 11.64%, and 11.32% by RF method. Their experimental PCEs show a high agreement with their predicted results with 11.02%, 11.08%, and 11.19% (Fig. 6c). Owing to higher HOMO levels of PBDB-T, the PBDB-T system had lower open circuit voltage (V OC ) 49 . All systems except PM6:Y-ThCH 3 have high current density (J SC ) and fill factor (FF) as shown in Table 1. External quantum efficiency (EQE) of the representative devices is given in Supplementary Fig. 10, which are well correlated with J SC . Generally, the experimental PCEs of most system agree well with the predicted PCE, indicating that both methods are qualified to provide instruction on counterpart materials screening in active layers.
To further screen the optimal methods for PCE prediction, RMSE values between predicted PCE and experimental PCE were introduced from this equation, where y i and y 0 i represent predicted and experimental PCE value, respectively, and n is the number of D/A combinations characterized in this experiment. The unit of percentage is ignored to make the RMSE value more apparent. For BRT method, the RMSE value is 2.42, whereas it is 1.17 for RF method. Obviously, RF method shows a narrower deviation in predicting PCE of D/A combinations, indicating that it is more useful to accelerate the discovery of high-performance D/A pairs in OSCs.

DISCUSSION
The deviation between prediction and experiment seems inevitable, because experimental data are not always "reliable" enough 50 . As we know, PCEs of photovoltaic materials are sensitive to the materials purity, experimental operation, external environment, etc. One D/A combination might have different experimental PCEs in different laboratories. Usually, the experiment results probably lost repeatability owing to difference in trace solvent vapor in atmosphere, humidity, temperature and sometimes critical operating details of researchers in different labs and even in same labs. Another well-known batch-to-batch vibration exists in polymer materials, which tend to have different molecule weight and molecule weight distribution. For example, during the period of our research, Hou's group also published PM6:Y-PhCl combination. In their work, the maximum PCE was 16.5%, which is higher than that of our work 51 . It probably results from the different batches of PM6 materials and experimental conditions. It should be noted that, although the laboratory efficiency fluctuates below the theorical highest efficiency that one D/A combination could achieve, and the difference is difficult to sweep out, the laboratory efficiency can still be used to roughly evaluate the real performance of the materials in OSC devices because almost every researchers spared no effort to narrow the gap between laboratory efficiency and theorical efficiency of their materials, as higher PCE is one of the most important pursuit of material scientist in OPV filed. Therefore, the gap between the laboratory efficiency and theorical efficiency causes deviation between prediction and experiment, yet it is not so huge to make the models invalid. On the other side, although we cannot guarantee that the experimental PCE is exactly the highest PCE it can achieve, because there are so many conditions we have not tested 52 , big data processing shows its superiority that it can weaken the influence of experimental errors to some extent, so results from ML methods still have valuable instructions on experiment. From this respect, the RF and BRT models can be upgraded by adding the processing conditions and the specific physical and chemical properties of materials as well as the new performance evaluation parameters in the future work. For instance, as we previously reported, an industry figure of merit, which integrates device efficiency, photostability, and synthetic complexity to evaluate materials can be introduced into these models to better described the comprehensive performance in practical application 7 . Although there is deviation between results of experiments and simulations, it doesn't hinder us from exploring its application in OSCs.
In summary, our work offered a paradigm for involving ML in solving problems in OSCs, which used to rely on vast laboratory work. Five models were built on LR, MLR, BRT, RF, and ANN algorithms. Although 565 D/A pairs as training data sets were collected to evaluate the predictive ability of these ML algorithms. Contrary to the low correlation coefficient in LR, MLR, and ANN, both BRT and RF models exhibited improved accuracy in predicting PCE with high r values of 0.71 for BRT and 0.70 for RF, respectively. In addition, the BRT and RF methods were evaluated in practical application by screening 32,076,000 D/A combinations. By analyzing results outputted by these two methods, six new D/A combinations, including PM6:Y-ThCN, PM6:Y-ThCH 3 , PM6:Y-PhCl, PBDB-T:Y-ThCN, PBDB-T:Y-ThCH 3 , and PBDB-T:Y-PhCl combinations, were characterized for further evaluating the practicability of these two algorithms applied into OSCs. Furthermore, by integrating and comparing the predicted data and experimental results optimized, it can be concluded that the RMSE value matches nicely with the RF model in the predicting PCE, indicating that it is more suitable for screening new promising materials from predictive statistical data generated. Our results demonstrate that ML methods, especially for RF   model, are alternative strategies to accelerate the discovery of high-performance D/A combinations and can provide the decision making of molecular design in OSCs or other organic electronics.

Data collection of non-fullerene-based OSCs
The data set was constructed by collecting 565 D/A pairs published between 2015 and 2019. And it was restricted to binary OSCs based on non-fullerene small molecule A and polymer D.

Model building
The ML model was built based on LR, MLR, RF, BRT, and ANN algorithms with 84.4% of collecting data for training set and 15.6% for testing set. The work of computation was achieved on this website: https://studio.azureml.net/.

Model performance metrics
The models built in this work were evaluated by mean absolute error, RMSE, and Pearson's correlation coefficient (see Supplementary Table 1). In addition, they were further validated by experiment.