A new model for predicting boiling points of alkanes

A general perception among researchers is that boiling points, a key property in the optimization of lubricant performance, are difficult to predict successfully using a single-parameter model. In this contribution, we propose a new graph parameter which we call, for lack of better terminology, the conduction of a graph. We exploit the conduction of a graph to develop a single-parameter model for predicting the boiling point of any given alkane. The model was used to predict the boiling points for three sets of test data, with coefficients of determination $R^2 = 0.7516$, $0.7898$ and $0.6488$, respectively. The accuracy of our model compares favourably to the accuracy of experimental data in the literature. Our results have significant implications for the estimation of boiling points of chemical compounds in the absence of experimental data.

A graph $G = (V, E)$ is a mathematical object which consists of a finite set $V$ of elements called vertices, together with a set $E$ of 2-element subsets of $V$, called the edges of $G$. As early as 1875, Cayley (see, for instance, 1 ), in his quest to enumerate the chemical molecules called alkanes, observed that molecules can be modelled by graphs in which atoms are represented by vertices and two vertices are joined by an edge if the corresponding atoms are linked by a bond. This graph model became widely known as the molecular graph. An interesting numeric value attached to each molecule is the boiling point, i.e., the temperature at which a substance has a vapour pressure of 760 mmHg 2 . Whilst the boiling point is a key property in classifying molecules, such as the alkanes which dominate mixtures of lubricants in industry, there is a general lack of experimental boiling point data (see, for instance, 1 and references cited therein). Moreover, although expected to be accurate, experimental boiling points are sometimes inaccurate due to the presence of impurities and, as a result, show wide discrepancies, especially for higher boiling points. This tends to produce wider and elevated boiling point ranges. For instance, the boiling point of cyclooctane (an alkane) is reported to range from 120.3 to 156.8 °C, and that of methylcycloheptane from 113 to 136.8 °C 2 . The lack of data, and the inaccuracies where data is available, have necessitated the development of models which can be used to estimate boiling points of chemical compounds for which no boiling point data is available or where inaccuracies exist. Graph parameters are a natural foundation on which to build these models 3,4 .
The Wiener index of a graph, introduced by Harold Wiener in 1947 5 , was the first graph parameter to be used in chemistry. In particular, it was used to predict boiling points of alkanes. Since its introduction, a lot of research effort has been invested in extensions of the Wiener index 3 and in other indices such as the Hosoya index 3 , the Gutman and the Schultz indices 6 , and a legion of other distance-based topological indices (see, for example, 7 ), which are used for the prediction of various properties of molecules.
For alkanes, the single-variable models that are available are either weak 3 or consider only a special class of the molecules 1 . Burch, Wakefield and Whitehead 1 successfully developed a single-variable model to calculate boiling points of special families of alkanes and developed multivariate models that can predict boiling points of all alkanes up to and including those of order 12. Several other models (see, for instance, [8][9][10] ) abound in the literature, the most recent being due to Sandak and Conduit 11 , who trained artificial neural networks to predict the physical properties of linear, single-branched and double-branched alkanes. Clearly, as Dearden 2 pointed out several years ago, modeling a property is easier when one is dealing with a single chemical class. However, from the point of view of an engineer concerned with a wide range of compounds, the methods of greater interest are those that can adequately model the behavior of varied data sets.
In this paper, we are concerned with relating two mathematical objects associated with a chemical molecule, namely the molecular graph and the numeric value of the boiling point of the molecule. We focus on molecular graphs of all alkanes and introduce a new graph parameter, which we call the 'conduction'. We use this new parameter to develop a single-variable model that predicts the boiling point of any given alkane. We use 1 as the source for the experimental boiling point data.
This paper is organised as follows: in the next section, we introduce the new graph parameter, the conduction of a graph, and illustrate how it is computed by determining the conduction of four special classes of graphs representing some important series of alkanes. The model is developed in "Model development". The main results, in which the model is tested, are presented in "Main results". The discussion and the conclusion follow in "Discussion and conclusion".

The graph parameter: conduction
Consider a connected graph $G = (V, E)$ of order $n$. The distance $d_G(u, v)$ between vertices $u$ and $v$ in $G$ is defined as the length of a shortest path joining $u$ and $v$ in $G$. For a vertex $v$, denote by $T_v$ the breadth-first search tree of $G$ based at $v$, and let $S_v$ be the set of end vertices of $T_v$. We select $T_v$ to be the tree that minimizes $\sum_{x \in S_v} d(v, x)$ amongst all breadth-first search trees of $G$ based at $v$. To contrive the conduction parameter, we envisage heat flowing from one atom of a chemical compound to the other atoms. We assume that in the heat conduction process between atoms (depicted by vertices of a graph), heat is transferred from vertex $v$ outwardly to all other vertices in the graph through contact via the edges of the tree $T_v$ until it reaches the end vertices in $T_v$. The speed of transfer, conceivably, is governed by the breadth of $T_v$, which can be approximated by the square of the degree, $\deg v$, of vertex $v$. As a result, in the heat conduction process, let the score $s(v)$ of vertex $v$ be the quantity [equation missing in source]. Mathematically, the conduction of a graph $G$, $c(G)$, is defined as [equation missing in source]. For instance, the graph in Fig. 1 has conduction $366/8$.
We use four special classes of graphs. The first is the path, $P_n$, of order $n$. The broom graph, $B_{n,q}$, is a graph of order $n$ obtained by taking the path $P_{n-q}$ and attaching $q$ end vertices to one end of $P_{n-q}$. The second class of graphs we consider is $B_{n,2}$, while the third class is $B_{n,3}$. These classes of graphs, namely $P_n = B_{n,1}$, $B_{n,2}$ and $B_{n,3}$, represent the normal-alkanes, the 2-methyl series and the 2,2-dimethyl series of alkanes, respectively. The fourth class is that of graphs $E_n$ of order $n$, where $E_n$ has fairly large conduction among all graphs of order $n$ and consists of a path $P$ with as many vertices of degree 4 on $P$ as possible. For instance, $E_8$ is given in Fig. 1. Table 1 shows some of the graphs $E_n$. In Tables 1 to 9, $c$ is the conduction.
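The graph constructions in this section can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the adjacency-list representation are our own assumptions, not the authors' code. It builds $P_n$ and $B_{n,q}$, computes breadth-first distances, and evaluates the sum of distances from a vertex to the end vertices. It deliberately stops short of the score and the conduction, whose formulas are not reproduced here. For a tree (every alkane skeleton is one) the breadth-first search tree at any vertex is the graph itself, so no minimization over trees is needed.

```python
from collections import deque


def path_graph(n):
    """Adjacency sets of the path P_n on vertices 0..n-1."""
    adj = {v: set() for v in range(n)}
    for v in range(n - 1):
        adj[v].add(v + 1)
        adj[v + 1].add(v)
    return adj


def broom_graph(n, q):
    """B_{n,q}: the path P_{n-q} with q end vertices attached to one end."""
    adj = path_graph(n - q)
    hub = 0  # the end of the path that receives the q new end vertices
    for leaf in range(n - q, n):
        adj[leaf] = {hub}
        adj[hub].add(leaf)
    return adj


def bfs_distances(adj, source):
    """Shortest-path distances from source to every vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def end_vertex_distance_sum(adj, v):
    """Sum of d(v, x) over the end vertices x, assuming adj is a tree."""
    dist = bfs_distances(adj, v)
    # If v is itself an end vertex it contributes d(v, v) = 0.
    return sum(dist[x] for x in adj if len(adj[x]) == 1)
```

For example, `broom_graph(5, 2)` is the carbon skeleton of 2-methylbutane, and `broom_graph(n, 1)` is again a path of order `n`, matching $P_n = B_{n,1}$.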
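Using the data in Table 1, we can approximate the conduction of $E_n$ by an equation of the form
$$c(E_n) = \alpha n^2 + \beta n + \gamma + \frac{\delta}{n}, \qquad (1)$$
where $\alpha$, $\beta$, $\gamma$ and $\delta$ are constants. Fitting the data in Table 1 to Eq. (1), we get
$$c(E_n) \approx \frac{4}{9} n^2 + 3.44857532\,n - 14.95083692 + \frac{18.25213877}{n},$$
with a coefficient of determination (CoD) of $R^2 = 0.9924$. We discuss the concept of the CoD in detail in "Model quality assessment".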
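As a quick way to use the fitted approximation for $c(E_n)$, the following sketch evaluates it for a given order $n$. The function name is hypothetical, and the code reads the leading coefficient as $4/9$ and the last term as $18.25213877/n$, which is our interpretation of the printed constants.

```python
def approx_conduction_E(n):
    """Fitted approximation of c(E_n); constants as quoted in the text."""
    if n <= 0:
        raise ValueError("order n must be a positive integer")
    return (4.0 / 9.0) * n**2 + 3.44857532 * n - 14.95083692 + 18.25213877 / n


if __name__ == "__main__":
    # Tabulate the approximation for a few orders.
    for n in range(5, 13):
        print(n, round(approx_conduction_E(n), 4))
```

Because the form is linear in its four constants, a fit of this kind can be obtained by ordinary linear least squares on the regressors $n^2$, $n$, $1$ and $1/n$.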

Model development
Model quality assessment. In developing our model, we adopt the folklore goodness-of-fit approach to determine the quality of the model. After performing a fitting process, it is important to determine the goodness of the fit. While there are many ways of doing so, in this paper we evaluate the residuals and plot them. We also use the CoD, which explains the variability between our model output and the data. This coefficient is commonly known as R-squared ($R^2$) and is given by
$$R^2 = 1 - \frac{RSS}{TSS},$$
where $R^2$ is the coefficient of determination, $RSS$ is the residual sum of squares and $TSS$ is the total sum of squares.
A CoD value of 1 indicates a perfect fit, and thus the model can be deemed very reliable for any future forecasts, while a value of 0 indicates that the model does not accurately model the data. Some researchers have argued that it is better to look at adjusted R-squared rather than the R-squared, but for the work presented in this paper, it suffices to use the non-adjusted R-squared.
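The CoD can be computed directly from observed and predicted values using the standard definition $R^2 = 1 - RSS/TSS$. The sketch below is a minimal illustration with hypothetical function names; it is not the authors' code.

```python
def residuals(observed, predicted):
    """Residuals: observed value minus model output, point by point."""
    return [y - f for y, f in zip(observed, predicted)]


def r_squared(observed, predicted):
    """Non-adjusted coefficient of determination, R^2 = 1 - RSS/TSS."""
    mean = sum(observed) / len(observed)
    rss = sum(r * r for r in residuals(observed, predicted))
    tss = sum((y - mean) ** 2 for y in observed)
    return 1.0 - rss / tss
```

A model that reproduces the data exactly gives a value of 1, and a model that always predicts the mean of the data gives 0.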
Problem statement. Let $A_n$ be an alkane with $n$ carbon atoms. Our problem is to estimate the boiling point of the alkane $A_n$. We represent the alkane by a graph, $G_n$, of order $n$, and will often use $A_n$ and $G_n$ interchangeably. We thus denote the boiling point of $A_n$ by $b(A_n)$ or $b(G_n)$.
Model when sufficient data is available. We first consider a situation in which there is sufficient data on the boiling points of a group of alkanes. The general observation is that the conduction values of alkanes of order $n$ have a global linear relationship with the experimental boiling points, with localised oscillations about the regression line whose diameter decreases as the conduction increases. As a first approximation, to find $b(A_n)$, we propose a model of the form
$$b(A_n) = \alpha\, c(G_n) + \beta, \qquad (2)$$
where $c(G_n)$ is the conduction of the graph $G_n$ and $\alpha$ and $\beta$ are constants to be determined by fitting the model to boiling point data of alkanes of order $n$. To improve the fit and accuracy, we propose a combination of a linear fit with logarithmic and trigonometric functions. The logarithmic and trigonometric components are incorporated to capture the oscillatory tendencies. We thus propose a function of the form (3) [equation missing in source], where $\alpha$, $\beta$, $\alpha_i$, $i = 1, 2, \ldots, 6$, are constants to be determined by fitting the model to data of alkanes of order $n$.
To illustrate the application of (3), we consider the data set in Table 2 of all alkanes of order 6. While the linear model (2) fits the data with $(\alpha, \beta) = (-1.8245, 96.5461)$ and a CoD value of $R^2 = 0.7903$, our model (3) fits the data with [constants missing in source] and a CoD value of $R^2 = 1$. The graphs of the fit, obtained using the least squares (lsq) fitting method, and of the residuals are given in Fig. 2. The proposed function (3) fits the data in Table 2 well, with very low residuals.
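A least squares fit of the linear model can be sketched as follows. The closed-form slope and intercept are the standard ones for simple linear regression. The function names are hypothetical, and we assume that in $(\alpha, \beta) = (-1.8245, 96.5461)$ the first constant is the slope and the second the intercept, as the form of the normal-alkane model suggests.

```python
def fit_line(xs, ys):
    """Least squares estimates (alpha, beta) for y = alpha * x + beta."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    beta = mean_y - alpha * mean_x
    return alpha, beta


def predict_boiling_point(c, alpha=-1.8245, beta=96.5461):
    """Linear model with the constants quoted for alkanes of order 6.

    ASSUMPTION: alpha multiplies the conduction c and beta is the intercept.
    """
    return alpha * c + beta
```

In practice `fit_line` would be called with the conduction values and experimental boiling points of Table 2, which are not reproduced here.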
The approach above relies on the availability of data for alkanes of order $n$. In the absence of such data, our initial step will be to develop the data. Once we develop the data, we can then use (2) [or (3), for more accuracy] to find the equation that estimates the boiling point to a reasonable degree of accuracy (namely, coefficient of determination values of $R^2 = 0.7516$, $0.7898$ and $0.6488$ for the three test data sets, respectively). We cover this below.
Generating data. We need to generate data for some alkanes of order $n$, as more data yields better accuracy during the fitting process. The graphs $P_n$, $B_{n,2}$, $B_{n,3}$ and $E_n$ all represent alkanes of order $n$; we estimate their boiling points in turn below.
Boiling point for $P_n$. Comparing the boiling points of normal-alkanes with their conduction values, we see that the growth of the boiling points of the normal-alkanes is linear in the conduction values, with an additional logarithmic increase. We thus propose the model
$$b(P_n) = \alpha\, c(P_n) + \beta + \alpha_1 \log_{\alpha_2}\left[\alpha_3\, c(P_n) + \alpha_4\right],$$
where $\alpha$, $\beta$, $\alpha_i$, $i = 1, 2, 3, 4$, are constants to be determined by fitting the model to data of normal-alkanes. From Proposition 2.1, this reduces to (4) [equation missing in source]. Fitting (4) to the data of the first few normal-alkanes given in Table 3, we obtain the values of the constants as [values missing in source]. The graph and the plot of residuals are given in Fig. 3. The proposed function (4) fits the data in Table 3 well and the residual values are also very low.
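The linear-plus-logarithmic form proposed for the normal-alkanes can be evaluated as in the sketch below. The function and parameter names are hypothetical, and no fitted constants are supplied because they are not reproduced in the text; any values passed in are the caller's own.

```python
import math


def log_model(c, alpha, beta, a1, a2, a3, a4):
    """alpha*c + beta + a1 * log_{a2}(a3*c + a4) for a conduction value c."""
    arg = a3 * c + a4
    if arg <= 0:
        raise ValueError("logarithm argument a3*c + a4 must be positive")
    if a2 <= 0 or a2 == 1:
        raise ValueError("logarithm base a2 must be positive and not 1")
    return alpha * c + beta + a1 * math.log(arg, a2)
```

Since $\log_{a_2} x = \ln x / \ln a_2$, the constants $a_1$ and $a_2$ enter only through the ratio $a_1 / \ln a_2$, so different pairs can give identical predictions. This is worth keeping in mind when comparing fitted constants.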
Boiling point for $B_{n,2}$. As in the case of normal-alkanes, we propose the model [equation missing in source], where $\alpha$, $\beta$, $\alpha_i$, $i = 1, 2, 3, 4$, are constants to be determined by fitting the model to data of the 2-methyl series of alkanes. From Proposition 2.1, this reduces to (5) [equation missing in source]. Fitting (5) to the data of the first few members of the 2-methyl series given in Table 4, we obtain the values of the constants as [values missing in source], with a CoD value of $R^2 = 0.9996$. The graph and the plot of residuals are given in Fig. 4. The function (5) produces a near-perfect fit to the data in Table 4.
Boiling point for $B_{n,3}$. [Text missing in source.] Fitting (6) to the data of the first few members of the 2,2-dimethyl series given in Table 5, we obtain the values of the constants as [values missing in source], with a CoD value of $R^2 = 0.9999$. The graph and the plot of residuals are given in Fig. 5.
Boiling point for $E_n$. As in the case of normal-alkanes, we propose the model [equation missing in source], where $\alpha$, $\beta$, $\alpha_i$, $i = 1, 2, 3, 4$, are constants to be determined by fitting the model to the data given in Table 1. From Proposition 2.2, this reduces to (7) [equation missing in source]. Fitting (7) to the data given in Table 1, we obtain the values of the constants as [values missing in source], with a CoD value of $R^2 = 0.9751$. The graph and the plot of residuals are given in Fig. 6.

Main results
From the previous sections, we have now generated data for boiling points of some alkanes with $n$ carbon atoms. Let [definitions missing in source]. We present the data in Table 6.
Table 6 (partially recovered): rows for $P_n$, $B_{n,2}$, $B_{n,3}$ and $E_n$ (the last associated with $x_4$), with constants as given in (4), (5), (6) and (7), respectively.
To find $b(A_n)$, we propose the lsq model (8) [equation missing in source], where [definitions missing in source] and
$$x_4 = \frac{4}{9} n^2 + 3.44857532\,n - 14.95083692 + \frac{18.25213877}{n}.$$
Model testing. We first test the predictive ability of our model, (8), on all alkanes of order 6. We present the results in Table 7. In Tables 7, 8 and 9, "Bp in °C" means the experimental boiling point in degrees Celsius, while "predicted" means the boiling point predicted by our model. The CoD value is $R^2 = 0.7516$ and the graphs are presented in Fig. 7.
Next we test the predictive ability of our model, (8), on all alkanes of order 7. We present the results in Table 8. The CoD value is $R^2 = 0.7898$ and the graphs are presented in Fig. 8.
We now test the predictive power of our model, (8), on alkanes of order 8. We present the results in Table 9. The CoD value is $R^2 = 0.6488$ and the graphs are presented in Fig. 9.

Discussion and conclusion
In this paper we provide a novel model for predicting the boiling points of alkanes. We consider alkanes of orders 6, 7 and 8 to test the usefulness of our model. We consider alkanes with $n$ carbon atoms, represented by a graph $G_n$, whose boiling points may be known experimentally. In particular, the methods presented here are able to develop boiling point data using the new parameter, which we defined as the conduction of a graph. The nature of the boiling point data necessitated the use of a combination of trigonometric and logarithmic functions in coming up with models that fit the data. The models are then used to predict the boiling points of alkanes whose number of carbon atoms is known. The model presented in this article is not without limitations. First, in some instances we fit linear models to experimental data with very few data points. Second, the boiling points of certain compounds could not be ascertained from the literature. Last, the approximation of the numerical values of some parameters gives rise to variability in the model predictions.
Despite these shortcomings, the model has some strengths. Firstly, our single-variable model puts to rest the general perception among researchers that boiling points are difficult to predict successfully using a single-parameter model 3,4 . Our key contribution has therefore been the development of a new parameter, the conduction of a graph, which can adequately capture the boiling points. In predicting boiling points, the conduction of a graph has proved superior to previously considered parameters such as the Wiener index and other commonly used topological indices (see, for example, 3 ). In light of this, it will be interesting for future research to see how the conduction of a graph relates, mathematically, to other indices such as the Wiener index.
Secondly, as Sandak and Conduit 11 put it, an engineer wants a model that accurately predicts boiling points of the full range of alkanes. Whilst existing models have been successful in predicting boiling points for only a sub-class of alkanes, our model predicts boiling points for the full range of alkanes. As seen above, our model successfully predicted boiling points for the considered data sets with CoD values of $R^2 = 0.7516$, $0.7898$ and $0.6488$. Whilst these CoD values are considered satisfactory, one way of improving the accuracy of the model is to include more special graphs that generate data for alkanes of order $n$ (see, for instance, Table 6).
Received: 13 August 2021; Accepted: 6 December 2021
Table 9. All alkanes of order 8: predictive ability of model (8).
Figure 9. Comparison of the linear least squares fit to experimental data and the linear model (8) for all alkanes of order 8.