Experiments in modeling recent Indian fertility pattern

Modelling is a well-established concept for understanding the typical shape and pattern of age-specific fertility. The distribution of India’s age-specific fertility rate (ASFR) is unimodal and positively skewed and is distinct from the ASFR of the developed countries. The existing models (P-K model, Gompertz model, Skew-normal model and G-P model considered here) that were developed, based on the experiences of the developed countries, failed to fit the single-year age-specific fertility pattern for India as a whole and for the six selected states. Our study has proposed four flexible models, to capture the diverse age pattern of fertility, observed in the Indian states. The proposed models were compared in three ways; among themselves, with the original models and with the popular Hadwiger model. The parameters of these proposed models were estimated through the Non-Linear Least Squares Method. To find the model with best fit, we used the corrected version of Akaike’s Information Criterion (AICc). Optimization of the four original models was successfully done. When the model was fitted to the empirical data of the 4th round of the National Family Health Survey conducted in 2015–2016, the results of this study showed that all the four proposed models outperform their corresponding original models and the Hadwiger model. When comparison among the proposed models was done, the Modified Gompertz Model provided the best fit for India, Uttar Pradesh and Gujarat. Whereas, the Modified P-K model gave the best fit for West Bengal, Tripura and Karnataka. The Modified G-P model is the most suitable model for Punjab. Although our proposed models illustrated the fitting of ASFR for India as a whole and the selected six states only, it provides an important tool for the policymakers and the government authorities to project fertility rates and to understand the fertility transitions in India and various other states.


Data and methodology
The present study used the data from the National Family Health Survey (NFHS-4), conducted in 2015-2016. The NFHS is a national representative cross-sectional survey conducted for the fourth time in India, as part of The Demographic Health Survey (DHS) Program. Previously, three rounds (1992-1993 24 , 1998-1999 25 and 2004-2005 26 ) have been successfully conducted. The International Institute of Population Sciences (IIPS) conducts the NFHS, with the collaborative efforts of a large group of organizations under the stewardship of the Ministry of Health and Family Welfare (MoHFW), Government of India. NFHS provides major indicators like mortality, fertility, maternal and child health indicators, family planning, child nutrition, domestic violence, morbidity estimates for several diseases etc. at the national and the state level. This is for the first time, that the NFHS-4 has given estimates of major indicators for 640 districts of India (as per the 2011 Census of India). NFHS-4 also provides information on several new and emerging issues, including insurance coverage, use of www.nature.com/scientificreports/ mosquito nets for malaria prevention, abortion, HIV testing during antenatal care, ownership of physical and economic assets by women, and domestic violence during pregnancy. The study sample for NFHS-4 is based on the stratified two-stage sampling of households. A total of 6, 99,686 eligible women of the age-group 15-49 years and 1, 12,122 eligible men of the age-group 15-54 years, were successfully interviewed using the Computer Assisted Personal Interviewing method. For more details regarding the survey design and the questionnaires, one can refer to the NFHS-4 report 27 . Amongst all the fertility indicators, the ASFR secures a crucial place because of its role in forecasting the population by the cohort component projection method. Though, the ASFR estimated from the empirical data is a good portrayal of the existing fertility behaviour of the population, using it for the projection of future population requires an afterthought. Owing to its ability, to pragmatically translate the assumptions about the future fertility behaviour, modeling of ASFR is of considerable interest to the demographers. If someone wants to analyse the mean age of childbearing or wants to know the proportion of births by mothers belonging to the adolescent age group (15)(16)(17)(18)(19) years or by mothers over the age of 35 years, information cannot be directly obtained through ASFR summary measure. There is also much information about the variance, kurtosis, and skewness of the ASFR curve of a particular region. Modelling would be necessary for those conditions. Modelling of age-specific fertility is also necessary to understand the phenomenon of age specific fertility transition among different countries. The foremost aim of any model building is to extract all possible information from the available data and to give an accurate picture of all the known and the unknown facets of the phenomena under study 22 . But, the approximation errors of the fertility model, serve as the primary drivers of deviation in the forecasted population 37 . Hence, it is necessary to identify the correct model for fertility schedule.
In order to simulate the fertility schedule, information on the birth histories of women in the age group of 15-49, is extracted from the NFHS-4. Although 5-year ASFR is reported in the national as well as state reports of NFHS-4, we have computed single-year ASFR for the modelling fertility pattern. The reference period for the calculation of single-year ASFR is 3 years before the survey, i.e., only the births that have occurred within 3 years, before the survey date was utilized to compute single-year ASFR. A total of 6, 98,185 eligible women were considered in our study. For the calculation of all India and state level single-year ASFR, we used STATA tfr2 module as described by Schoumaker 38 .
To find the true nature of the fertility schedule, it is better to choose a single year ASFR than 5-year grouped ASFR. However, single-year ASFR suffers from some fluctuations and some digit preferences at certain ages. To detect the correct trend of fertility patterns suffering from the problem of fluctuations, we have applied the smoothing technique. Here, we used loess (locally estimated scatterplot smoothing) techniques to smoothen the scatterplot. The advantage of loess smoothing is that it is a bit more flexible than the other smoothing techniques. It does not assume any particular mathematical form, to smoothen the data but allows them to discover the desired form from data itself. We have implemented loess ( ) function in R with 0.2 as the value of the smoothing parameter (20% smoothening). The reason for choosing the parameter equal to 0.2 is that, it is neither too small to eliminate fluctuations in the data nor too large to over-smoothen the scatter plot. In this study, we try to fit the fertility pattern of India and the 6 selected Indian states with four proposed models. For this purpose, based on the socio-cultural similarities, we have classified the Indian states into six geographical regions, depending on their locations; north, central, east, northeast, west and south. And from each of these six regions, one representative state was selected randomly (using random number generation). The region wise classification of states and the 6 selected states from their respective region is given in Table 1. The age-specific fertility pattern of India and the six selected states (Punjab, Uttar Pradesh, West Bengal, Tripura, Gujarat and Karnataka) are obtained from NFHS-4.
In order to evaluate our proposed model in comparison to the Hadwiger model, we have fitted Hadwiger model to the empirical fertility of India and the selected states and them compared the estimates. The functional form of the Hadwiger model is as follows: Here, f(.) is the ASFR at age x and a, b, c are the three parameters. Chandola et al. 3 has discussed the demographic interpretation of these parameters. Parameter a is related to the total fertility, b is associated with the height of the ASFR curve and parameter c is associated with the mean age of childbearing whereas the term ab c is associated with the extent of extreme fertility.
In our model, a handles fertility in the late ages of the reproductive span. All the other parameters have the same interpretation as explained above. If we put a = 0 in the Modified P-K Model, it reduces to a model proposed by Peristera and Kostaki 10 .

Model 2 (Modified Gompertz Model
). The second model that is proposed is an extension of Gompertz model, in which we have added one scale parameter γ . The functional form of Gompertz model is taken from Hoem et al. 6 and Mathivanan et al. 39 , which is given as follows: The three parameter Gompertz model is not quite flexible. Murphy and Nagur 20 showed that it has fixed kurtosis and therefore, it does not fit well to India as well as to the states' ASFR curves. After inclusion of scale parameter γ , the curve fits very well. The proposed model is defined below: m is defined as the lowest marriage age of women in the population. Our model will reduce to Gompertz model as soon as we put γ = 1 in the above model. where ϕ and Φ are the density and the distribution function of the standard normal distribution respectively. λ is the location parameter and σ is the scale parameter. δ is the skewness parameter. The detailed property of this function was first studied by Azzalini 40 . In the modified version of this model, the skewness of the ASFR curve is controlled by δ and the two scale parameters s 1 and s 2 .The proposed model is defined as follows: Here, θ reflects the peaked value of the age specific fertility, λ being the age at which the peaked fertility is being achieved. s 1 and s 2 represents the deviation of the ASFR curve, before and after the peaked fertility respectively.

Model 4 (Modified G-P Model)
:. The fourth model that we have proposed is a variation of the behavioral model, suggested by Gupta and Pasupuleti 41 . The original form of G-P Model 24 is as follows: Here, f(.) is the ASFR at age x, parameter q being responsible for the amplification in the level of the fertility, due to the 'exposure to marriage and sexual union' . Parameter r is responsible for the decline in the level of fertility due to the 'birth control measures and the biological inability to reproduce' . This model overestimates the early age specific fertility and underestimates peak age specific fertility. Hence, we have introduced a constant parameter Estimation of parameters of the models. For the estimation of the proposed models, we have the applied Non-Linear Least Squares Method. Under this method, we have minimized the sum of squares due to error (SSE), with respect to the parameters of the model. SSE function is given by: Here f(x) is the observed ASFR at age x and f (x) is the estimated ASFR from the proposed models. We have used nlminb( ) function in R, for non-linear minimization of SSE function. nlminb( ) function estimates the parameters, along with their asymptotic standard errors. In this study, we did not report the asymptotic standard errors as per the convention 10 .
Performance of the models. While modelling the ASFR, it is necessary to check the performance of different models. Usually, one can plot the predicted values and the observed values of the ASFR from the model and then choose the best fit among all the models. In quantitative terms, SSE should be calculated for each model and the one with the least SSE will be the best fit model. However, this will not take care of the model complexity, resulting from the inclusion of more parameters. Hence in order to keep the model fit and to consider its complexity, we have used corrected Akaike's Information Criterion (AICc) 42 , to select the best performing model in this study. AICc is Akaike's Information Criterion (AIC) 43 adjusted for small sample sizes. AIC for a particular model adjusts the SSE, with the number of parameters and the number of observations in the model. The formula of AICc is as follows: Here, k denotes the number of parameters present in the model and n denotes the number of observations. As n → ∞ , the results of the AICc converge to AIC. The criteria is to choose the model with the least value of AICc, as the best performing model.

Results
Age-specific fertility schedule of India and six selected states. Figure 2 shows the single year ASFR curves for India and the six selected states (Punjab, Uttar Pradesh, West Bengal, Tripura, Gujarat and Karnataka). All the ASFR curves are unimodal and skewed towards the right (positively skewed), i.e. age-specific  . However, the peak of the fertility pattern of Gujarat remains somewhat constant between 21 to 25 years. Tripura has the highest age-specific fertility at the early ages, whereas, Punjab has the lowest level of fertility at the early ages. Uttar Pradesh has the highest level of fertility in the later ages (45-49 years), while Punjab has the lowest level of fertility in the later ages too. It is also observed from the figure, that although Karnataka and West Bengal have approximately the same level of TFR, their ASFR is very much different in shape. From the age of 15-20 years, West Bengal has a higher level of fertility than that of Karnataka; in the later ages, Karnataka has a higher level of fertility than that of West Bengal. The rate of increase from the start of the curve to its peak is highest for West Bengal and least for Punjab.
Performance of the proposed models with respect to the original models. Table 2 displays the performance of the Modified P-K Model vs. P-K model in terms of SSE and AICc values for India and the six selected states. The P-K model has 4 parameters, whereas the Modified P-K Model has five parameters. By comparing SSE and AICc values of both the models, we have found that the Modified P-K Model is working better only for India, Uttar Pradesh and Gujarat. The SSE values of both the models of Tripura are approximately the same. But the P-K Model is preferred over the Modified P-K Model, because it has less number of parameters. Likewise, Table 3 Table 5, it can be seen that the Modified G-P Model has the least SSE and AICc value, as compared to the G-P model in India and all the six selected states. It means that the Modified G-P Model provides better results than the G-P model, for fitting the fertility patterns of India and the selected states. Figures B1 to B4 in the Appendix B, show the observed and the fitted age-specific fertility pattern, under the proposed model and the existing model under study. We have shown these graphs only for Indian fertility schedules.
Tables India's fertility pattern is best predicted by the Modified Gompertz Model, as it provides the best fit among all the other models under comparison (Table A1 and Fig. 3). The Modified Gompertz Model also gives the best fit for the fertility pattern of Uttar Pradesh (Table A3 and Fig. 5) and Gujarat (Table A6 and Fig. 8). The    (Table A4 and Fig. 6), Tripura (Table A5 and Fig. 7) and Karnataka (Table A7 and Fig. 9). On the other hand, the fertility pattern of Punjab is best fitted by the the Modified G-P Model amongst all the other models (Table A2 and  www.nature.com/scientificreports/ We have also explored the possible demographic interpretation of the new parameters added in the proposed models. There is a strong association between the later ages (45-49 years) fertility and the constant parameter a of the Modified P-K Model. The correlation coefficient between these two (r = 0.857) are significant. We have also reviewed the demographic interpretation of the parameters that were already present in the previously existing models. The parameter b in the Modified P-K Model is significantly related to the extent of peak fertility (r = 0.988). Parameter μ in the Modified P-K Model is related to the women's age at extreme fertility, whereas σ 1 and σ 2 describes the spread of the ASFR curve before and after the peaked fertility respectively. All the fertility schedules are skewed towards the right. For the Modified Gompertz Model, we find that parameter γ is strongly related to the extent of extreme fertility and the term is significantly correlated (r = 0.998) to the total fertility rate. Notwithstanding, we could not test for the association of parameter m with the lowest age of childbearing in the Modified Gompertz Model. There is a great resemblance between the parameters θ and λ from the Modified Skew Normal Model and b and μ from the Modified P-K Model. Both the estimates are quite close and hence have similar interpretations. θ is strongly associated with level of extreme fertility (r = 0.985) whereas λ is related to the age at modal fertility. The parameters s 1 and s 2 of the Modified Skew Normal Model represents the spread of the ASFR curve. Estimates of the parameter δ in each case was found to be positive under the Modified Skew Normal Model. It means that ASFR curve is positively skewed. The parameter s in the Modified G-P  Parameter q is strongly associated with the age of women at extreme fertility (r = 0.968). However, we could not find reasonable interpretations for the parameters p and r of the Modified G-P Model. Further, parameter a in Hadwiger model is significantly associated with the total fertility (r = 0.997). The other parameter c in Hadwiger model is closely related to the mean age of childbearing However, the interpretation of the parameter b in the Hadwiger model as given by Chandola et al. 10 could not be justified in terms of the height of the fertility curve.

Discussion
India is a vast country comprising 28 states and 8 union territories, with wide demographic and socio-cultural diversity. Many of the Indian states have higher population than several countries of the world. Comparing the fertility schedules of the Indian states is an extensive exercise. In comparison to the previous studies 7, 29 done on the modelling of Indian fertility schedule, our study paid more attention to model with the single-year ASFR of India and the six states. Our study is also distinct from the Pandey et al. 28 , as it proposes four parametric models for single-year ASFR. Pandey et al. 28 also fitted the data of single-year ASFR through polynomial functions. Results of our study conclude that the Modified P-K Model performs better than P-K model only if the tail of the fertility curve has some amount of fertility. P-K model provides best fit for the countries with non-enhanced   www.nature.com/scientificreports/ feature of the fertility curve of West Bengal and Tripura is that, both have a high level of fertility in the early ages (15)(16)(17)(18)(19) and the Modified P-K Model gives a considerable amount of fertility in the early ages. While an experiment of Hoem et al. 6 to fit Gompertz curve to Danish fertility schedules for the calendar years 1962-1971 was found to be unpromising, the Modified Gompertz Model provided best fit for India (moderate level TFR = 2.2), Uttar Pradesh (High-level TFR = 2.8) and Gujarat (Low-level TFR = 2.0), i.e. it can reproduce all types of fertility schedules, where the total fertility is greater or equal to two. However, it is observed that it works better for those fertility curves, where the decline in the level of fertility after the peak is not so rapid.
Mazzuco and Scarpa showed that the skew-normal density is quite good for modelling the unimodal fertility curve. Skew-normal density has outperformed Gamma, Hadwiger and P-K model for the United States fertility data of 1963 11 . Our study has confirmed that the Modified Skew Normal Model has worked better than the original skew-normal model for India and all the other six states. Gupta and Pasupuleti have examined the applicability of G-P model on the cohort fertility schedules of India, Finland, Sweden, Netherlands, Switzerland and USA and the period fertility schedules of Bhutan, Iran, Somoa and Algeria 41 . The results of their study concluded that the fit was inferior to Gamma and Hadwiger model, but superior to the Gompertz model. However, the differences were marginal 41 . The Modified G-P Model provided better fitting than that of G-P model for the fertility schedule of India and all the six states considered in our study.
Although Punjab also has a low level of total fertility, the Modified G-P Model has provided the best fit. The reason behind the failure of the Modified P-K Model, for the best fit of Punjab might be that, the ASFR curve of Punjab has a shallow level of fertility in the early and the later ages. The inclusion of constant parameters in the Modified P-K Model and the Modified G-P Model, gives more weights of fertility in the tail of the fertility curve,  show that the Hadwiger model overestimates the early and the age-specific fertility around the peak. Though the attempt to fit the proposed models to the recent single-year ASFR curve has met some success, there is great scope of work for other researchers to find the suitability of these models for grouped and order-specific ASFR and the past fertility patterns of India and its varied states.
Most of the studies 7,10,16,22 have either used AIC or residual sum of squares criterion to evaluate the performance of the proposed models. However, our study has used the corrected version of AIC criterion (AICc). Burnham and Anderson 44 suggested that, AICc should be preferred over AIC when the ratio of the number of observations (n) and the number of parameters (k) is less than 40. In our study, the number of observations is 35 and the number of parameters is either 3, 4 or 5. So in each case, the ratio will be less than 40. Hence, the use of AICc is justified here. This also makes our study distinct from other studies.
It has been a subject of great discussion, whether the period or cohort ASFR should be adopted for the illustration of the model fit. Our results were based on the period fertility only. However, there is a lot of scope for researchers to fit our proposed models on cohort fertility. We have considered period data, since period data is easily available for analytical purpose and it also enables us to evaluate and examine the influence of recent demographic events and targeted interventions. On the other hand, cohort data either suffers from attrition or difficulty to reconstruct. Additionally, cohort data is concerned with the long term development, as cohort trends would be observable only if it is followed up for a longer period of time.

Conclusion
The models that we have proposed are extensions to P-K Model, Gompertz model, generalized skew-normal function and Gupta and Pasupuleti model. These proposed models were compared among themselves and with their previously existing forms. Results of this study suggests, that all the proposed models are quite flexible and are able to produce a close fit to the variable type of the observed fertility pattern. It is pertinent to mention, that all the proposed models are more efficient than the original models considered in this article and the Hadwiger model, in terms of the goodness of fit for fertility schedules of India and its states. The demographic interpretation of the parameters of the model is also explored in this study. It is evident from the demographic interpretation of the parameters that the demographic profile of the Indian states is much more diversified. Although our proposed models have illustrated the fitting of fertility schedules of India on the whole and for the six states only, it was intended to provide an important channel, for the policymakers and the government authorities, to project the fertility rates and to understand the fertility transitions in India and its various states.