## Introduction

One of the leading challenges in the field of aging research is measuring age accurately. Monitoring healthy individuals for decades to assess whether an intervention affects the aging process is prohibitive in terms of time and funding. The creation of epigenetic clocks, age predictors that use DNA methylation data, has given researchers tools to measure the aging process quantitatively. Moreover, recent works have demonstrated the effectiveness of precise epigenetic editing based on CRISPR with targeted DNA methylation or demethylation1. Consequently, epigenetic clocks have the potential of not only measuring aging but also guiding epigenetic interventions.

Notably, two of the most well-known predictors are the ones developed by Hannum et al.2 and Horvath3 in 2013. Hannum et al. developed a blood-based epigenetic clock with 71 CpG sites2. Then Horvath showed epigenetic clocks could also accurately predict age across tissues, developing a predictor with 353 CpG sites3. Horvath’s model has been widely used as it is seen as the state-of-the-art pan-tissue epigenetic clock for humans4,5,6,7. Both of these works used simple regularized linear regression (ElasticNet) for feature selection and prediction8. More recent epigenetic clocks that predict mortality also use a linear combination of features9,10. ElasticNet has been widely used to develop epigenetic clocks2,3,9,10,11,12,13. Nevertheless, simple linear regression can display high bias and cannot capture non-linear feature-feature interactions in the data.

Interactions among variables can be taken into account by expanding the feature space with feature multiplication. However, incorporating pairwise CpG site interactions is unfeasible given the high dimensionality of the DNA methylation data. Horvath’s model3 selected 353 CpG sites out of total 21,368 sites. If the linear regression had taken into account all pairwise interactions, the feature space would grow to over 228 million. A large number of features is especially challenging due to the relatively low number of publicly available DNA methylation samples. Given the complexity of the epigenetic regulatory network, it is likely that important interactions among CpG sites are not captured in the current epigenetic clocks developed thus far.

Deep learning models have been successfully applied to transcriptomic and clinical blood biomarker data for age prediction14,15. For DNA methylation data, Galkin et al. recently showed that a deep neural network model, DeepMAge16, gave slightly better prediction performance than Horvath’s model in blood samples. However, the authors compared Horvath’s pan-tissue predictor to a model trained only in blood DNA methylation data. Moreover, there was no in-depth exploration of why their deep learning model outperformed the ElasticNet model. Similarly, Levy et al.17 developed a deep learning framework to work with DNA methylation data that encodes the CpG sites into latent features for downstream analysis. They showed encouraging results for age prediction using a multi-layer perceptron; however, they investigated only one data set obtained from white blood cells. Therefore, currently, our understanding of the advantages of neural networks for this task in a pan-tissue setting is limited.

We introduce AltumAge, a deep neural network that uses beta values from 20,318 CpG sites common to the Illumina 27k, 450k and EPIC arrays for pan-tissue age prediction (summarized in Fig. 1a). We hypothesized that a neural network using all available CpG sites would be better suited to predict pan-tissue age using DNA methylation data due to their ability to (1) capture higher-order feature interactions and (2) leverage important information contained in the thousands of CpG sites not selected by ElasticNet models. AltumAge uses multi-layer perceptron layers (similar to16,17) that account for non-linear interactions by combining multiple features into each node of the network. We trained AltumAge on samples from 142 different experiments, which, to our knowledge, is the largest compilation of DNA methylation data sets for human age prediction. The publicly available data were obtained from multiple studies that used Illumina 27k and Illumina 450k arrays. The code for the model can be found in our GitHub repository (https://github.com/rsinghlab/AltumAge).

We show that AltumAge has a significantly lower error for within-data set age prediction, is better able to generalize to new tissue types and older ages for cross-data set settings, and is more resistant to noise than ElasticNet. For inference, we apply the Shapley-value-based interpretation method, called SHAP18, on AltumAge to determine the contributions of different CpG sites towards age prediction (summarized in Fig. 1a). We confirm that the most important CpG sites have complex interactions resulting in non-linear relationships when predicting age. Such interactions may lead to mechanistic hypotheses on how the epigenetic network interacts to drive the aging phenotype. Additionally, we find that the most important CpG sites are proximal to CTCF binding sites. However, CpG sites in known age-related pathways (SIRT, mTOR, and AMPK) do not seem relevant for age prediction. Finally, our downstream analysis reveals that AltumAge predicts higher age for tumors, cells with cellular immune dysfunction, cells with mitochondrial dysfunction, cells with high passage numbers, and samples with multiple sclerosis, type 2 diabetes, HIV, among others, than controls demonstrating its usefulness for age-related studies. Overall, we show that deep learning can improve both the performance and interpretation over the widely used ElasticNet-based models and present AltumAge as a useful tool for age prediction.

## Results

### Performance Evaluation

#### Model selection

Neural networks can capture complex variable interactions when provided with a large number of high-dimensional data sets. We hypothesized that the same would be true for age prediction with DNA methylation data.

For each of 142 data sets, we split the total samples—60% for training and validation (n = 8050) and 40% for testing (n = 5455)—to avoid introducing any bias in the age, gender, and tissue type distributions. In the training and validation set, the data was further subdivided by data set, with 85 (n = 4394) for model selection and 57 (n = 3656) for validation. Supplementary Figure 1a shows a schematic of the division of the data, with the full list of data sets used in Supplementary Note 1. Several machine learning models were trained and validated (Supplementary Table 1, Supplementary Figure 2a) to pick the best performing models according to the mean squared error (MSE). We also included other evaluation metrics such as the median absolute error (MAE) and Pearson’s correlation coefficient (R). Our primary baseline method is the ElasticNet model, a linear model with implementation following Horvath3. We also included two traditional machine learning methods that capture non-linear relationships in the data, random forest and support vector regression; however, they performed poorly on the validation set (MAE = 6.206 and 10.903, respectively). In addition, we tried the neural network TabNet, an attentive interpretable tabular learning method19. Although the MSE was marginally better than the ElasticNet baseline (MSE = 62.113 vs. 67.981), the MAE was slightly worse (MAE = 4.027 vs. 3.708). Lastly, we tuned the hyperparameters of the neural networks based on recent findings that highly regularized deep learning methods excel in tabular data prediction20. The best neural network model based on both the MAE and MSE was dubbed AltumAge (from the Latin altum, meaning “deep”). For clarity, a regularized linear regression model with Horvath’s log-linear age transformation, trained on our 142 data sets, and using the built-in hyperparameter tuning from the Python glmnet will be referred to as ElasticNet. On the other hand, the application of Horvath’s original 2013 epigenetic clock, originally trained on 39 data sets in that paper3, will be referred to as Horvath’s model.

#### AltumAge outperforms linear models for within-data set age prediction

Differences in performance among epigenetic clocks can generally be explained by three factors: the DNA methylation data, the model, and the input CpG sites (or the features). For within-data set age prediction, all of the training and validation set (n = 8050) was used to train all of the models while the model and input CpG sites were varied. Supplementary Figure 1b displays a simple schematic of the within-data set age prediction. Main results are presented in Table 1 (detailed in Supplementary Table 3), with a qualitative comparison in Supplementary Figure 2b.

We hypothesized that our large and diverse DNA methylation data might improve performance compared to other epigenetic clocks irrespective of model type, adding a confounding variable to any performance improvement seen with AltumAge. To understand the magnitude of such effect, we compared a replication of Horvath’s model as seen in3 with a linear regression trained on our 143 data sets using the same set of 353 CpG sites. Indeed, the regression trained with our data has a lower error (MAE = 3.011 vs. 3.530; MSE = 46.867 vs. 71.031). ElasticNet, with its selected 903 CpG sites trained with our data, further improves the performance (MAE = 2.621, MSE = 39.198). This result shows that a larger training data set helps the age prediction performance.

Next, we aimed to determine whether the model type, i.e., a linear regression vs. a neural network, would significantly impact the performance. We, therefore, compared the aforementioned linear models with the neural network AltumAge using the same set of features. AltumAge outperformed the respective linear model with Horvath’s 353 CpG sites (MAE = 2.425 vs. 3.011, MSE = 32.732 vs. 46.867) and ElasticNet-selected 903 CpG sites (MAE = 2.302 vs. 2.621, MSE = 30.455 vs. 39.198). This result shows that AltumAge outperforms linear models given the same training data and set of features.

Lastly, to compare the effect of the different sets of CpG sites, we trained AltumAge with all 20,318 CpG sites available and compared the results from the smaller sets of CpG sites obtained above. There is a gradual improvement in performance for AltumAge by expanding the feature set from Horvath’s 353 sites (MAE = 2.425, MSE = 32.732) to 903 ElasticNet-selected CpG sites (MAE = 2.302, MSE = 30.455) to all 20,318 CpG sites (MAE = 2.153, MSE = 29.486). This result suggests that the expanded feature set helps improve the performance, likely because relevant information in the epigenome is not entirely captured by the CpG sites selected by an ElasticNet model.

Overall, these results indicate that even though more data samples lower the prediction error, AltumAge’s performance improvement is greater than the increased data effect. Indeed, the lower error of AltumAge when compared to the ElaticNet is robust to other data splits (Alpaydin’s Combined 5x2cv F test p-value = 9.71e−5).

A direct comparison of AltumAge and Horvath’s model reveals that AltumAge has fewer tissue types with a high MAE. In his 2013 paper, Horvath noticed poor calibration of his model in breast, uterine endometrium, dermal fibroblasts, skeletal muscle, and heart3. In our test data, a similarly poor predictive power was found for these tissue types for Horvath’s model (breast MAE = 9.462; uterus MAE = 5.798; fibroblast MAE = 10.863; muscle MAE = 9.047; heart not included). AltumAge, on the other hand, had much lower errors for them (MAE = 3.681, 3.660, 4.089, 2.540 respectively). Furthermore, Horvath’s model had an MAE > 10 years in 22 tissue types in the test data. AltumAge, on the other hand, had an MAE > 10 in only three tissue types.

Supplementary Figure 3 in particular shows how AltumAge, in contrast to Horvath’s model, does not underestimate older ages (>60 years) to such an extent (median error = −2.808 vs. −4.677). Better performance in older age is fundamental in defining biomarkers of age-related diseases of which age is the biggest risk factor. Horvath’s model tends to underestimate such population partly due to CpG saturation (beta value approaching 0 or 1 in certain genomic loci)21. Another reason might be the assumption that age-related CpG changes are linearly correlated with age after 20 years of age. AltumAge resolves these two problems by incorporating an expanded feature set and not using any pre-defined age transformation function that might inject bias in the data processing.

Of note, we were unable to compare AltumAge with DeepMAge16, another deep learning framework. Unfortunately, neither the code for DeepMAge nor a complete description of its architecture is available.

#### AltumAge is more generalizable than ElasticNet in older ages and in non-blood tissue types

Leave-one-data-set-out cross-validation (LOOCV) provides a way to understand the generalization potential of a model to new unseen data sets. We performed this LOOCV analysis by leaving out the training samples of each data set (out of the 143) during model fitting. Therefore, the model training was performed using the training set of 142 data sets. Next, we evaluated the performance of this model on the test set of the left-out data set. Consequently, we trained 142 different models in total to evaluate the LOOCV performance for all 142 data sets (Table 2, Supplementary Figure 2c). Supplementary Figure 1c displays a simple schematic of data flow for the LOOCV age prediction.

Since AltumAge uses 20,318 CpG sites, we expected it to be more prone to noise and overfitting than a model with low variance such as ElasticNet, which effectively uses only a subset of CpG sites. Nevertheless, we see that AltumAge performs better than ElasticNet in MSE (63.022 vs. 67.317, Wilcoxon signed-rank test p = 0.0018) and has a slightly lower MAE (MAE = 3.194 vs. 3.204, Wilcoxon signed-rank test p = 0.04).

Next, we analyzed the results stratified by absolute error, age, data set, and tissue type. Fig. 2a shows a scatter plot of the LOOCV absolute error for each sample according to AltumAge and ElasticNet. Points above the black line favor ElasticNet while the opposite favors AltumAge. As shown by the 100-window rolling mean line, for samples with an absolute prediction error > 3.457 years, on average, AltumAge performs better. This observation is particularly apparent for large deviations. This result indicates, alongside the lower MSE, that AltumAge is more resistant to outliers than ElasticNet when generalizing to new samples.

Stratifying the results by age can give insights into particular strengths and weaknesses of each model. For example, while both models capture the age-related epigenetic drift given the correlation between absolute error and age (AltumAge Pearson’s R = 0.443; ElasticNet Pearson’s R = 0.459), AltumAge performs better on average for samples with age > 59 years (Fig. 2b). This result suggests that AltumAge better captures epigenetic changes during aging while ElasticNet better understands the developmental epigenome, since epigenetic changes during childhood and puberty are related to development but after they are mostly due to aging22.

Finally, we analyzed the performance of each model by data set and tissue type. As shown in Fig. 2c, AltumAge performs better than ElasticNet in the LOOCV MAE for data sets with a higher error in performance, i.e., that are more challenging to predict. In comparison, ElasticNet is superior for the ones with a lower error. We observe that most data sets with a low MAE are from newborns or blood samples, and the training set is skewed towards blood-based samples (see Supplementary Figure 4). Therefore, we hypothesized that ElasticNet may be simply performing better for overrepresented tissue types in the training set. To check this, we looked at the tissue types from data sets for which ElasticNet had a 50% worse LOOCV MAE than AltumAge (capturing a large deviation) and vice versa. As expected, ElasticNet does not generalize as well to a large variety of tissue types (Fig. 2d). At the same time, it performs better in blood-based samples (Fig. 2e). These observations imply that AltumAge can better generalize to more tissue types, likely capturing global age-related epigenetic patterns, while ElasticNet could be focusing primarily on blood changes.

As each model has its benefits and drawbacks, we checked the performance of an ensemble of both methods. Interestingly, we observe a substantial decrease in both MAE (2.934) and MSE (58.986) by averaging the predictions of both models. These results indicate that combining deep learning and linear model predictions may further improve the age prediction performance.

#### AltumAge is more robust to noise than ElasticNet

Another desired property of epigenetic clocks is reliability. Noise derived from the experimental procedure, biological or technical replicates may negatively influence the model’s reliability. AltumAge was trained with Gaussian noise and adversarial regularization to be more robust against random variation23. Gaussian noise introduces normally distributed fluctuations in between hidden layers. Adversarial regularization includes artificial observations with subtle modifications in the loss function that attempt to fool the model into increasing the error. To assess the robustness of AltumAge and ElasticNet to noise, we gradually added artificial Gaussian noise in the beta value of each CpG site up to one standard deviation in the within-data test set and tracked MAE (Fig. 3a) and MSE (Fig. 3b). As expected, the error grows much faster in the ElasticNet model, particularly with the MSE, which is more swayed by outliers.

Furthermore, we examined an independent whole blood data set GSE55763 (not used in training or testing), which contains 2 technical replicates for each of its 36 samples. Ideally, the difference in prediction between the replicates would be zero. As shown in the histogram in Fig. 3c, the median absolute difference for AltumAge is 1.537 years, whereas for ElasticNet, 1.719 years, while the maximum absolute difference is 4.756 and 5.862 years, respectively. Despite no significant difference in distributions–likely due to the small sample size–the models differ in whether they capture an artifact effect between replicates. As anticipated, we do not observe a statistically significant effect from replicate one to two for AltumAge (linear mixed-effects p = 0.720). However, we see that ElasticNet predicted a higher age of 1.130 years for replicate two (linear mixed-effects p = 0.002). Overall, the results highlight that resistance to random noise may help in real-world scenarios, increasing model robustness and reliability.

### Inference

Neural networks, particularly in the context of deep learning, used to be seen as “black-box” methods, as their interpretability was difficult. Regardless of the predictive power of ElasticNet models, they are easily understandable. Recently, various methods have been proposed to extract the contribution of features towards prediction in neural networks. They include interpretation based on model gradients24,25,26, attention27, among others. One such inference method is SHAP18, which uses a game-theoretic approach to aid in the explanation of machine learning methods. It can measure how one feature contributes to the output of deep neural networks. For our case, the SHAP value can be conceived as how much the value of one CpG site affects the age output of the model in years. Through the architecture of neural networks, it can also determine which CpG sites most highly interact with each other.

We present results for model inference using SHAP that assist with understanding AltumAge. To support the results obtained by SHAP, we also applied another method of determining feature importance called DeepPINK28 (see Supplementary Information).

#### AltumAge captures relevant age-related CpG-CpG interactions

Epigenetic modifications can significantly influence gene expression. They can also impact genes that affect other epigenetic changes. Therefore, some CpG sites interact with others through the gene expression network and can work in tandem. Through SHAP, we show that AltumAge can measure how hyper- or hypomethylation of secondary CpG sites affects the relationship of a CpG of interest and age. Supplementary Figure 5 shows scatter plots of the nine most important CpG sites based on SHAP-based importance values assigned to the CpG sites of the samples in the test set. These nine CpG sites account for 1.83% of the total model importance (Supplementary Figure 6). The dependence plots show both the relationship of a CpG site with the predicted age and how that relationship can be affected by the value of a secondary CpG site for a DNA methylation sample. This secondary CpG site has the highest interaction with the CpG of interest, as determined by SHAP values. One way to understand the effect of the secondary CpG site is to focus on the samples in the top and bottom deciles of its methylation value, looking for any differences that may arise due to hyper- or hypomethylation respectively. We categorized three different types of relationships between CpG site methylation value and age: (1) completely linear, which are independent of CpG-CpG interactions; (2) bivalently linear, whose slope is dependent on a secondary CpG site; and (3) non-linear, affected by a secondary CpG site.

Out of the top nine CpG sites, only cg04084157 (Fig. 4a), the fourth most important, shows an almost completely linear relationship. We subdivided the samples in the test set into the top and bottom deciles for cg22736354, the most highly interacting CpG site. Both subsets display linear relationships despite its heteroscedasticity or unequal scatter (Fig. 4b), though with slightly different regression coefficients (Z-testFootnote 1p = 1.75e−14). The main interaction effect appears to be that when cg22736354 is hypermethylated, so is cg04084157, and when cg22736354 is hypomethylated, so is cg04084157. Of note, we repeated the aforementioned analysis for all top 1000 CpG sites but did not find a single CpG site without a statistically significant interaction, i.e., a difference in the linear regression coefficients for the top and bottom deciles with p > 0.05.

The effect of cg10523019 (Fig. 4c), the fifth most important CpG site, displays a bivalently linear relationship. The regression coefficient when cg19722847, the most highly interacting CpG site, is hypomethylated (coef = 0.653) is more than twice when it is hypermethylated (coef = 0.290, Z-test p < 1.0e−300, Fig. 4d). This dual response may also shine a light on relevant age-related biological processes. cg19722847 is located in the gene IPO8, a gene that participates in nuclear transport, and cg10523019 lies in RHBDD1, a gene involved in proteolysis and apoptosis. IPO8 is known to interact with transcription factors NF-κB and FOXO3 by allowing their nuclear transport and hence activation29,30. Indeed, RHBDD1 is a known target of FOXO331. The methylation status of cg10523019 becomes less relevant as transcription of RHBDD1 may become deficient due to lack of IPO8–and nuclear FOXO3–regardless of RHBDD1 methylation. Laboratory experiments would have to be performed to more thoroughly characterize these relationships; however, it is possible to obtain data-driven hypotheses from these dependence plots.

An example of a non-linear CpG-age relationship comes from the seventh most important CpG cg12373771 (Fig. 4e). When cg07099407, the most highly interacting CpG site is hypermethylated, the relationship between cg12373771 and age output is still linear. However, when it is hypomethylated, it becomes slightly non-linear (Fig. 4f). While Pearson’s correlation coefficient for a straight line is high (0.961), the residual plot (Supplementary Figure 7) shows underestimation at the boundaries with overestimation in the center. This pattern demonstrates non-linearity; a cubic regression corrects the bias of under and overestimation and increases the correlation coefficient to 0.980. Overall, our results using SHAP values demonstrate that AltumAge captures the non-linear interaction between CpG sites.

Note that despite their important effects on the predicted age, some of the CpG sites that interact with the most important CpG sites may themselves not be particularly relevant for the output. For example, the aforementioned cg07099407, the CpG with the highest interaction with the seventh most important CpG site, ranks 6315 and 3975 according to SHAP and DeepPINK, respectively, out of 20,318. These results suggest that ElasticNet may miss DNA loci that regulate other loci in aging, and this may partly explain AltumAge’s performance improvement compared to ElasticNet.

#### Characterization of CpG sites by model interpretation

CCCTC-Binding factor (CTCF) is a transcription factor involved in the negative regulation of several cellular processes. It also contributes to long-range DNA interactions by affecting chromatin architecture. We examined whether CpG sites with a higher SHAP importance were closer to CTCF binding sites. The 353 important CpG sites selected by Horvath’s model were not closer to the CTCF binding sites when compared to the 21,368 control CpG sites from which the paper’s ElasticNet model was trained (Mann-Whitney U-test p-value = 0.991). As for AltumAge, since it uses all of the 20,318 CpG sites as features, we compared the top 903 CpG sites to the control, as the ElasticNet model applied on the full data set selects 903 sites as important. These sites comprise 45.3% of SHAP importance. In line with previous studies32,33,34, we find that the selected important CpG sites are overwhelmingly closer to CTCF binding sites (Mann-Whitney U-test p-value = 0.041, Supplementary Figure 8). This observation suggests that epigenetic alterations proximal to such loci that are involved in chromatin packing by affecting CTCF binding may be captured by AltumAge. This result is relevant because chromatin structure modifications have been associated with aging (see review35).

Due to the close relationship between chromatin and aging, we hypothesized that different chromatin states would influence the importance of each CpG site. ChromHMM is a Hidden Markov Model used for the characterization of chromatin states36. Annotations for several cell lines and tissue types are widely available online. Since AltumAge is a pan-tissue epigenetic clock, we used the mode of the 18-state annotation from 41 different tissues obtained from ENCODE for each CpG location37 (Supplementary Figure 9, Supplementary Table 2). CpG SHAP importance values are indeed impacted by ChromHMM state (Kruskal−Wallis H-test p-value = 2.03e−78). The chromatin states with the highest and second highest SHAP normalized median importance were ZNF genes and repeats (SHAP importance = 1.68e−03%, top 73th percentile of all CpG sites) and heterochromatin (SHAP importance = 1.68e−03%, top 58th percentile). Of note, the chromatin states with the highest DeepPINK normalized median importances were also heterochromatin (most important) and ZNF genes and repeats (second most important). Heterochromatin is particularly interesting as several interventions that increase chromatin packing also extend lifespan35. The full ChromHMM results can be found in Supplementary Note 2.

#### Aging-related pathways

One of the main interpretation advantages of AltumAge compared to ElasticNet is that the former effectively uses a much larger feature space. CpG sites in aging-related genes are often not selected within the couple hundred features of an ElasticNet model, thus making analyses of these CpG sites of interest impossible. AltumAge allows a closer look at the relationship of CpG sites in aging-related pathways even when these CpG sites are not particularly important for the final age prediction. It is worth analyzing the relative importance of CpG sites in well-known age-related pathways such as SIRT, mTOR, and AMPK38,39,40.

Unexpectedly, most of the CpG sites in SIRT genes do not appear relevant, at least directly, for age prediction using AltumAge. Located in SIRT2, cg27442349, accounting for 0.026% of the total SHAP importance and ranked 1119, has the highest SIRT SHAP importance value (Supplementary Figure 10a).

Out of the 67 proteins participating in the mTOR signaling pathway according to the PID Pathways data set41, cg11299964, located in MAPKAP1, has the highest SHAP importance of 0.066%, ranking 148. Surprisingly, mTOR was not particularly relevant, with its most important CpG site being cg04508649 (SHAP importance = 0.0062, rank 7705) (Supplementary Figure 10b).

In terms of the AMPK pathway, out of CpG sites in genes for the proteins that directly activate or inhibit AMPK from the KEGG database42, cg22461835, located in ADRA1A, has the highest SHAP importance of 0.061%, ranking 169 (Supplementary Figure 10c). Most, however, ranked below 1000.

We also performed KEGG pathway analysis on the genes related to the top-ranking nine CpG sites using KEGGMapper43. We found the following genes associated with four of them–NHLRC1, involved in proteolysis; KLF14, associated with type 2 diabetes; BCO1, involved in metabolic pathways, including biosynthesis of cofactors; and FZD9, involved in a range of age-related diseases, including cancer and neurodegeneration. Note that DNA methylation affects gene expression depending on its position. A methylated CpG site in an enhancer, promoter, or gene body may impact gene regulation differently. These findings show how methylation in specific loci in aging-related pathways can contribute to age prediction. This insight may not be possible to obtain using ElasticNet due to its focus on selecting only the most important CpG sites related to aging. For example, only cg11299964 (from MAPKAP1 mentioned above) was present among the 353 sites selected by Horvath’s model.

### Potential biological applications

The age acceleration, defined as the predicted age minus the real age, of epigenetic clocks have been shown to be related to several biologically relevant events and characteristics, such as obesity44, menopause45, diet46, heart disease47, anxiety48, and even socioeconomic status49, among others. Given the observed performance of AltumAge, we explore its applicability to such studies, for which Horvath’s model has been a popular choice. Therefore, we assess AltumAge’s age acceleration performance with respect to the Horvath’s model in the following sections to understand the usefulness of its age prediction for downstream analyses.

#### AltumAge predicts higher age acceleration for certain pathologies

To determine whether some diseases accelerate aging, we investigated the separated, diseased samples (not used in training, validation, or testing) from 41 data sets described in Supplementary Note 1. We then compared the age acceleration of these separated samples to the age acceleration of the test samples from the same data set in a pairwise fashion.

There were 12 statistically significant (Mann-Whitney U-test p-value < 0.05) comparisons according to AltumAge, 11 of which predicted higher age acceleration for the diseased group (Fig. 5a). These include ectodermal samples from patients with autism (E-GEOD-50759, p-value = 3.41e-10), blood samples from patients with HIV (E-GEOD-67705, p-value = 2.93e−7), fetal samples from several pregnancy disorders (E-GEOD-74738, p-value = 3.29e−6), brain samples from patients with autism (E-GEOD-63347, p-value = 7.12e−5), brain samples from patients with multiple sclerosis (GSE40360, p-value = 0.00251), liver samples from patients with non-alcoholic fatty liver disease (E-GEOD-48325, p-value = 0.00431), brain samples from patients with HIV (E-GEOD-59457, p-value = 0.0131), pancreatic cells from patients with type 2 diabetes (E-GEOD-21232, p-value = 0.0158), blood samples from patients with Down syndrome (E-GEOD-52588, p-value = 0.0197), peripheral blood mononuclear cells from patients with autism (E-GEOD-27044, p-value = 0.0232), and blood vessel samples from patients with atherosclerosis (E-GEOD-62867, p-value = 0.0426). In Horvath’s case, there were 10 statistically significant comparisons, eight of which predicted higher age acceleration for the diseased group (Fig. 5b).

Given the high number of pairwise comparisons, it is likely that a few will be statistically significant by chance, as the expected number of false null hypothesis rejections for 41 pairwise comparisons with a p-value threshold of 0.05 is 2.05. Indeed, AltumAge predicted a lower age acceleration for blood samples of patients with schizophrenia (E-GEOD-41169, p-value = 0.0161) and Horvath’s model predicted a lower age acceleration for blood samples of patients with osteoporosis (GSE99624, p-value = 0.00531) and in sun-exposed skin (E-GEOD-51954, p-value = 0.0181); these may be artifacts.

Of note, there was no statistically significant age acceleration for both AltumAge and Horvath’s model in several data sets that include patients with obesity, Crohn’s disease, schizophrenia, asthma, chronic obstructive pulmonary disease, among others. Overall, however, these observations indicate that AltumAge predicts higher age acceleration for certain pathologies and may indicate which ones are epigenetically age-related.

#### AltumAge predicts higher age acceleration for cancer

Cancer cells display several genetic and epigenetic aberrations which have been related to aging and mortality by epigenetic clocks50,51,52. Liu et al.53 have reported that some age predictors consistently estimate higher age acceleration for tumors, whereas others show tissue-specific behavior. Therefore, we examined the age acceleration of cancer samples from 14 data sets comprising 10 tissue types in total for AltumAge, using Horvath’s model as a benchmark (Fig. 6). Overall, Horvath’s model was not able to differentiate between normal and tumor samples (Mann–Whitney U-test p-value = 0.156, Fig. 6a). Its median age acceleration for cancer was marginally higher in seven tissue types and lower in another three (breast, prostate, colon). AltumAge, in contrast, predicts overall higher age acceleration for cancer when compared to normal tissue by 4.542 years (Mann–Whitney U-test p-value = 8.28e−12, Fig. 6b). The median age acceleration was higher for tumors in every single tissue type examined except colon, and the mean age acceleration was higher for all tumors. These results indicate that AltumAge can generally differentiate between normal and cancerous tissue by predicting a higher age acceleration and could be useful to studies focusing on the relationship between cancer and aging. Note that the age accelerations of both models had a much higher overall variance in cancer versus normal tissue (AltumAge Levene’s test p-value = 7.32e−10; Horvath’s model Levene’s test p-value = 3.63e−12). This observation might be the consequence of a more chaotic epigenome in tumors.

#### AltumAge differentiates cells with age-related hallmarks

To assess whether AltumAge can capture biologically relevant changes in vitro, we examined independent data sets (not used for training, validation, or testing) to understand the effect of cell passage number, cellular senescence, mitochondrial depletion, NLRP7 knockdown, transient cellular reprogramming, and continuous cellular reprogramming. We compared AltumAge with Horvath’s model as a benchmark.

We looked into a study (with accession ID GSE3065354) which contains information of induced pluripotent stem cells (iPSCs) and embryonic stem cells (ESCs) by passage number. We observe that AltumAge detects a correlation (Pearson’s R = 0.351, p-value = 2.85e−06) between the predicted age and the passage number. As shown by Fig. 7a, cells begin with a slightly negative age that increases as they are passaged. Horvath’s model also detects a significant correlation (Pearson’s R = 0.273, p-value = 3.245e−04), albeit it is weaker when compared to AltumAge. The increase in age with passage number is also more subtle. The effect of passage number on predicted age indicates that AltumAge might be sensitive to cellular exhaustion due to passaging cell.

Given that cellular senescence is a well-known hallmark of aging and can be caused by excessive replication in vitro, we analyzed a study (with accession ID GSE9106955) which provides methylation data from fibroblasts at different stages of senescence (Supplementary Figure 11). Early passage cells did not have a statistically different predicted age with AltumAge (regression coefficient t-test p-value = 0.078), and only slightly with Horvath’s model (regression coefficient t-test p-value = 0.046). None of the other comparisons were statistically significant.

Mitochondrial dysfunction is another important hallmark of aging. A study (with accession ID GSE10024956) contains data on 143B cells chronically depleted of mitochondrial DNA (rho0-). AltumAge predicts a higher age for cells with mitochondrial dysfunction (regression coefficient t-test p-value = 0.036, Fig. 7b). Horvath’s model also predicts a suggestive increase (regression coefficient t-test p-value = 0.057).

The NLRP gene family of receptors, primarily expressed in immune cells, is involved in the normal response to inflammation. Mutations in some of these genes are involved in immune system malfunction, excessive inflammation, and disease57,58,59, suggesting it may also have ramifications in aging. Moreover, cells with NLRP7 knockdown display aberrant CpG methylation patterns60. We, therefore, analyzed DNA methylation data from H9 ESCs (accession ID GSE4572760) with or without NLRP7 knockdown. AltumAge and Horvath’s model predict a higher age for knockdown cells (regression coefficient t-test p-value = 0.009 and 0.019, respectively, Fig. 7c). When H9 cells are exposed to BMP4 differentiating medium, both models are still able to capture the increase in age (regression coefficient t-test p-value = 0.024 and 0.013 respectively, Supplementary Figure 12).

Cellular reprogramming is currently the most effective way to reduce the biological age of a cell. Thus, we analyzed the methylation from a study (with accession ID GSE5484861) to check whether reprogramming with the four Yamanaka factors (Oct-3/4, Sox2, Klf4 and c-Myc) would lead to rejuvenation. Indeed, there is a steady decrease in predicted age for AltumAge until day 20 (Fig. 7d). Horvath’s model, intriguingly, displays an initial increase in predicted age at day 3. This likely artifact is not present in AltumAge’s reprogramming time course.

Lastly, we investigated whether AltumAge could capture a rejuvenation event caused by transient expression of several reprogramming factors in aged human fibroblasts and endothelial cells (GSE142439)62. Sarkar et al.62 demonstrated that several biomarkers such as H3K9me3, SIRT1, HP1γ, and β-galactosidase were restored to a youthful state after treatment. Horvath’s model was able to capture a decrease in epigenetic age (linear mixed-effects model p-value = 0.004, Fig. 7e). Interestingly, there was no difference before and after the intervention according to AltumAge (linear mixed-effects model p-value = 0.966, Fig. 7b). While the researchers tracked the cells until six days after treatment, it is possible that the apparent restoration of youthful biomarkers would not endure. Indeed, studies have shown that transient reprogramming causes only temporary rejuvenation63,64,65. Altogether, AltumAge performs as well as Horvath’s model for most cases and, more importantly, it captures an expanded global methylation landscape. It can also robustly recognize age-related epigenetic patterns while potentially avoiding overestimating the impact of temporary interventions and detecting interventional artifacts.

## Discussion

The creation of new quantitative aging measurements has been rapidly expanding with the burgeoning field of the biology of aging. Epigenetic clocks are a tool that can aid researchers to understand better and to measure the aging process. In 2013, Horvath showed it was possible to use just a couple of CpG sites to predict a person’s age based on DNA methylation accurately. It was a giant leap in the field. However, his 2013 ElasticNet model or other versions of linear models are still widespread despite recent advances in machine learning. The accuracy of such linear models was so good that it was difficult to imagine a model significantly outperforming it66. Other deep learning methods, which slightly outperform ElasticNet, have focused thus far only in a single tissue type1617.

We show that our neural network-based model, AltumAge, is an overall better age predictor than ElasticNet. While our more comprehensive and larger data does improve the performance, the capability of neural networks to detect complex CpG-CpG interactions and the expanded feature set with 20,318 CpG sites also contribute to its lower error. For within-data set prediction - which is the case for several studies which create a new epigenetic clock - AltumAge performs drastically better than state-of-the-art methods. Even for LOOCV analysis, while the improved performance of AltumAge over ElasticNet was not as substantial, it performed better in older ages and new tissue types. Arguably, a more generalizable model like AltumAge can better capture pan-tissue age-related changes.

Deep learning models have shown promise in several biological tasks, given their good performance on unstructured data. They have been for many years seen as “black-box” models, but new interpretation methods have made it possible to get interesting insights for them. Our interpretation of AltumAge provides a detailed relationship between each of the 20,318 CpG sites and age, showing that while most CpG sites are mostly linearly related with age, some important ones are not. Given recent advances in epigenetic editing1, finding these DNA methylation sites to delay or reverse aging may be necessary for future interventions to tackle the disease. AltumAge allied with other deep learning inference methods can provide information on highly interacting CpG sites. Sometimes the primary locus of an epigenetic editing intervention, given its place in the genome, may be difficult to target because of the chromatin structure. Consequently, knowing secondary CpG sites that affect how the CpG of interest interacts with age could guide such interventions. We show that one can obtain biological hypotheses for the same from the data using AltumAge. For example, we observe that cg19722847 located inside the gene IPO8 could regulate cg10523019, which lies in RHBDD1. Analysis of ChromHMM annotations shows that the top-ranking CpG sites are associated with gene regulatory regions and CTCF binding sites. Finally, we highlight the age-related KEGG pathways obtained for genes with these CpG sites, indicating that the model is learning valuable biological information from the data.

We also explore how age acceleration as determined by AltumAge has potentially meaningful biological applications. AltumAge predicts higher age acceleration for cells with immune and mitochondrial, similarly to Horvath’s model. Moreover, AltumAge displays a higher correlation between cell passage number and predicted age. More importantly, AltumAge, in contrast to Horvath’s model, predicts higher age acceleration for cancer. It seems that epigenetic clocks highly predictive of mortality show this behavior consistently53. Age acceleration in tumors can be thought of as a further deviation from the original state in Waddington’s landscape. Intriguingly as well, AltumAge does not uncover a rejuvenation event from transient reprogramming62. This observation is possibly due to the temporary nature of the rejuvenation event, which may not globally change age-related DNA methylation patterns. Since AltumAge considers a much larger portion of the epigenome, it may be more resistant to detecting transient events. Another contrast with Horvath’s model is that during a reprogramming time course study, it does not show an initial increase in predicted age, likely an artifact.

In future work, it would be interesting to create a deep learning model with Illumina’s EPIC array with the roughly 850 thousand CpG sites to understand more deeply how genomic location can affect influence in aging. In addition, by having several CpG sites in a single gene, it is also possible to better understand how methylation in different positions may affect the contribution of a particular gene to the aging process. Currently, however, there are only a few EPIC array data sets publicly available. Another interesting application for deep learning in the aging field is the creation of epigenetic clocks that directly predict mortality. Currently, the state-of-the-art mortality predictor is GrimAge67, which was create based on a linear Cox proportional hazards model. We anticipate that using neural networks to include non-linear relationships and CpG-CpG interactions would result in a better lifespan predictor.

Overall, we have shown that deep learning represents an improvement in performance over current approaches for epigenetic clocks while at the same time providing new, relevant biological insights about the aging process.

## Methods

### DNA methylation data sets

For model training and testing, we gathered normal tissue samples from 142 publicly available DNA methylation data sets from the Gene Expression Omnibus, Array Express, and The Cancer Genome Atlas, comprising of the platforms Illumina Infinium HumanMethylation27 and the Illumina Infinium HumanMethylation450. All selected data sets had both processed beta values and age available for all samples. Missing values per data set were imputed with a KNN imputer from scikit-learn. Next, the data was normalized using the beta mixture quantile normalization (BMIQ) with the optimized code from Horvath, called BMIQCalibration3,68. 13 data sets contained tumor samples which were separated for later analysis. Samples that failed BMIQ normalization were removed. Then, each data set was split 60% for training (n = 8050) and 40% for testing (n = 5455). The within-data set split ensures the distribution of age, gender, and tissue type between training and testing sets are unbiased (Supplementary Figure 4). In the training set, the data was further randomly subdivided by data set, with 85 (n = 4394) for model selection and 57 (n = 3656) for validation. After the best model was selected, it was trained in the full training set (n = 8050) and analyzed in the test set. Supplementary Figure 1a shows a schematic of the division of the data. The full list of data sets used is available in Supplementary Note 1.

For twelve data sets in which gestational week was available, the encoding for age is the following:

$$y=7* \frac{w-40}{365}$$
(1)

where w is the gestational week, and y is the age in years. A gestational week below 40 would have negative age; for instance, 30 weeks would be encoded as 7*(30 − 40)/365 = − 0.192. In the US in 2013, the average birth occurred at an estimated 38.5 weeks69. This number has changed slightly over time, and since preterm deliveries skew the average more than late-term births, we considered gestational week 40 as age 0 in such data sets.

For the cancer age acceleration analysis, we compared the test set of 13 data sets with the aforementioned separated cancer samples. These data sets were GSE32393, GSE37988, GSE26126, GSE63384, GSE59157, GSE32867, GSE30759, GSE31979, GSE77955, GSE52068, GSE49149, GSE39004. We further added GSE53051, which contains normal and tumor samples from five tissue types, for the analysis. It is worth noting that analyzing GSE53051 separately did not change the outcome of higher tumor age acceleration predicted with AltumAge vs. no difference with Horvath’s model. In total, we compared 434 normal and 1856 cancer observations across 10 different tissue types (Figure 6).

Lastly, a brief description of all data sets used, either for training, validation, testing, or any other analysis, is in Supplementary Note 1. Any sample that was excluded is explicitly mentioned there alongside brief notes.

### CpG site annotation

For the annotation of CpG sites, GENCODE and Zhou et al’s annotations were used70,71. 41 data sets from ENCODE with the 18-state ChromHMM information were gathered37: ENCFF717HFZ, ENCFF718AGZ, ENCFF371WNR, ENCFF318XQO, ENCFF340OUL, ENCFF893CAJ, ENCFF151PZS, ENCFF098CED, ENCFF273PJW, ENCFF377YFI, ENCFF773VYR, ENCFF928QES, ENCFF786HDE, ENCFF827FZN, ENCFF364PIY, ENCFF802QCI, ENCFF021NNN, ENCFF510ZEI, ENCFF175NGE, ENCFF670DBL, ENCFF825ZCZ, ENCFF912ILE, ENCFF725WBV, ENCFF829SZB, ENCFF483NRC, ENCFF717RYX, ENCFF249ZBG, ENCFF205OTD, ENCFF765OKG, ENCFF820YPQ, ENCFF685BMF, ENCFF545ZMG, ENCFF294UQS, ENCFF104ZSA, ENCFF370EGY, ENCFF860FWW, ENCFF177TTP, ENCFF151ZGD, ENCFF743GHZ, ENCFF990YHL, and ENCFF036WIO. Since AltumAge is a pan-tissue clock, the mode of each state was chosen for each CpG site.

### Model selection

Multiple machine learning models were tested in the validation set. The evaluation metrics were median absolute error (MAE), mean squared error (MSE), Pearson’s correlation coefficient (R), and median error.

To validate and train the models, the beta value of each CpG site was scaled with a robust scaler which removes the median and scales according to the interquantile range. A robust scaler was chosen as opposed to a standard scaler (mean = 0, var = 1) to better resist distortions caused by outliers. The robust scaler was fitted with the training data of each model simply as a step before model training. In addition, only 20,318 CpG sites common to all three platforms Illumina Infinium HumanMethylation27, Illumina Infinium HumanMethylation450, Infinium Methylation EPIC were chosen as features.

The non-neural network models trained with scikit-learn were support vector regression (with an RBF kernel) and random forest with the standard hyperparameters. ElasticNet, trained with glmnet, used the built-in λ optimization with parameters alpha = 0.5 and n_splits = 10. Moreover, for ElasticNet, Horvath’s age transformation was used3.

To select the best performing neural network with tensorflow, we conducted a grid search with the following hyperparameters: number of hidden layers (2, 5, or 8), number of neurons per dense layer (32 or 64), activity and kernel L2 regularization coefficients (0, 0.0034, or 0.0132), dropout (0 or 0.1), Gaussian dropout (0 or 0.1), batch normalization (yes or no), activation function (ReLU or SeLU), and learning rate (0.0002, 0.0005, or 0.001). The following parameters were held constant: optimizer (Adam), batch size (256), number of epochs (300), loss function (MSE), and learning rate decay by a factor of 0.2 after a 30-epoch plateau in the training loss. The weights with the lowest training loss were chosen. After selecting the best hyperparameters, we trained the neural network with adversarial regularization with neural_structured_learning with multiplier = 0.05, adv_step_size = 0.005. The main idea of adversarial regularization is to train a model with adversarially-perturbed data (called adversarial examples) in addition to the original training data. These adversarial examples are constructed by intentionally adding noise and cause the model to mispredict. By training with such examples, the model learns to be robust against adversarial perturbations (or noise) when making predictions.

We dubbed the best performing deep neural network as AltumAge. It consists of 5 hidden layers, 32 neurons per layer, activity and kernel regularization coefficients of 0.0034, no dropout, gaussian dropout of 0.1, with batch normalization, SeLU activation, and learning rate of 0.001. AltumAge was also tested in the validation set with a smaller set of features using the CpG sites selected from the ElasticNet.

To compare our deep learning approach with other state-of-the-art neural networks, we tried TabNet, an attentive interpretable tabular learning method19. Similarly, TabNet was trained for 300 epochs.

The results for the validation set with the full list of models, including the replication of Horvath’s model3, is in Supplementary Table 1. Support vector regression was by far the worst performer (MAE = 14.229, MSE = 458.956), followed by random forest (MAE = 6.833, MSE = 165.354). All other models displayed R > 0.9, with AltumAge having the lowest MAE (3.563) and MSE (57.071).

### SHAP and DeepPINK

To obtain the SHAP values for AltumAge, the function GradientExplainer from shap was used on the test set. For the DeepPINK importance values and feature selection, the standard architecture and number of epochs was used28. To create the knockoff features for DeepPINK, the function knockoff.filter from the R 4.0.2 package knockoff version 0.3.3 was used with the importance statistic based on the square-root lasso.

Both SHAP and DeepPINK importance values were normalized so that their sum would equal to 100. Each importance value thus represents a percent contribution of a certain feature.

### Statistical analysis

All statistical tests were conducted with packages scikit-learn, statsmodel, or scipy. Tests that were one-sided were explicitly mentioned in the main text; all others were two-sided.

To assess the performance of the models, we used median absolute error (MAE), mean squared error (MSE), Pearson’s correlation coefficient (R), and median error. To compare the performance of ElasticNet and AltumAge in the within-data set setup, we also used Alpaydin’s Combined 5x2cv F test72, which combines several metrics to compare the statistical significance in error between two models across different cross-validation splits. This ensures that any improved performance is not simply due to a favourable random split.