Stratified computational meta-analysis of 2213 acute myeloid leukemia patients reveals age- and sex-dependent gene expression signatures

In 2018 alone, an estimated 20,000 new acute myeloid leukemia (AML) patients were diagnosed, in the United States, and over 10,000 of them are expected to die from the disease. AML is primarily diagnosed among the elderly (median 68 years old at diagnosis). Prognoses have significantly improved for younger patients, but in patients older than 60 years old as much as 70% of patients will die within a year of diagnosis. In this study, we conducted stratified computational meta-analysis of 2,213 acute myeloid leukemia patients compared to 548 healthy individuals, using curated publicly available data. We carried out analysis of variance of normalized batch corrected data, including considerations for disease, age, tissue and sex. We identified 974 differentially expressed probe sets and 4 significant pathways associated with AML. Additionally, we identified 70 sex- and 375 age-related probe set expression signatures relevant to AML. Finally, we used a machine learning model (KNN model) to classify AML patients compared to healthy individuals with 90+% achieved accuracy. Overall our findings provide a new reanalysis of public datasets, that enabled the identification of potential new gene sets relevant to AML that can potentially be used in future experiments and possible stratified disease diagnostics.


INTRODUCTION 31
Acute myeloid leukemia (AML) is a heterogeneous malignant disease of the neoplasms and acute leukemia classification system 5 , AML prognosis criteria for 46 therapeutic strategy for younger patients 13,14 . However, in patients older than 60 70 years old, prognoses remain grim and therapeutic strategy has been nearly the 71 same for more than 30 years 2,6,13-15 . Approximately 70% of AML patients 65 72 years of age or older die within a year following diagnosis 16 . While it is apparent 73 that the nature of AML changes with age, still little is known about the extent of 74 these associations and how they vary with patient's age 14,17,18  across multiple, with potential prognostic features. In this study, we performed 85 comprehensive gene expression meta-analysis of 2213 acute myeloid leukemia 86 patients and 548 healthy subjects using 34 publicly available gene expression 87 microarray datasets (following strict inclusion criteria) to identify disease, sex-88 and age-related gene expression changes associated with AML. We identified 89 sex-and age-related gene expression signatures that show similar alteration in 90 gene expression levels and associated signaling pathways in AML and have 91 used our results (gene sets) to predict AML or healthy status. We believe that our 92 results may lead to improved AML early detection and diagnostic testing with 93 target genes, which collectively can potentially serve as sex-and age-dependent 94 biomarkers for AML prognosis compared to healthy, as well as the identification 95 of new treatment targets with mechanisms of action different from those used in 96 conventional chemotherapy 97 98 RESULTS 99

Data curation and gene expression preprocessing. 100
We searched the Gene Expression Omnibus (GEO) public repository, based on 101 our systematic workflow and inclusion criteria, Fig. 1a-b. Overall, 2,132 datasets 102 were screened, 643 selected (577 were excluded as non-Affymetrix, various 103 platform arrays). From the 66 remaining, 34 studies were excluded due to lack of 104 metadata, non-peripheral blood and non-bone marrow tissues, cell line or cell-105 type specific, treated subjects). After this curation we obtained 34 age-annotated 106 gene expression datasets from 32 different studies covering 2,213 AML patients 107 and 548 healthy individuals. The sets were re-analyzed, starting from raw data, 108 to perform a gene expression analysis of variance and functional pathway 109 enrichment analysis (see online Methods).

Batch correction 133
Our pre-processed data, AML and healthy, were processed using a "dataset-134 wise" batch effect correction approach. The datasets used in this study did not 135 include within-study healthy controls, which would limit analysis of variance, and 136 particularly the ability to separate biological from batch effects. To address this, 137 we implemented an iterative batch effect correction approach, essentially 138 employing a weight-based method for correcting batch effects. Assuming the 139 batch effects due to each data set is a function of the number of samples in the 140 data set (weight), normalizing sets of unevenly sized datasets may lead to 141 unbalanced batch correction. We used 5 additional datasets as a reference set, 142 which we refer to as "covariate" hereafter. Each of the covariate reference 143 datasets included within study healthy controls. All 5 datasets together consisted 144 of a total 613 arrays (455 AML and 158 healthy) ( Table 2), and pre-processed 145 exactly as our curated data sets. These were used together with each of the 146 remaining datasets to batch correct each dataset with respect the covariate 147 reference using ComBat 23 . After this dataset-wise correction, the 5 covariate 148 reference datasets were removed, and our expression data were clustered using    Gene expression meta-analysis was also used to identify DEPS that show sex 226 differences between male AML patients as compared to female AML patients. 227 266 DEPS were regarded statistically significant (p-value < 2.2x10 -7 ). A list of all 228 266 DEPS (including whether higher in either males or females, gene title and 229 symbol, male-female mean difference, and Bonferroni corrected p-value) can be 230 found as Supplementary Table S3 online. 70 DEPS were found to overlap 231 between analysis 1 (AML disease state) and analysis 2 (Sex-relevance in AML). 232 cancer 1, and PMAIP1 in p53 signaling pathway 1, and MS4A1 was higher in 246 females and found in Hematopoietic cell lineage pathway (Table 3). Figure 3d  247 shows GO analysis results, where 15 overrepresented biological GO terms were 248 overlapped, including terms for extracellular space, immune response, protein 249 binding, spindle, and midbody. The entire list of our enrichment analysis 250 (statistically significant KEGG and GO terms) can be found as Supplementary 251 Table S4. 252 253 Analysis 2b. Age-dependent differential gene expression meta-analysis and 254 associated signaling pathways in AML. 255 The subjects were binned in 8 age-groups: 0-19, 20-29, 30-39, 40-49, 50-59, 60-256 69, 70-79, and 80-100 years old. From this meta-analysis, 1395 unique probe 257 sets across all age-groups were identified as statistically significant (Bonferroni 258 adjusted p-value < 2.2x10 -7 ) (Supplementary Table S5). From these 375 unique 259 DEPS (372 unique gene symbols) were found to overlap with the 974 DEPS 260 probe sets from our AML disease state meta-analysis, accounting for an overall 261 1400 binary comparisons between the multiple age groups deemed statistically 262 significant, based on Tukey HSD tests between age-group pairs. The entire list of 263 1400 identified pairwise differences between age groups and associated probe 264 set/gene information can be found as Supplementary Table S6   To investigate further the progression with age, pairwise correlations between 273 age-groups were computed. The 0-19 age-group was used as a common 274 comparison reference with respect to other groups. Using this 0-19 group as a 275 baseline, Figure 4d shows the mean difference of 25 DEPS with respect to the 0-276 19 baseline across all other groups. The mean difference values between AML 277 and healthy are shown in the right-most column of Fig. 4a In the present study, we aimed to establish, disease sex-linked and age-289 dependent biomarkers from genes with similar changes in gene expression 290 levels and associated signaling pathways relevant to AML. Utilizing microarray 291 gene expression data and combined with various machine learning models, 292 respectively, our biomarkers were indicative of prognostic signature for AML 293 prediction compared to healthy with 90+% achieved accuracy. We re-analyzed 294 data aggregated from our curation of 34 publicly available microarray gene 295 expression datasets covering 2213 AML patients and 548 healthy individuals to 296 identify changes in AML gene expression associated with disease state (AML 297 compared to healthy), sex-linked (male compared to female), and age-dependent 298 (across age-groups compared to baseline). 299 We performed 3 differential probe set (gene) expression and gene enrichment 300 analyses, as discussed below. We note here that our study identified multiple 301 potentially significant DEPS, with age and sex related differences associated with 302 AML. While our findings may generate further hypothesis-driven investigations, 303 we need to also identify the study's limitations: primarily the analysis of AML and 304 healthy subjects involved bone-marrow and blood samples respectively in each 305 disease group. We tried to account for this utilizing tissue as an effect in our 306 linear model, and including multiple interactions. Other limitations include an 307 unbalanced AML/healthy ratio, as well as the lack of in-study healthy controls. To 308 address these we attempted to account for batch effects using a dataset-wise 309 iterative batch correction transformation, as discussed in the methods. Finally, 310 we also included binary interactions between the factors in the analysis to 311 account for interaction-related confounding effects.  Table 2 with their respected Tukey's HSD mean difference and 323 Bonferroni p-adjusted values. As shown in Figure  were found statistically significant in this analysis, with 70 found to overlap with 375 the DEPS from Analysis 1 (Fig 3a-b). The top10 up-and down-regulated DE 376 genes with respect to females include (Fig. 3c)  (Protein Kinase X-Linked) were as higher in females. These genes are known to 382 be sex-specific and show such differences and sex separation within the AML 383 and the healthy groups respectively (Fig. 3d). The role of these genes as positive 384 controls in studies with AML needs to be investigated further. We also reported 385 sex and AML known genes that were statistically significant in our analysis, 386 including FLT3 and MAL. 387 388 iii) Analysis 2b: Age-dependent gene expression meta-analysis and associated 389 signaling pathways in AML compared to healthy individuals, was carried out to 390 identify common set of age-dependent genes and associated signaling pathways 391 and to explore age-dependent trends in gene expression in AML. The age-392 dependent meta-analysis in AML using ANOVA, identified 1,395 DEPS 393 (Bonferroni adjusted p-value <0.01). To identify age-related DEPS in AML we 394 overlapped the 1,395 DEPS to our findings of 974 DEPS in AML disease state 395 (Analysis 1) (Fig. 4a), and identified an overlap of 375 DEPS (Bonferroni 396 adjusted p.value <0.01). As shown in Figure 4b, the top 10 most and least DE 397 age-associate genes in AML according to the mean difference values in seven 398 age-groups, including their corresponding values from AML disease state in 399 column "AML -healthy" for comparisons. Interestingly, CRISP3 was among the 400 down regulated genes specifically and involved in this analysis as well, 401 specifically associated with differences in younger age groups, 20 to 49 years of 402 age as compared to 0 to 19 age group. Other genes showing age-specific 403 differences included HOXA3, HOXA5 and HOXA10-HOXA9, which belong to the 404 homeobox genes (HOX) family of transcription factors, essential to embryonic 405 development and hematopoiesis, and associated with chromosomal 406 abnormalities translocation and over-expression in AML 44,45 . Also identified with 407 age-specific DE, was ORM1, which in Analysis 1 was among the top-10 most 408 under-expressed genes, and was also among the 70 DE genes in analysis 2a. 409 ORM1's direct role in AML also merits further investigation, given ORM1 410 involvement in immunosuppression and inflammation 46  The generalized workflow consisted of five main steps: i) Curation of microarray 427 gene expression data, ii) Preprocessing of raw data files followed by batch effect 428 correction, iii) Predictions of missing annotation data using supervised machine 429 learning, iv) Differential gene expression analysis, and v) Gene enrichment for 430 pathway analysis that includes gene annotation, and finally gene expression-431 based prediction of AML (Fig. 1a). Python script was used that utilized functions from the Entrez Utilities from 439 Biopython 48 . We used the script to navigate the GEO records, and download 440 microarray gene expression datasets up to 10/18. We additionally utilized Python 441 packages, including Pandas, NumPy, and Matplotlib for data structure, numerical 442 computing for data processing, and data visualization respectively. We used 443 strict inclusion criteria to maintain consistency in each dataset selection, screen 444 for availability of both raw and meta-data annotation files provided, human 445 samples used from untreated subjects, and that the sample source was from 446 either bone marrow (BM) and/or peripheral blood (PB). Array platform was 447 restricted to Affymetrix, which was found to have the most available data, and to 448 avoid cross-platform normalization issues. Inclusion criteria and the data curation 449 workflow are illustrated in Fig. 1 a-  The trained models for classification of missing sex and sample source 515 annotation from curated data achieved > 95% classification accuracy with ~ 3-5% 516 classification errors. Confusion matrix details, model accuracy and error for 517 training and testing are presented in Supplementary Table S1 online, and results 518 in Supplementary files 1 and 2. To account for training overfitting, we used 10-519 fold cross-validation on all 1,956 gene expression data arrays for training and 520 validation. 521 522

Dataset-wise correction approach for batch effects correction. 523
Batch correction was done using a dataset-wise correction. Here we refer to the 524 term "dataset-wise correction," to indicate performing batch correction iteratively 525 on one dataset at a time, against a reference set of datasets chosen to account 526 for variability. We used this approach to account for the lack within-study healthy 527 controls in the curated gene expression datasets. To address this issue, we used 528 5 additional datasets the included within-study controls, GEO accessions: 529 GSE107968, GSE68172 58 , GSE17054 59 , GSE33223 60 , and GSE15061 61 (Table  530 1B). We refer to the latter datasets hereafter as "covariate" reference datasets, 531 as they were as the reference datasets in the batch correction. Our approach 532 aimed to balance/distribute the weight of batch effects exerted by each dataset, 533 as this is dependent on the number of observations within a given dataset. 534 Combined, the covariate reference datasets included 613 total arrays, totaling 535 455 AML and 158 healthy controls. We used ComBat 23 to correct for study batch 536 effects, as its empirical Bayes-based algorithm uses both scale and mean center 537 based methods, providing an appropriate algorithm 23 . Covariate reference 538 datasets were treated as the covariate for batch during batch correction, to 539 improve performance in correcting for batch effects rather than biological 540 variation. After batch correction, we used principal component analysis (PCA), 541 visualizing components in both 2 and 3 dimensions, to compare the clustering 542 results for corrections. Covariate reference datasets were removed after the 543 batch correction step and were not part of our downstream meta-analysis. 544 ( Supplementary Fig. S1).

Yi ~ (a + s + d + t) + (a:s + a:d + a:t) + (s:d + s:t) + (d:t) + ε, 555
where d is the disease state (AML or healthy), a is age (between 0 to 100 years), 556 s is sex (female or male), t is sample source (BM or PB), and ε is a random error 557 term. We note that the model includes sample source and its interactions to 558 address comparisons involving different tissues in AML and healthy subjects (BM 559 or PB respectively). testing was done on all 5 covariate data sets, include AML and healthy subjects. 586 Dependent, target , and testing data files were prepared in accordance with 587