RUNX1 mutations contribute to the progression of MDS due to disruption of antitumor cellular defense: a study on patients with lower-risk MDS

Patients with lower-risk myelodysplastic syndromes (LR-MDS) have a generally favorable prognosis; however, a small proportion of cases progress rapidly. This study aimed to define molecular biomarkers predictive of LR-MDS progression and to uncover cellular pathways contributing to malignant transformation. The mutational landscape was analyzed in 214 LR-MDS patients, and at least one mutation was detected in 137 patients (64%). Mutated RUNX1 was identified as the main molecular predictor of rapid progression by statistics and machine learning. To study the effect of mutated RUNX1 on pathway regulation, the expression profiles of CD34 + cells from LR-MDS patients with RUNX1 mutations were compared to those from patients without RUNX1 mutations. The data suggest that RUNX1-unmutated LR-MDS cells are protected by DNA damage response (DDR) mechanisms and cellular senescence as an antitumor cellular barrier, while RUNX1 mutations may be one of the triggers of malignant transformation. Dysregulated DDR and cellular senescence were also observed at the functional level by detecting γH2AX expression and β-galactosidase activity. Notably, the expression profiles of RUNX1-mutated LR-MDS resembled those of higher-risk MDS at diagnosis. This study demonstrates that incorporating molecular data improves LR-MDS risk stratification and that mutated RUNX1 is associated with a suppressed defense against LR-MDS progression.


DNA and RNA Isolation
For the preparation of the DNA library, DNA from bone marrow (BM) or, if BM was not available, peripheral blood (PB) was isolated using MagCore according to the manufacturers´ recommendations (RBC Bioscience, New Taipei City, Taiwan). DNA concentration was measured by Qubit 3.0 fluorometer (Life Technologies, Carlsbad, CA, USA) and quality was checked using Nanodrop (Thermo Fisher Scientific, Waltham, MA, USA).
For the preparation of the RNA library, BM CD34+ cells were isolated by magnetic separation on an autoMACS Separator (Miltenyi Biotec, Bergisch Gladbach, Germany). Then, RNA was isolated by acid guanidinium thiocyanate-phenol-chloroform extraction. RNA concentration was measured with a Qubit 3.0 fluorometer (Life Technologies) and RNA quality was checked on an Agilent 2100 bioanalyzer (Agilent Technologies, Palo Alto, CA, USA). Only high-quality RNA with an RNA integrity number (RIN) of at least 7.5 was used.

Targeted gene sequencing
50 ng of DNA were used to prepare the indexed library according to the manufacturer's protocol (TruSight Myeloid Sequencing Panel Kit, Illumina, San Diego, CA, USA). Quantification of the prepared library was done with the KAPA Library Quantification Kit for Illumina sequencing platforms (KAPA Biosystems, Wilmington, MA, USA) also according to the manufacturer's protocol. Sequencing was performed on the MiSeq or NextSeq (Illumina). Analysis was performed with NextGene software (SoftGenetics, State College, PA, USA) and in-house pipeline (described separately in the following paragraph). In paired samples, variants with VAF up to 0.01 were detected if the variant was present in VAF more than 0.05 in one of the paired samples. SIFT and Polyphen were used for the prediction of variant effects. Visualization of NGS results was done with R 4.0.2 software (circlize 0.4.13, ComplexHeatmap 2.8.0).

In-house pipeline for analyzing NGS data from TruSight Myeloid Sequencing Panel:
The quality of raw data obtained from high-throughput sequencing was checked by FastQC version 0.11.8. Reads were then trimmed and filtered using Trimmomatic software version 0.39 and resulting files were quality checked by FastQC again. Cleaned up data from DNA sequencing were then mapped to GRCh19 version of human genome using BWA aligner version 0.7.17. The mapped data was further indexed and sorted using the Samtools suite of tools version 1.10 and the percentage of mapped reads was assessed. Samples with a percentage of mapped reads exceeding 95 % were processed using variant calling software Freebayes version 1.3.1. To discover additional insertions and deletions that span across long sections of the genome, Pindel software version 0.2.5b9 was used. The discovered variants in the form of VCF files were then filtered and annotated using the online interface of Ensembl Variant Effect Predictor (VEP). The annotated variants were formatted in the R software version 4.0.2 and then exported to TSV format.

Sanger sequencing
DNA from CD3+ cells was used in PCR reaction with Q5 Hot Start High-Fidelity DNA Polymerase (New England Biolabs, Ipswich, MA, USA) and the PCR reaction was run according to the manufacturer's protocol. The sequences of the primers are in Table SI 1. 10 µl of the PCR product was run on the 1% TAE gel at 80 V to control the PCR reaction. The PCR product was then cleaned with ExaSap-IT PCR Product Cleanup Reagent (Thermo Fisher Scientific) according to the manufacturer's protocol. The next step was to prepare the sequencing PCR reaction with the BigDye Terminator v3.1 Cycle Sequencing Kit (Thermo Fisher Scientific) using 0.5 μl of the PCR product. The PCR products were then cleaned with the DyeEx 2.0 Spin Kit (QIAGEN, Venlo, The Netherlands) according to the manufacturer's protocol. Finally, the sample was analyzed on the ABI 3500 (Thermo Fisher Scientific) and the sequences were visualized on Sequencing Analysis Software 5.4 (Thermo Fisher Scientific).

RNA sequencing
100 ng of total RNA from CD34+ cells of 8 LR-MDS patients with RUNX1 mutation, 29 LR-MDS without RUNX1 mutation, 20 HR-MDS and 13 healthy controls (SI 3) was ribodepleted with the RiboCop rRNA Depletion Kit (Lexogen, Wien, Austria). Sequencing was performed on HiSeq 2500 or NovaSeq (Illumina). Raw reads in the form of FASTQ files were trimmed and filtered using Trimmomatic 0.39 and their quality was assessed using FastQC 0.11.8. The quantification of gene expression was performed by StringTie2 software 1.3.6. The filtered reads were mapped to human genome GRCh38.p13 using STAR 2.7.2b. The quantification of gene expression was performed by StringTie2 software 1.3.6. For analysis and visualization of expression data, several packages in R software 4.0.2 (e.g. edgeR 3.30.3, pheatmap 1.0.12, ggplot 3.3.2, pcaMethods 1.84.0, ComplexHeatmap 2.8.0) and GraphPad Prism 7 software (GraphPad Software, La Jolla, CA, USA) were used. Databases such as Gene Ontology, KEGG Pathways, Reactome Pathways, and ConsensusPathDB were used for functional enrichment analysis.

Machine Learning
Genes with negligible mutation occurrence (mutated in fewer than 6 subjects) were grouped into one category (REST). The variables were therefore: ASXL1, CUX1, DNMT3A, EZH2, JAK2, PHF6, RUNX1, SETBP1, SF3B1, SRSF2, STAG2, TET2, TP53, U2AF1, ZRSR2, REST. In multivariate Cox regression with stepwise backward feature selection, the Aikake information criterion was used by default (rms R library, fastbw function). It is a heuristic criterion, and its application led to very small models of only one gene. Then, after the cross-validated experiment, the model was adjusted for the optimal number of features according to the D value and the general recommendations for the number of events per variable (EPV) in survival regression models (Peduzzi et al. 1995). In our data, we had 214 subjects, 81 death events; then the EPV should be around 10, which means we should have not worked with more than 8 features. Lasso regression (elastic networks) that worked with L1 norm and minimized the number of features was used. The optimal parameterization/number of features was set in crossvalidation again, the model quality was measured with the Harrell's C-index (the concordance index, its value is between 0 and 1). We got two recommendations, lambda.min (optimum) and lambda.1se (a regularized model near optimum result).

Machine Learning -Cross-Validation
In the cross-validation experiments of SBFS, the maximum D-value was approximately 0.17 for data1 and 0.18 for data2 in OS, respectively. The maximum D-value for both PFS datasets was approximately 0.25. In Table SI 11B, the most significant genes are listed in the number that should be ideal for individual datasets according to the cross-validated value of the D measure. Extending the model with comutational data, the maximal D value increased in OS but decreased in PFS (SI 14C).
In cross-validating EN, the highest C-index was around 0.6 in all analyses (SI 12A) and the number of optimal features is specified within the table of results (SI 12B). Comutational data did not improve the C-index.

Machine Learning -Individual hazard ratio model
We used our data to create a hazard ratio model to count the hazard ratio for individual patients.
The predictions corresponded to logarithmic relative hazards: where xi denoted the i th covariate (mutations in our case), βi the i th coefficient (the effect size of the given covariate) and the individual k' represented the baseline (average) individual. Counting the hazard ratios between individuals, h0(t) became unimportant since it remained the same for all the individuals. Therefore, we were able to count the relative hazard ratio for individual patients from individual analysis.