Applications of Bayesian network models in predicting types of hematological malignancies

Network analysis is the preferred approach for the detection of subtle but coordinated changes in expression of an interacting and related set of genes. We introduce a novel method based on the analyses of coexpression networks and Bayesian networks, and we use this new method to classify two types of hematological malignancies; namely, acute myeloid leukemia (AML) and myelodysplastic syndrome (MDS). Our classifier has an accuracy of 93%, a precision of 98%, and a recall of 90% on the training dataset (n = 366); which outperforms the results reported by other scholars on the same dataset. Although our training dataset consists of microarray data, our model has a remarkable performance on the RNA-Seq test dataset (n = 74, accuracy = 89%, precision = 88%, recall = 98%), which confirms that eigengenes are robust with respect to expression profiling technology. These signatures are useful in classification and correctly predicting the diagnosis. They might also provide valuable information about the underlying biology of diseases. Our network analysis approach is generalizable and can be useful for classifying other diseases based on gene expression profiles. Our previously published Pigengene package is publicly available through Bioconductor, which can be used to conveniently fit a Bayesian network to gene expression data.


Supplementary File 5: RNA-Seq analysis on the BCCA dataset
We constructed a retrospective AML and MDS sample cohort with some known prior karyotype and genetic testing information. Samples were selected so as to encompass a broad range of myeloid malignancy subtypes. We used only AML-NK and MDS in this study.

Library Preparation and Sequencing Sample Acquisition and Ethics
Peripheral blood and bone marrow samples were obtained from consenting patients via the Hematology Cell Bank of British Columbia (http://hematology.med.ubc.ca/research/hematologycell-bank-of-bc/). Ethics protocols were all approved by the BCCA REB, under protocols H04-61292, H09-01779, H11-01484, and H13-02687.

RNA Extraction and Library Construction
RNA was manually extracted from bone marrow or peripheral blood using Qiagen Allprep kits. Total RNA samples were checked using Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip. Samples that passed quality check were arrayed into a 96-well plate. Following this, polyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from total RNA with on column DNaseItreatment as per the manufacturer's instructions. The eluted polyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).
Double-stranded cDNA was synthesized from the purified polyA+ RNA using the Maxima H Minus First Strand cDNA Synthesis Kit (Thermo Fisher Scientific Inc., USA) and random hexamer primers. Quality passed cDNA plate was fragmented by Covaris LE220 for 2x65 seconds at "Duty cycle" of 30%. The paired-end sequencing library was prepared following the BCCA Genome Sciences Centre paired-end library preparation plate based library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using magnetic beads and 3' A-tailing by Klenow fragment (3' to 5' exo minus). After cleanup, adapter ligation was performed. The adapter-ligated products were purified using magnetic beads, then UNG digested and PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc., USA) using Illumina's PE primer set in a single reaction, with cycle condition 37˚C 15min, 98˚C 1min followed by 13 cycles of 98˚C 15 sec, 65˚C 30 sec and 72˚C 30 sec, and then 72˚C 5min. The PCR products were purified and size selected using magnetic beads, checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA) and quantified with the Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen). Libraries were normalized and pooled. The final concentration was double checked and determined by Qubit dsDNA HS Assay for Illumina Sequencing.

Sequencing
For the first retrospective RNA-Seq cohort, we sequenced a single library per Illumina HiSeq 2000 lane, using 2x75bp reads, which resulted in approximately 400 million reads per library. 92 samples were initially submitted for library construction, of which 89 were successfully prepared and sequenced.
For the second retrospective RNA-Seq cohort, we sequenced two libraries per Illumina HiSeq 2000 lane, using 2x75bp reads, which resulted in approximately 200 million reads per library. 92 samples were initially submitted for library construction, of which 87 were successfully prepared and sequenced.

Pipeline Overview
All samples were processed using customized in-house bioinformatics pipelines. The WGS and WES data was processed by the Bioinformatics Core at the BCCA Genome Sciences Centre, while the RNA-Seq data was processed by the Centre for Clinical Genomics Informatics group and the authors.

RNA-Seq Quality Control
We used RNA-SeQC (DeLuca et al., 2012) to gather quality metrics for the RNA-Seq libraries. This tool gathers standard sequence quality metrics, such as the count of uniquely aligned and duplicate reads, and RNA-specific metrics such as the proportion of rRNA reads, strandedness of read alignments, and the proportion of reads aligning to exonic, intronic, intragenic, and intergenic regions.

Expression Quantification
Expression quantification was performed for all RNA-Seq and ssRNA-Seq libraries using sailfish version 0.9.0 (Patro, Mount, & Kingsford, 2014), using RefSeq gene models downloaded as GTF from the UCSC genome browser on 2014-08-21, with gene models from non-standard chromosome sequences removed. Both isoform-and gene-specific quantifications were generated, and raw estimated counts, as well as transcripts-per-million (TPM), and RPKM estimates were used in downstream analysis.