Main

Epigenetic mechanisms associated with DNA methylation of cytosine residues at CpG dinucleotides have a central role in normal human development and disease (Baylin and Jones, 2011). Advancements in high-throughput assessment of DNA methylation using microarrays or second-generation sequencing-based approaches have enabled the quantitative profiling of DNA methylation of CpG loci throughout the genome. As well as profiling the methylome of tumour compared with normal tissue, this has ushered in the era of epigenome-wide association studies, analogous to the genome-wide association studies, aimed at understanding the epigenetic basis of complex diseases such as cancer. The promise of methylation profiling in improving our understanding of cancer coupled with the current trend of decreasing cost and increasing coverage of DNA methylation microarrays has brought about a surge in the use of these technologies.

Here, we aim to provide both a review of issues encountered in the processing and analysis of array-based DNA methylation data and a summary of recent approaches proposed for handling those issues. Excellent reviews in the field of epigenetics and technical aspects of array-based assessment of DNA methylation are available, although this is a constantly developing research area (Laird, 2010; Petronis, 2010; Baylin and Jones, 2011; Rakyan et al, 2011; Bock, 2012). We seek to update perspectives on statistical issues that arise in the processing and analysis of array-based DNA methylation data (Siegmund, 2011), highlighting more recent methods proposed for this purpose. The subheadings shown in Figure 1 form the basis for the topics highlighted in this review. Our aim is to help researchers understand the growing body of statistical methods for array-based DNA methylation data, focusing on those freely available in open-source environments such as R or Bioconductor (Table 1). For this review, we chose to focus on Illumina’s BeadArray assays; however, many of the general considerations described here are applicable to other array technologies. We also aim to counter some of the perceived limitations of these arrays, namely the perception that analysis of microarray data yields too many ‘false positives’ (Ioannidis, 2007). We present the viewpoint that appropriate experimental design and downstream data processing and analysis pipelines will enable DNA methylation to be appropriately analysed and will help in understanding the pathogenesis of human disease.

Figure 1

Methylation array data processing and analysis pipeline.

Table 1 R/Bioconductor packages for the processing and analysis of array-based DNA methylation data

Illumina BeadArray technology for methylation

Illumina adapted its BeadArray genotyping technology to recognise bisulphite-converted DNA for the interrogation of DNA methylation (Bibikova et al, 2011). The Illumina BeadArray assays use oligonucleotide probes conjugated to beads to interrogate specific target sequences, with multiple beads measured per bead type. Each bead type is summarised by the average signal for the methylated (M) and unmethylated (U) alleles, which are used to compute the β-value:

$$\beta = \frac{\max(M, 0)}{\max(M, 0) + \max(U, 0) + 100}$$

where the constant offset of 100 stabilises the β-value when both the M and U intensities are low.

A β-value of 0 equates to an unmethylated CpG site and 1 to a fully methylated CpG site. Illumina has developed three platforms for array-based assessment of DNA methylation: GoldenGate, Infinium HumanMethylation27 and the Infinium HD 450K methylation array, which all use two fluorescent dye colours but differ in the chemistries used to recognise the bisulphite-converted sequence; however, we will focus on the Infinium arrays for the rest of this work, as the GoldenGate array has been phased out of production. Furthermore, Illumina has developed its GenomeStudio software (Bibikova et al, 2011), which enables basic data analysis; however, for more in-depth analysis, many tools have been developed, as we discuss below.

Quality control of samples

The Infinium arrays include several control probes for determining the data quality, including sample-independent and -dependent controls (Illumina, 2011). To detect poorly performing samples in Illumina arrays, diagnostic plots of control probes in GenomeStudio are often used (Bibikova et al, 2011), and the R-package HumMethQCReport (Mancuso et al, 2011) also provides these plots. Figure 2 shows hybridisation and bisulphite conversion plots for 450K data in the green channel. Although the sample-independent and -dependent controls can be visually inspected to identify poorly performing samples, an alternative approach involves using the raw signal intensities of the control probes and determining whether they are beyond the expected range (e.g., median±3 s.d.) of the signal intensities across all samples.
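
A minimal sketch of the outlier screen described above is given below; ctrl_intensities is a hypothetical matrix of raw control-probe intensities (control probes in rows, samples in columns) that would be extracted from the array data, and the ±3 s.d. rule follows the text.

```r
## Sketch: flag samples whose summarised control-probe intensity lies
## outside median +/- 3 s.d. across all samples.
## 'ctrl_intensities' is assumed to be a matrix of raw control-probe
## intensities (rows = control probes, columns = samples).
flag_outlier_samples <- function(ctrl_intensities, n_sd = 3) {
  sample_means <- colMeans(ctrl_intensities, na.rm = TRUE)  # summarise each sample
  centre <- median(sample_means, na.rm = TRUE)
  spread <- sd(sample_means, na.rm = TRUE)
  data.frame(sample = colnames(ctrl_intensities),
             mean_intensity = sample_means,
             flagged = abs(sample_means - centre) > n_sd * spread)
}
```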

Figure 2

Quality control example from GenomeStudio 450K data. (A) Hybridisation quality control plot in the green channel. (B) Bisulphite conversion quality control plot in the green channel. In this example, the separation between high and low values indicates that hybridisation worked well. Furthermore, bisulphite conversion also performed well as converted controls have a higher signal than unconverted controls.

Other options for quality control of samples, which make use of detection P-values, are available in R and Bioconductor packages, such as the preprocessing and analysis pipeline (Touleimat and Tost, 2012), IMA (Wang et al, 2012), Minfi (Hansen and Aryee, 2013) and MethyLumi (Davis et al, 2011).
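
As an illustration of the detection P-value approach, the sketch below uses minfi's detectionP() function; rgSet is assumed to be an RGChannelSet read from the raw IDAT files, and the 0.01 P-value cut-off and 10% failure fraction are illustrative choices rather than package defaults.

```r
## Sketch: detection P-value based sample QC with minfi.
library(minfi)
detP <- detectionP(rgSet)                # probes x samples matrix of detection P-values
fail_fraction <- colMeans(detP > 0.01)   # proportion of 'failed' probes per sample
bad_samples <- colnames(detP)[fail_fraction > 0.10]
bad_samples                              # candidates for exclusion
```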

Quality control of probes

Similar to sample quality control, it is customary to filter out probes for which a certain proportion of samples (e.g., >25%) have a detection P-value above a prespecified threshold (e.g., P>0.05) (Bibikova et al, 2011). In the IMA package (Wang et al, 2012), probes with missing values, those residing on the X chromosome and those with a median detection P-value >0.05 across samples can be filtered out; other packages allowing such filtering include MethyLumi (Davis et al, 2011) and the pipeline of Touleimat and Tost (2012).
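
The probe-level filter described above can be expressed in a few lines; detP is the detection P-value matrix from the previous sketch, betas a matching matrix of β-values, and the 0.05/25% thresholds are the illustrative values quoted in the text.

```r
## Sketch: remove probes that fail detection (P > 0.05) in more than 25% of samples.
failed <- detP > 0.05
keep <- rowMeans(failed) <= 0.25
betas_filtered <- betas[keep, ]
```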

LumiWCluster (Kuan et al, 2010) includes a function for model-based clustering of methylation data using a weighted likelihood approach wherein higher-quality samples (i.e., those with a low median detection P-value) have larger weights and thus greater influence on the estimation of the mixture parameters for cluster inference. This approach avoids discarding probes, characteristic of hard-thresholding approaches, allowing the incorporation of all the data while accounting for the quality of individual observations.

A potential issue for quality control at the probe level stems from certain probes targeting CpG loci that have single-nucleotide polymorphisms (SNPs) near or within the probe sequence, or even in the target CpG dinucleotide itself; in fact, up to 25% of probes on the 450K array may be affected by an SNP (Liu et al, 2013). As methylation levels of a specific locus may be influenced by genotype (Dedeurwaerder et al, 2011a), investigators may want to remove those SNP-associated loci from their data, and several R packages have options for doing so (Touleimat and Tost, 2012; Wang et al, 2012). Genetic effects on methylation arrays should not be underestimated, however. As recently demonstrated in Fraser et al (2012), a large portion of population-specific DNA methylation levels may in fact be due to population-specific genetic variants, which are themselves affected by genetic or environmental interactions. Although rare SNPs are unlikely to affect methylation levels to a large extent, somatic mutations, such as driver mutations in a tumour, can have a large impact on methylation levels; hence the importance of subsequent sequencing-based validation.

Additional probes that a researcher may want to remove are the so-called ‘Chen probes’. A recently published paper showed that there may be spurious cross-hybridisation of Infinium probes on the 450K array and further suggested that cross-hybridisation to the sex chromosomes may account for the large gender effects that researchers have found on the autosomal chromosomes (Chen et al, 2013). Finally, a number of SNP probes included on the Infinium array can help identify mislabelled samples, as implemented in wateRmelon (Pidsley et al, 2013).
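
A simple way to apply these exclusions is to drop the relevant probe identifiers from the β-value matrix; in the sketch below, snp_probes and cross_reactive_probes are hypothetical character vectors of probe IDs compiled, for example, from the array annotation and from the supplementary lists of Chen et al (2013), and betas_filtered is the matrix from the previous sketch.

```r
## Sketch: drop SNP-associated and cross-reactive ('Chen') probes.
## 'snp_probes' and 'cross_reactive_probes' are assumed character vectors of probe IDs.
drop_probes <- union(snp_probes, cross_reactive_probes)
betas_clean <- betas_filtered[!rownames(betas_filtered) %in% drop_probes, ]
dim(betas_clean)
```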

Background correction

Background correction is platform specific; it helps to remove nonspecific signal from the total signal and corrects for between-array artefacts. Although this can be performed using Illumina’s GenomeStudio, several R packages contain background correction functions. These include the preprocessing and analysis pipeline for 450K data (Touleimat and Tost, 2012), which provides background-level correction through lumi (Du, 2008), as well as limma (Wettenhall and Smyth, 2004) and MethyLumi (Davis et al, 2011). Background can also be estimated directly from the density modes of the intensities measured by each probe. However, the latter has been shown to produce aberrant DNA methylation profiles, so using the negative control probes may be preferred (Touleimat and Tost, 2012). Minfi (Hansen and Aryee, 2013) can also be used for background estimation; however, the authors acknowledge that this method may result in differing values compared with those estimated via GenomeStudio.
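
As one example, the sketch below applies GenomeStudio-style background correction via minfi using the negative control probes; rgSet is assumed to be an RGChannelSet, and normalisation is switched off so that only the background step is performed.

```r
## Sketch: control-probe-based background correction in minfi.
library(minfi)
mset_bg <- preprocessIllumina(rgSet, bg.correct = TRUE, normalize = "no")
betas_bg <- getBeta(mset_bg)
```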

Normalisation

Normalisation concerns the removal of sources of experimental artefacts, random noise and technical and systematic variation caused by microarray technology, which, if left unaddressed, has the potential to mask true biological differences (Sun et al, 2011a). Two different types of normalisation exist: (1) between-array normalisation, removing technical artefacts between samples on different arrays, and (2) within-array normalisation, correcting for intensity-related dye biases (Siegmund, 2011).

Owing to the features of DNA methylation, there is a lack of consensus regarding the optimal approach for normalisation of methylation data. Specifically, there is an imbalance in methylation levels throughout the genome creating a skewness to the methylation log-ratio distribution; the degree of this skewness is dependent on the levels of methylation in particular samples (Siegmund, 2011). This imbalance is due to the non-random distribution of CpG sites throughout the genome and the link between CpG density and DNA methylation; for instance, CpG islands (CGI) are often unmethylated, whereas the opposite relationship is typically seen in non-CGIs in normal human cells (Baylin and Jones, 2011). Furthermore, total fluorescence signal is inversely related to DNA methylation levels (Siegmund, 2011). Many available normalisation methods were designed for gene expression array data and are based on assumptions that may not be appropriate for DNA methylation microarray data.

GenomeStudio provides an internal control normalisation method for the 450K assay (Illumina, 2008), which is also used in MethyLumi (Davis et al, 2011) and Minfi (Hansen and Aryee, 2013); by default, GenomeStudio uses the first sample on the array as the reference and allows the user to reselect the reference sample if the original sample is nongenomic or of poor quality.

Quantile normalisation is one of the most commonly used normalisation techniques. Locally weighted scatterplot smoothing (LOESS) normalisation is an intensity-dependent normalisation method that assumes independence between the difference in log fluorescence signals between two samples and the average of the log signals from the two dyes (Siegmund, 2011). Quantile and LOESS normalisation (Laird, 2010) assume similar total signal across samples, an assumption unlikely to hold for methylation data given the nature of DNA methylation described above, and can therefore remove true biological signal. As the Infinium I and II probe types examine different subsets of the genome, described in detail below, quantile normalisation cannot be applied indiscriminately across probe types.

Lumi (Du, 2008), also used in HumMethQCReport (Mancuso et al, 2011), offers an alternative to quantile normalisation through robust spline normalisation, which is designed to normalise variance-stabilised data by combining features of both quantile and LOESS normalisation (Du, 2008). Another approach, subset quantile normalisation (Wu and Aryee, 2010), normalises the data on the basis of a subset of negative control or CpG-free probes that are independent of DNA methylation, but suffers the same issues as other quantile approaches. The TurboNorm R package (van Iterson et al, 2012) provides an alternative to LOESS normalisation using a weighted P-spline intensity-dependent normalisation technique and can be applied to two-colour arrays. Another method (Johnson et al, 2007), which we describe in more detail below, performs both normalisation and batch-effect correction. A comparison of different normalisation pipelines for Illumina 450K data can be found in two recent publications (Marabita et al, 2013; Pidsley et al, 2013).
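
For completeness, the sketch below shows a generic between-array quantile normalisation of M-values using limma's normalizeQuantiles(); it ignores the methylation-specific caveats discussed above (probe-type differences, skewed β distributions) and is shown only to illustrate the basic operation on a matrix betas_clean of β-values.

```r
## Sketch: generic between-array quantile normalisation of M-values.
library(limma)
b <- pmin(pmax(betas_clean, 1e-6), 1 - 1e-6)  # keep beta-values away from 0 and 1
m_values <- log2(b / (1 - b))                 # logit2 transform to M-values
m_norm <- normalizeQuantiles(m_values)
```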

Type I and II probe scaling

Another potential methodological concern stems from the fact that the 450K array uses two different types of probes, prompting the recommendation of rescaling to make the probe distributions comparable (Bibikova et al, 2011). Specifically, the 450K array has 485 577 probes, of which 72% use the Infinium type II primer extension assay, in which the unmethylated (red channel) and methylated (green channel) signals are measured by a single bead (Bibikova et al, 2011). The remainder use the Infinium type I primer extension assay (also used on the 27K Infinium array), in which the unmethylated and methylated signals are measured by different beads in the same colour channel (Bibikova et al, 2011). Importantly, the two probe types differ in terms of CpG density, with more CpGs mapping to CpG islands for type I probes (57%) than for type II probes (21%) (Bibikova et al, 2011). Moreover, compared with Infinium I probes, the range of β-values obtained from Infinium II probes is smaller; Infinium II probes also appear to be less sensitive for the detection of extreme methylation values and display greater variance between replicates (Dedeurwaerder et al, 2011a).

The divergence in the methylation distribution range has implications for statistical analysis of the array data. For example, in a supervised analysis of all probes, an enrichment bias towards type I probes may be created when ranking probes because of their larger dynamic range (Maksimovic et al, 2012). In addition, region-based analyses assume that probes within those regions are comparable, an assumption that is potentially untenable because of the diverging chemistries on the 450K array (Maksimovic et al, 2012). Moreover, when performing profile analyses or clustering, the differing chemistries between the two probe types may drive the clustering solution.

Attempts have been made to use rescaling to ‘repair’ the divergence between these two types of probes. The first correction method proposed was peak-based correction (Dedeurwaerder et al, 2011a), implemented in IMA (Wang et al, 2012), wherein the Infinium II data is rescaled on the basis of the Infinium I data assuming a bimodal shape of the methylation density profiles. However, several researchers have noted that this method is sensitive to variation in the shape of DNA methylation density curves and does not work well when the density distribution does not exhibit well-defined peaks or modes (Pan et al, 2012; Touleimat and Tost, 2012; Teschendorff et al, 2013).

Three alternative approaches have been proposed recently to address the limitations of the peak-based correction approach. The first, subset-quantile within-array normalisation (SWAN) (Maksimovic et al, 2012), is available in Minfi. SWAN determines an average quantile distribution using a subset of probes defined to be biologically similar on the basis of CpG content, allowing the Infinium I and II probes to be normalised together (Maksimovic et al, 2012).

The second approach (Touleimat and Tost, 2012) uses the genomic location of CpGs to define probe categories within which subset quantile normalisation is applied. The reference quantiles used in this approach are based on type I probes with significant detection P-values (Touleimat and Tost, 2012).

Finally, the β-mixture quantile dilation (BMIQ) normalisation method, implemented in the wateRmelon package (Pidsley et al, 2013), uses quantiles to normalise the type II probe values into a distribution comparable to that of the type I probes. A β-mixture model is fitted to the type I and type II probes separately, and the probabilities of class membership of the type II probes are then transformed into quantiles (β-values) using the parameters of the β-distributions fitted to the type I probes (Teschendorff et al, 2013). This method uses a three-state β-mixture model but does not use the fit to the middle ‘hemimethylated’ component in the normalisation; therefore, it does not require a trimodal distribution (Teschendorff et al, 2013). An advantage of BMIQ is that it avoids selecting subsets of probes matched for biological characteristics, as done in the previous methods, and it was found to be the best algorithm for reducing probe design bias in a recent comparison (Marabita et al, 2013).
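
The sketch below illustrates two of the probe-type corrections discussed above: SWAN as implemented in minfi, and BMIQ following the interface of the original Teschendorff script distributed with wateRmelon. Here rgSet is an RGChannelSet; for BMIQ, beta_v is a vector of β-values for one sample, design_v codes each probe as 1 (type I) or 2 (type II), and the nbeta element is assumed to hold the corrected values.

```r
## Sketch: probe-type correction for 450K data.
## (1) SWAN via minfi
library(minfi)
mset_swan <- preprocessSWAN(rgSet)
betas_swan <- getBeta(mset_swan)

## (2) BMIQ via wateRmelon (applied one sample at a time)
library(wateRmelon)
bmiq_out <- BMIQ(beta_v, design_v)   # design_v: 1 = Infinium I, 2 = Infinium II
betas_bmiq <- bmiq_out$nbeta
```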

Rescaling using the methods mentioned above may be unnecessary when analysing 450K data on a CpG-by-CpG basis because the comparisons will be made at the individual probe level.

Adjustment for batch/plate/chip/other confounders

DNA methylation arrays are susceptible to batch effects: technical artefacts that are associated not with the biological question of interest but with unrelated factors such as laboratory conditions or the time of the experiment (Leek et al, 2010; Sun et al, 2011b). Normalisation has been shown to reduce some, although not all, components of batch effects (Teschendorff et al, 2009; Leek et al, 2010; Sun et al, 2011b). Sound study design is critical for proper evaluation of and correction for batch effects: for instance, samples from different study groups should be split randomly or equally across batches (Johnson et al, 2007). By properly correcting for batch effects, one can combine data from multiple batches, enabling greater statistical power to measure a specific association of interest (Johnson et al, 2007).

Several methods have been proposed to adjust for batch effects. ComBat uses an empirical Bayes procedure (Johnson et al, 2007), is robust to outliers in small sample sizes and can adjust for other potential confounders along with batch (Sun et al, 2011b). However, this method can be computationally burdensome and was initially developed for gene expression data; it therefore requires a transformation of the methylation data, which follow a β-distribution, to satisfy the assumption of normality.
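
A minimal ComBat sketch is shown below; m_values is a probes x samples matrix of M-values (ComBat assumes approximately Gaussian data), batch is a factor of batch labels and group is the biological factor of interest to be preserved during adjustment.

```r
## Sketch: empirical Bayes batch correction with ComBat (sva package).
library(sva)
mod <- model.matrix(~ group)                        # protect the biological signal
m_combat <- ComBat(dat = m_values, batch = batch, mod = mod)
```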

Other R packages exist to adjust for batch effects. MethLAB (Kilaru et al, 2012) and CpGassoc (Barfield et al, 2012) allow adjustment for batch using a mixed-effects model framework. However, because these methods do not directly adjust the data, unlike ComBat, they should be used only for a locus-by-locus analysis.

The array literature indicates that array position effects may also exist (van Eijk et al, 2012), and thus new batch correction techniques may be needed to take those into account. When the phenotype distribution is heterogeneous across chips, which can occur in small samples even after randomisation, methods such as ComBat can fail; in this case, linear mixed-effects models treating chip effects as random are an alternative.

However, in certain cases, the true sources of batch effects or confounding are unknown or cannot be adequately modelled statistically (Leek et al, 2010). In such cases, two methods, surrogate variable analysis (SVA) (Leek and Storey, 2007) and independent surrogate variable analysis (ISVA) (Teschendorff et al, 2011), the latter also available as the ISVA R package, are very useful. Surrogate variable analysis estimates the sources of batch effects directly from the array data, and the surrogate variables (SVs) estimated with SVA can then be included in the statistical model as covariates (Leek and Storey, 2007). A modified version of SVA, ISVA identifies features correlating with the phenotype of interest in the presence of potential or unknown confounding factors, which are modelled as statistically independent surrogate variables (ISVs) (Teschendorff et al, 2011). This method can also be used for batch effects by constructing ISVs associated with batch as a potential confounder and including them in the analytical model. A problem with this technique arises when the ISVs correlate both with the phenotype of interest and with the potential confounders, making model covariate selection difficult. Furthermore, unlike ComBat, ISVA and SVA do not directly adjust the methylation data, which may be problematic if the analytical goal is clustering. One could, however, fit a model with the estimated SVs or ISVs and compute the residuals for subsequent analyses.
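
The sketch below estimates surrogate variables with the sva package and appends them to the design matrix of a locus-by-locus model; m_values and group are as in the previous sketch.

```r
## Sketch: surrogate variable analysis (SVA) for unknown confounders.
library(sva)
mod  <- model.matrix(~ group)                          # full model with phenotype
mod0 <- model.matrix(~ 1, data = data.frame(group))    # null model (intercept only)
sv_obj <- sva(m_values, mod, mod0)
design <- cbind(mod, sv_obj$sv)                        # surrogate variables as covariates
```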

Downstream analysis

Methylation status

Average β or the β-value is a commonly used metric to denote the level or percentage of methylation of an interrogated locus. Investigators also use the M-value, or log ratio, to measure methylation (Du et al, 2010):

$$M\text{-value} = \log_2\!\left(\frac{\max(M, 0) + \alpha}{\max(U, 0) + \alpha}\right)$$

where α is a small offset (typically 1) that stabilises the ratio when intensities are low.

A normalised M-value near 0 signifies a semimethylated locus, a positive M-value indicates that more molecules are methylated than unmethylated, and negative M-values have the opposite interpretation (Du et al, 2010). The M-value is attractive in that it can be used in many statistical models derived for expression arrays that assume normality (Du et al, 2010). However, β-values are much more biologically interpretable than their counterpart; furthermore, a recent paper found supervised principal components analysis (SPCA), as described below, to work better with β-values than with M-values (Zhuang et al, 2012). The relationship between the β- and M-value is captured by (Du et al, 2010):

$$M\text{-value} = \log_2\!\left(\frac{\beta}{1-\beta}\right), \qquad \beta = \frac{2^{M\text{-value}}}{2^{M\text{-value}} + 1}$$

(ignoring the offset α).
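
In practice, conversion between the two scales is a one-line transformation; the sketch below clamps β-values away from 0 and 1 before the logit, a common precaution rather than part of the published definition.

```r
## Sketch: converting between beta- and M-values (Du et al, 2010).
beta_to_m <- function(beta, eps = 1e-6) {
  beta <- pmin(pmax(beta, eps), 1 - eps)   # avoid log of 0 or division by 0
  log2(beta / (1 - beta))
}
m_to_beta <- function(m) 2^m / (2^m + 1)
```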

Differential methylation/region-based analysis

Locus-by-locus analyses examine the relationship between a phenotype of interest and methylation of individual CpG sites across the genome, seeking to find differentially methylated sites. Differential methylation analysis aims to determine methylation differences between specific groups (such as cases and controls), whether described as probe-wise or locus-specific; the two terminologies are equivalent at the individual locus level. A very simple example is Delta B (Bibikova et al, 2011; Touleimat and Tost, 2012), in which, for each CpG locus, the difference between the two groups’ median methylation values is computed; if the absolute difference in medians exceeds 0.2, the locus is considered differentially methylated. This 0.2 threshold corresponds to the recommended difference in methylation between samples that can be detected with 99% confidence (Bibikova et al, 2011). MethVisual (Zackay and Steinhoff, 2010) tests whether the methylation status of each CpG site is independent of group membership using Fisher’s exact test; other packages offering locus-by-locus testing include limma (Wettenhall and Smyth, 2004), CpGassoc (Barfield et al, 2012), MethLAB (Kilaru et al, 2012) and IMA (Wang et al, 2012), some of which allow adjustment for potential confounders (Barfield et al, 2012; Kilaru et al, 2012; Wang et al, 2012). Minfi (Hansen and Aryee, 2013) uses linear regression or an F-test to test for a univariate association between the methylation of individual loci and continuous or categorical phenotypes, respectively. When sample sizes are <10, Minfi (Hansen and Aryee, 2013) has options for using limma (Wettenhall and Smyth, 2004). Specifically, limma computes an empirical Bayes moderated t-test for each probe, which is similar to an ordinary t-test except that the standard errors are shrunk towards a common value. M-values should be used in these cases because these tests are based on a Bayesian Gaussian model and therefore rely heavily on the Gaussianity assumption (Zhuang et al, 2012); a sketch is given below. The IMA package (Wang et al, 2012) allows site (methylation locus)-specific and region (all loci in a gene)-specific differential methylation analysis using Student’s t-test and empirical Bayes statistics. For region analysis, IMA computes the mean, median or Tukey’s biweight robust average of the loci within that region to create an index (Wang et al, 2012). Methylkit (Akalin et al, 2012) allows analysis at the site or regional level using logistic regression or Fisher’s exact test. With multiple samples per group, methylkit will preferentially use logistic regression, which also enables the inclusion of potential confounders (Akalin et al, 2012); to obtain stable estimates of the regression coefficients in logistic regression, about 10 events per variable are necessary (Peduzzi et al, 1996).
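
A minimal sketch of the limma moderated t-test approach mentioned above follows; m_values (probes x samples matrix of M-values) and group (a two-level phenotype factor) are assumed to exist from earlier steps, and the number of reported loci is illustrative.

```r
## Sketch: locus-by-locus differential methylation with limma on M-values.
library(limma)
design <- model.matrix(~ group)
fit <- lmFit(m_values, design)
fit <- eBayes(fit)                                      # moderated t-statistics
top_cpgs <- topTable(fit, coef = 2, number = 20, adjust.method = "BH")
```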

Differential methylation analysis can also be performed by assessing variability in methylation across loci, as opposed to using statistical tests based on differences in mean methylation (Xu et al, 2013). This is available in the EVORA package, which allows an investigator to identify CpGs showing differential variability in methylation and to associate them with a phenotype of interest, such as cancer status (Teschendorff and Widschwendter, 2012; Xu et al, 2013).

As noted in several recent works, nearby CpG loci tend to have methylation levels that are highly correlated (Leek et al, 2010). As a result, statistical analyses that assume independence may be problematic. Methods are being developed to deal with this potential problem and include bump-hunting techniques (Leek and Storey, 2007), which take into account CpG proximity and borrow strength across neighbouring probes. Although these approaches were originally developed for CHARM assays, they may be adapted to the less-dense 450K array, pending careful attention to the tuning parameters for defining a ‘region’.

Although the above methods have proved successful in identifying individual CpG sites that associate with some phenotype/exposure of interest, the extent to which the methylation of these sites reflects true changes to the methylome or represents heterogeneity in underlying cell-type distributions depends largely on the tissue being sampled (Teschendorff et al, 2009; Houseman et al, 2012). We recently developed a set of statistical methods that exploit leukocyte-specific differentially methylated regions (DMRs) for inferring changes in cell mixture proportions based solely on peripheral blood profiles of DNA methylation (Houseman et al, 2012). Under certain constraints, this approach can be used to approximate the underlying distribution of cell proportions among samples comprising a heterogeneous mixture of cell populations with distinct DNA methylation profiles (Houseman et al, 2012). This method has recently been used to predict cell-type proportions, which were then added as covariates in a differential methylation analysis of rheumatoid arthritis cases and controls (Liu et al, 2013). Furthermore, the methods of Houseman et al (2012) were recently validated using a publicly available data set (Lam et al, 2012) consisting of both PBMC-derived DNA methylation profiles and complete blood cell (CBC) counts for 94 healthy, non-diseased adult subjects (Koestler et al, 2013).
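
One widely used implementation of this reference-based deconvolution is minfi's estimateCellCounts(); the sketch below assumes a blood-derived RGChannelSet and that the FlowSorted.Blood.450k reference package is installed.

```r
## Sketch: estimating leukocyte proportions from whole-blood 450K data
## (reference-based approach of Houseman et al, 2012).
library(minfi)
cell_props <- estimateCellCounts(rgSet)   # samples x cell types matrix of proportions
head(cell_props)
```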

Clustering/profile analysis

Clustering refers to the grouping of objects into clusters, such that the objects within the same cluster are more similar compared with objects in different clusters. Owing to the interest in identifying molecular subtypes in the context of cancer, clustering has become a staple technique in the analysis of array-based DNA methylation data.

Two well-known non-hierarchical methods used to cluster DNA methylation data are K-means and K-medoids, also known as partitioning around medoids or PAM (Pollard et al, 2005). Two disadvantages of K-means are that it requires prespecification of the number of classes, which is often not known, and that it creates clusters based only on the first moment, which is problematic when the variance of a specific probe contains biologically important information. Another commonly used method to detect patterns in methylation data is principal component analysis (PCA), a latent variable method often applied as a dimension reduction procedure and used for the detection of batch effects (Jolliffe, 2002). Principal component analysis was first applied to genome-wide Infinium HumanMethylation27 DNA methylation data in Teschendorff et al (2009). It derives a smaller number of artificial variables, called principal components, that account for most of the variance in the observed variables of a data set (Jolliffe, 2002); usually only the first few components are kept as potential predictors for statistical modelling (Jolliffe, 2002). However, additional principal components may be of biological significance, as shown in Teschendorff et al (2009). A method to estimate the number of significant PCA components is available in the ISVA package (Teschendorff et al, 2011). This algorithm is based on Random Matrix Theory (Plerou et al, 2002), which estimates the number of significant components of a data covariance matrix by comparing the statistics of the observed eigenvalues obtained from PCA with those obtained from a random matrix; the significant components are subsequently examined for their association with study-specific characteristics. The main disadvantage of PCA lies in the poor interpretability of the resulting principal components and the requirement of a large sample size to obtain reliable results.
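
A PCA sketch on M-values is shown below, colouring samples by a hypothetical batch factor to screen for technical structure; m_values and batch are assumed to exist from earlier steps.

```r
## Sketch: PCA of M-values to inspect major sources of variation.
pca <- prcomp(t(m_values))                         # samples in rows after transposition
summary(pca)$importance[, 1:5]                     # variance explained by the first PCs
plot(pca$x[, 1], pca$x[, 2],
     col = as.numeric(factor(batch)),              # colour by batch to spot batch effects
     xlab = "PC1", ylab = "PC2")
```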

Another well-known clustering method is hierarchical clustering, which builds a binary tree by successively merging similar samples or probes based on a measure of similarity (Eisen, 1998). However, because of its unsupervised nature, this form of clustering may or may not predict a phenotype of interest, as it does not use data beyond methylation to form clusters. Lumi (Du, 2008), HumMeth27QCReport (Mancuso et al, 2011) and methylkit (Akalin et al, 2012) all provide hierarchical clustering and PCA options using normalised M-values.
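
The corresponding hierarchical clustering sketch is below; Euclidean distance and average linkage are illustrative choices rather than recommendations, and m_values is as above.

```r
## Sketch: unsupervised hierarchical clustering of samples on M-values.
d <- dist(t(m_values), method = "euclidean")
hc <- hclust(d, method = "average")
plot(hc, labels = colnames(m_values))
clusters <- cutree(hc, k = 2)                      # cut the dendrogram into two groups
```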

In addition to non-parametric techniques for clustering or profile analysis, Houseman et al (2008) developed a recursive-partitioning mixture model (RPMM), an unsupervised, model-based, hierarchical clustering methodology for array-based DNA methylation data. The RPMM uses a β-mixture model to recursively split samples into subgroups and provides an estimate of the number of clusters; furthermore, it is computationally efficient relative to the standard finite mixture model approach (Houseman et al, 2008). Owing to the inherent correlation in the methylation status of nearby CpG sites, there have also been efforts to incorporate correlation structures based on the proximity of CpGs in the context of RPMM (Leek et al, 2010).
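
A minimal RPMM sketch using the RPMM package is shown below; betas_top is assumed to be a samples x probes matrix of β-values, often restricted to the most variable probes, and blcTreeLeafClasses() is assumed to return the per-sample cluster labels.

```r
## Sketch: beta-mixture RPMM clustering (Houseman et al, 2008).
library(RPMM)
rpmm_fit <- blcTree(betas_top, verbose = 0)
rpmm_classes <- blcTreeLeafClasses(rpmm_fit)   # cluster assignment for each sample
table(rpmm_classes)
```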

Semisupervised methods use both array-based genomic data and clinical data for identifying profiles that are associated with a clinical variable of interest, such as survival. Semisupervised clustering (SS-Clust) begins by identifying a set of genes that correlate with a phenotype of interest, followed by unsupervised clustering of samples based on the set of genes (Bair and Tibshirani, 2004). Supervised principal components analysis uses a similar methodology to SS-Clust, but replaces unsupervised clustering with PCA, providing a ‘risk score’ for each patient, which is then used as a continuous predictor of survival (Jolliffe, 2002). Semisupervised clustering’s main disadvantage is that it requires prespecification of the number of clusters; moreover, SPCA inherits the interpretability issues characteristic of PCA. Semisupervised RPMM (Koestler et al, 2010) has been shown to outperform SS-Clust and SPCA under certain circumstances and does not require the prespecification of the number of clusters.

One of the first attempts to discover novel tumour classes through profiling of methylation data involved a supervised method, the support vector machine (SVM), together with a cross-validation procedure to evaluate its prediction performance (Adorjan et al, 2002). This approach was initially very computationally intensive but was a precursor to other profile analysis methods. Another method, the elastic net, is a shrinkage and selection method that produces a sparse model with good prediction accuracy while encouraging a grouping effect (Zou and Hastie, 2005); this algorithm is now widely used on all types of omics data (Barretina et al, 2012; Hannum et al, 2013) and was shown to be far superior to SVM and SPCA in a comparison by Zhuang et al (2012).

Pathway analysis

Many researchers use pathway analysis to characterise the function of the genes in which individual loci or groups of loci are located. Several software packages do this; however, we focus on two freely available resources that can also be used in R. The Gene Ontology (GO) provides a very detailed representation of functional relationships between biological processes, molecular functions and cellular components across eukaryotic biology (Ashburner et al, 2000). Another resource that borrows heavily from GO is PANTHER (Thomas et al, 2003), which relates protein sequence relationships to functional relationships. However, many commonly used pathway analysis methods are based on gene expression correlation or protein–protein interaction; although pathway perturbations are likely to be evident in expression changes across all genes of a pathway, a single well-placed alteration of DNA methylation, acting as an epigenetic switch, may alter all downstream mRNA expression. In light of this, sensitivity for detecting significant pathways is lower for DNA methylation than it might be for mRNA expression. In addition, unlike mRNA expression, CpGs have different implications for expression depending on where they lie in relation to a gene, or whether they map to any gene at all. As the 450K array has great heterogeneity with respect to CpG representation by gene region, pathway analysis of 450K data has the potential to be biased by CpG selection. In addition, because genes are not equally covered across the array in terms of the number of probes in their specific regions, this may further bias the analysis. Therefore, when using such approaches, we recommend stratification by gene region (e.g., promoter) to decrease the potential for bias; a sketch is given below. Once a specific region has been chosen, pathway analysis, gene set enrichment analysis (GSEA) or integration with interaction networks can be a fruitful procedure, as recently demonstrated in Dedeurwaerder et al (2011b) and West et al (2013).
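
The sketch below illustrates the gene-region stratification recommended above, restricting 450K probes to promoter-associated CpGs via the Illumina annotation available through minfi; the column names follow the IlluminaHumanMethylation450kanno.ilmn12.hg19 package, and the resulting gene list could then be passed to GO, PANTHER or GSEA tools.

```r
## Sketch: restrict 450K probes to promoter regions before pathway analysis.
library(minfi)
ann <- getAnnotation(rgSet)                                  # probe annotation
promoter_groups <- c("TSS200", "TSS1500")
is_promoter <- vapply(strsplit(ann$UCSC_RefGene_Group, ";"),
                      function(g) any(g %in% promoter_groups), logical(1))
promoter_probes <- rownames(ann)[is_promoter]
genes <- unique(unlist(strsplit(ann$UCSC_RefGene_Name[is_promoter], ";")))
```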

Multiple testing correction

Once the analysis has identified top hits, multiple testing correction is necessary to reduce the likelihood of identifying false-positive loci by adjusting statistical confidence measures for the number of tests performed. The Bonferroni correction consists of multiplying each P-value by the total number of tests performed; this controls the family-wise error rate (Holm, 1979).

A less-conservative and widely used approach involves controlling the false discovery rate (FDR; q-value), the expected proportion of false discoveries among the discoveries; this also uses a sequential P-value method (Benjamini et al, 2001). Several R packages allow adjustment for the FDR (Barfield et al, 2012; Kilaru et al, 2012; Wang et al, 2012). All of the aforementioned methods assume statistical independence of the multiple tests, which can be violated when tests exhibit strong correlations (as mentioned above); furthermore, q-values imply subsequent validation in an independent sample, which may not occur. A potential solution to the independence assumption is permutation testing, in which the phenotype of interest is randomly re-assigned and the data reanalysed; CpGassoc provides a permutation testing option to obtain empirical P-values (Barfield et al, 2012).
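
The sketch below shows both adjustments and a simple max-statistic permutation scheme; pvals is a vector of per-CpG P-values and m_values/group are as above; looping lmFit over a full 450K matrix 1000 times is shown only for illustration and would be slow in practice.

```r
## Sketch: multiple-testing adjustment and permutation-based empirical nulls.
library(limma)
p_bonf <- p.adjust(pvals, method = "bonferroni")   # family-wise error rate control
p_fdr  <- p.adjust(pvals, method = "BH")           # false discovery rate control

set.seed(1)
perm_stats <- replicate(1000, {
  g_perm <- sample(group)                          # randomly re-assign the phenotype
  fit <- eBayes(lmFit(m_values, model.matrix(~ g_perm)))
  max(abs(fit$t[, 2]))                             # maximum moderated t across probes
})
```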

Validation of significant hits

The final step in the proper processing and analysis of DNA methylation arrays is validation of significant hits by an independent experimental approach or data resource. The gold standard is bisulphite sequencing-based methods, such as pyrosequencing (Ammerpohl et al, 2009) and EpiTYPER (Laird, 2010), which provide high-throughput quantitation (Siegmund, 2011). Another valuable resource for validation (and exploration) of DNA methylation array data is publicly available repositories such as the Gene Expression Omnibus (Edgar et al, 2002). Finally, with the availability of data resources such as the above and HapMap (Altshuler et al, 2010), researchers can now integrate their methylation array data with these resources to help further understand the molecular and genomic profiles that contribute to outcomes of interest, such as cancer risk.

Conclusions

Owing to the plethora and complexity of methods for array processing and analysis described above, and to the multitude of researchers using DNA methylation arrays, there is a need for a protocol of good practice to ensure that study results are of the highest possible quality. Just as gold-standard laboratory methods are crucial to the generation of quality biological data, gold-standard processing and analytical methods are equally important. Through proper use of the processing and analysis flowchart described above, we hope that potential users will best harness these powerful array-based tools, which will in turn lead to rapid discoveries in human health and disease.