Introduction

The availability of high-throughput technology for capturing DNA sequence variation has spawned the growing field of vaccinomics, which attempts to interrogate the role of heritable host genetic variation on vaccine immune response and reactogenicity.1, 2 Although the scope of genotyping continues to grow increasingly comprehensive, most analyses to date in vaccine applications have focused only on the marginal effects of host polymorphisms, that is, the effect of genetic variants in isolation from other genetic and environmental factors. Recent genome-wide association studies have identified hundreds of variants associated with common diseases and traits; however, these genetic variants only account for a small amount of the inherited variation, leaving much of the variation to be discovered.3

In a recent study, we developed a more global approach for characterizing single-nucleotide polymorphisms (SNPs) in a genetic association study by treating the system as a phenotype-specific genetic association interaction network (GAIN).4 Small amounts of the inherited variation are encoded in numerous SNP nodes throughout this network in the form of main effects and gene–gene interactions. In this study, we introduce SNPrank, which we conceptualize as a random SNP surfer (RSS) that circulates through the GAIN, accumulating bits of information from each SNP with regard to the phenotype. SNPrank represents a modified eigenvector centrality algorithm that ranks the importance of each SNP through the complex interactions encoded in the GAIN.

The notion of centrality concerns the importance of a node in a network. For a social network, centrality quantifies how central or influential a person is in the network. Degree centrality simply counts the number of connections that a node has in a network (equation 4). However, not all incoming connections are equally important: some SNPs may exhibit strong interactions that yield considerable information about the phenotype, whereas other SNPs may show weak interactions but with numerous connections (a hub) that influence the phenotype. Still other SNPs may not be highly connected but may be highly relevant to the phenotype, analogous to someone being an authority on a given topic. Eigenvector centrality—the basis of Google's well-known PageRank algorithm for ranking web pages5—balances the quantity of connections with the quality. The aim of the SNPrank eigenvector centrality algorithm presented here is to balance the main effect and network interaction effects in the genotype–phenotype relationship encoded in the GAIN. Moreover, by accounting for indirect connections in the recursion process, eigenvector centrality allows for the possibility of including higher-order interactions when ranking the relevance of a particular variant.

Just as the connectivity of a person's social network depends on context—such as family, friends or business relationships—there is growing awareness that genetic effects will exhibit differences in network connectivity that depend on contexts such as the environment, the organism or particular antigen perturbation in a vaccine.4 Most current network approaches neglect the phenotypic context when defining connections between genes. For example, traditional coexpression networks do not account for the context of the phenotype because connections are based on the correlation between biomarkers across all phenotype states.6 Recent studies have suggested that improvements in predicting biological connections may be gained when the context is restricted to a more relevant space. For example, when mining gene expression databases for coexpression, Oldham et al.6 found that restricting inputs to biologically relevant sample types yields a more informative network. In their case study, restricting the database to gene expression experiments involving relevant tissue or sample types allowed them to identify known functional genes that are absent from the network generated from all available samples. Similar improvements have been suggested in predicting the biological relationships between diseases from the human phenotype ontology as the resolution of these ontologies become finer grained.7 The data-driven GAIN network approach used in this study is not biased by restricting correlation to a particular phenotype state, but rather calculates the interaction between SNPs, which is a conditional correlation that conditions all phenotype states in the experiment.

The quality of SNPrank importance scores will depend critically on the quality of the starting network. A familiar analogy with social networks further motivates the need for a data-driven, context-sensitive network. Consider a hypothetical network in which actors represent SNPs and the context of acting in a drama represents immune response to vaccine phenotype. If we considered the context of acting in a comedy instead of in a drama, the network connections and the relative importance of nodes will change somewhat. Similarly, we would expect a different network and SNP importance scores if we considered the context of immunization with influenza instead of smallpox. We would expect an even more divergent social network if we considered the context of acting in a musical; similarly, a SNP network for vaccine immune response would look very different from a network for major depressive disorder. Therefore, an advantage of the data-driven network constructed by GAIN is that it includes the context of the phenotype when defining gene–gene interactions between SNP nodes (equation 1).

When ranking the importance of a webpage URL U, Google's PageRank not only takes into account the number of connections that U has but also considers the connectivity of the pages connected to U. This can be understood through the random surfer interpretation of PageRank in which web-surfing behavior is simulated, with the surfer either following the link structure of the Internet or going to a random page. The more frequently the random surfer lands on a page, the higher the page's rank. Similarly, SNPrank simulates an RSS that walks through the GAIN Markov chain. The more often the RSS lands on a SNP, the higher its SNPrank. To guard against human manipulation of the rank of a web page, PageRank does not reward pages with self-referential links. In contrast, SNPrank includes self-connections that arise from the independent relevance of that SNP to the phenotype. In the connectivity matrix, these self-connections are calculated using the SNP's information gain I2 with respect to the phenotype (equation 2). Self-connections represent the probability of the RSS to stay at the current node during a simulation step. GeneRank is an eigenvector centrality method for gene expression data that uses either a coexpression network or a gene ontology (GO) annotation network to define connections between genes.8 The GeneRank Markov chain incorporates single-marker differential expression information, but the connectivity matrix is not sensitive to the context of the phenotype. SNPrank uses a data-driven, phenotype-sensitive network of main effects and gene–gene interactions inferred using the recently developed GAIN tool.4

In summary, the SNPrank transition matrix T (equation 5) accounts for single-locus main effects through the information gain I2 (equation 2) along the diagonal of the GAIN matrix G (equation 3). It considers pairwise interactions through the interaction gain I3 (equation 1) on the off-diagonal elements of G and higher-order interactions through the network sampling by the RSS. To continue the social network analogy with the phenotype analogous to acting in a drama, the aim of our importance score is (A) to identify actors who are important in the network because of their ability to connect to other actors in dramas, such as Kevin Bacon, and (B) to identify actors who have strong individual importance to dramatic acting, such as Marlon Brando. SNPrank is designed to identify SNPs that are important to the phenotype because of their ability to connect to other SNPs (Bacon SNPs), as well as SNPs that are important because of their individual importance to the phenotype (Brando SNPs).

SNPrank

In the following sections, we describe the mathematical machinery and steps of the SNPrank algorithm. In brief, we first describe the information-theoretic GAIN connectivity matrix. Then, on the basis of GAIN, we construct the SNPrank Markov chain, which the RSS surfs. The construction of the Markov chain is constrained by the need to balance main and interaction effects and by the requirement that the Markov chain be a stochastic matrix. Finally, having satisfied these constraints, we are able to use the iterative power method to solve for the SNPrank eigenvector. The components of the eigenvector are the SNPrank scores of the corresponding SNPs. The software implementing GAIN and the proposed SNPrank algorithm is available at http://insilico.utulsa.edu/Software.

GAIN connectivity matrix

The performance of SNPrank depends on the quality and context sensitivity of the underlying network. Thus, we characterize the main effects and interactions among SNPs by inferring a GAIN for a particular genetic association study. GAIN is based on the interaction information, I3, between three variables (in this case, between two SNPs A and B and the class/phenotype variable C):

The quantities I2(A;C) and I2(B;C) are the information gained about the class/phenotype C when locus A or locus B, respectively, are measured. The information gain between two discrete random variables X and Y is given by

The quantity AB in equation 1 is a joint attribute constructed from attributes A and B, with states given by the Cartesian product of the states of A and B. I3 is then the gain in information obtained about the phenotype when considering A and B jointly (I2(AB;C)) above the phenotype information that would be gained by considering variables A and B independently (I2(A;C)+I2(B;C)). We use I3 as the connection strength of each edge in the GAIN. Thus, each edge represents the increase in information about the phenotype achieved by considering the two SNPs jointly, compared with the expected increase in information with the assumption of independence between SNPs. We emphasize that a connection between SNPs in GAIN is specific to the given phenotype because it measures the correlation between two SNPs that influences association with the phenotype. In the current work, we weigh the nodes and edges of the connectivity matrix using information theoretic quantities. However, other data-driven weighting schemes are compatible with SNPrank, such as the main effect and pairwise interaction estimates from regression models. One may also augment or replace the data-driven GAIN connectivity matrix with a matrix based on gene annotation or pathway knowledge.

The GAIN connectivity matrix, with weights gij, will be used below to construct the probability for the RSS to make a transition from SNPi to SNPj in the network. Off-diagonal weights are defined as the interaction, I3, between two distinct SNPs, conditional on the phenotype:

and the diagonal is a vector of information gains, I2, or main effect associations of each SNP with the phenotype.

SNPrank Markov chain

We now derive a new variant of the eigenvector centrality algorithm to determine the importance of SNPs in a genetic association study by transforming the GAIN matrix into a Markov chain transition matrix, T. The construction involves multiple scalings to ensure that the final matrix T (equation 5 below) is a stochastic matrix (that is, , where N is the number of SNP nodes) so that the recursion procedure will converge. We begin the construction by considering the elements of the GAIN matrix (equation 3), which are used to weigh the probability of the RSS to make a transition from SNPi to SNPj in the network. We scale the elements of the GAIN matrix by column sums, which are the out-degree association fluxes of each SNP:

The PageRank matrix includes a probability γ to follow direct connections, γg/d, and a corresponding term with the probability (1−γ) that a random page will follow. The latter may be interpreted as a fatigue term that models an Internet surfer that has grown tired of the current thread of links. Similar to GeneRank, we modify this probability to be proportional to the correlation of each SNP with the phenotype. But rather than use differential expression, we use the information gain of each SNP, which is given by the diagonal elements of G, gii. This requires added scaling, involving the trace of the G matrix to constrain T to be a stochastic matrix. Also note that in place of the gii term, one could use expert knowledge if one wished to enrich for certain biological pathways. Depending on the interaction gain (I3) cutoff used, some paths in the network may lead to closed cycles or absorbing states. An absorbing state occurs when a jump takes place to a SNP j that has nowhere to go, that is, dj=0. Again, rather than use a uniform probability when dj=0, we weigh the transition to another SNP by the information gain I2 of each SNP (the diagonal elements of G). The elements of the stochastic transition matrix T become

where γ is the probability of the RSS to follow a gene–gene interaction-weighted path in the network and 1–γ is the probability of the RSS to remain at a SNP weighted by the main effect strength. We use γ=0.85 in our data analysis.

SNPrank eigenvector scores

Beginning with some arbitrary vector r(o), the RSS random walk is simulated by the repeated application of the transition matrix T:

The frequency with which the RSS lands on a SNP node during a simulation through the network is related to the SNPrank importance score, which is obtained by normalizing the frequencies to yield a probability field. The final SNPrank score of SNPi is given by the ith element of the stationary probability vector, rs, which is the vector whose the elements do not change under application of the transition matrix:

Typically, rs is obtained in the limit of a large number of transitions k; however, SNPrank only requires a few recursion steps to converge for our data set. Finding rs reduces the problem of finding the eigenvector with eigenvalue λ=1, which can be solved by the power method.9 The Perron–Frobenius theorem ensures that the eigenvector rs exists, and that the largest eigenvalue associated with the stochastic matrix is always 1. The power method recursively applies equation 6, with T defined by equation 5, until the eigenvalue converges to 1 to within some small tolerance. Below is the pseudocode outline for the power method of calculating the SNPrank eigenvector.

  1. 1

    Initialize SNPrank eigenvector r(o), and tolerance ɛ

  2. 2

    Begin Power Method k=0 while r(k+1)r(k) < ɛ normalize vector r(k) r(k+1)=Tr(k) k=k+1

  3. 3

    final eigenvector: rs=r(k)

  4. 4

    SNPrank of ith SNP = rs(i)

Fast convergence is found for the eigenvalue when the probability elements of the initial eigenvector r(o) are chosen to be uniform, 1/N, N being the number of SNPs. More informed initial guesses, such as the normalized evaporative cooling (EC) feature selection scores4, 10 or the many variants of Relief-F,11, 12 may further speed up convergence.

Application to smallpox vaccine antibody response

We expect a combination of genetic main effects and interactions to influence the immune response to vaccine.13 Thus, we illustrate SNPrank using SNP data from a study of the human immune response after smallpox vaccination. Genotyping was performed using a custom SNP panel based on the NCI SNP500 Cancer project that has been described previously.14 The majority of SNPs included in the panel target soluble factor mediators and signaling pathways, many of which have immunological significance. Of the 1536 SNPs assayed, a total of 1442 genotypes passed standard quality control filters (minor allele frequency >0.01, Hardy–Weinberg equilibrium P-value >0.001, and SNP call rate >0.95). Previously, we used this panel to investigate the association between SNPs and adverse event phenotypes after smallpox vaccination.4, 15, 16 In the current study, we investigate associations in this SNP panel with postvaccination antibody response. Specimens were collected from 136 smallpox vaccine-naive volunteers at baseline before vaccination and at 28 days after vaccination. Serum-neutralizing antibody responses were measured by plaque reduction neutralization assay as described previously.17 We used mixture model density estimation to categorize the neutralizing antibody phenotype into high and low immune response. This Gaussian discretization yielded 76 low and 60 high immune responders.

For our primary SNPrank analysis, we filtered the 1442 SNPs to the top 100 using EC feature selection4 to allow for the combination of main and interaction effects in the initial screen. EC is a machine-learning algorithm that combines the ability of Random Forests to detect main effects with the ability of Relief-F to detect interactions. Our simulation studies have shown that EC greatly boosts the power of Random Forests and Relief-F to detect a combination of effects that may be present in a real data set. More details may be found in the study by McKinney and colleagues.4, 10 Filtering with a univariate approach, such as logistic regression, runs the risk of eliminating interacting SNPs. However, as a secondary analysis to show the effectiveness of SNPrank in conjunction with a conventional genetic association analysis, we also filtered SNPs through marginal hypothesis tests based on logistic regression implemented in PLINK.18 We assumed an additive model of inheritance in all models, subsequently using the top 100 SNPs ranked according to P-value to generate the GAIN for SNPrank. The filtered SNPs from both approaches were highly congruent. We present results using the EC filter because of its ability to capture a spectrum of genetic effects. From the filter list, we apply GAIN to construct the data-driven, information theoretic interaction network, and then we apply the SNPrank algorithm.

Figure 1 shows a bar plot resulting from the SNPrank importance scores. The SNPs are ranked by their chromosome positions, with the height of each bar corresponding to its SNPrank score. The top SNPrank variant (rs1805352) is an intronic SNP in the retinoid X receptor α (RXRA) gene, which is known to be a mediator of vitamin D signaling and has recently been shown to be involved in innate immune response.19, 20 Figure 2 shows the smallpox vaccine-specific immune response GAIN. For clarity, we only show the top 40 nodes as ranked by SNPrank. The top three SNPrank nodes and their connections are highlighted: RXRA (red); neutrophil cytosolic factor 2 (NCF2, green); and cytochrome P450, family 1, subfamily a, polypeptide 1 (CYP1A1, blue). Being the most interactive SNP in Figure 2, the influence of RXRA SNP on the smallpox vaccine antibody response is primarily because of its being a hub in the GAIN network. In addition to RXRA, GAIN indicates the importance of other vitamin regulation pathway genes for immune response following smallpox vaccination.

Figure 1
figure 1

SNPrank importance scores. SNPs are ordered by basepair position, with chromosome boundaries denoted by dotted lines. The height of each bar corresponds to its SNPrank score. The top SNPrank variant is rs1805352 in the retinoid X receptor α (RXRA) gene.

Figure 2
figure 2

Genetic association interaction network (GAIN) for the top 40 SNPs selected by SNPrank as being the most relevant to smallpox vaccine-associated immune response. The top 10 SNPs (labeled in Figure 1) have bold labels. The top three SNPrank nodes and their connections are highlighted: RXRA (red), NCF2 (green) and CYP1A1 (blue).

Discussion

The important role of RXRA in our SNPrank network analysis of smallpox antibody response is noteworthy, given the findings in recent studies of the influence of this vitamin A- and D-signaling mediator on human immune responses. Results from a recent study of Rubella vaccination suggested that an intronic SNP in RXRA influences the magnitude and type of cytokine response following vaccination.20 Another recent study of CpG-activated human B cells showed that nanomolar concentrations of RXRA and peroxisome proliferator-activated receptor (PPAR)γ ligands increase antibody production.19 Taken together, these studies suggest that variation in RXRA function may explain in part the variability of human adaptive immune responses following vaccination. Furthermore, the findings suggest that RXRA, PPARγ and pathways related to these molecules could be exploited for development of new adjuvants that enhance antibody responses.

The smallpox vaccine-specific immune response network (Figure 2) reveals a consistent relationship between vitamin regulation and immune response genes. In the GAIN in Figure 2, RXRA has a direct connection with toll-like receptor 2 (TLR2). This further supports the evidence in Ovsyannikova et al.20 for Rubella vaccine that polymorphisms in toll-like and vitamin A and D receptors influence adaptive immune response to vaccines more generally. In Figure 2, RXRA has an intermediate connection to RXRB, which is also a mediator of vitamin D signaling. The second-ranked variant on the SNPrank importance list is CYP1A1, which is in the pathway of chemical reactions resulting in the formation of 9-cis-retinoic acid, a metabolically active vitamin A derivative. The direct GAIN connection of a variant in 5,10-methylenetetrahydrofolate reductase (MTHFR) to RXRA (ranked first by SNPrank) and to NCF2 (ranked third by SNPrank) suggests a role for folic acid metabolism in antibody responses. Understanding the interactions in GAIN may inform the development of new vaccines and immunotherapies, and these interactions may explain the variability of antibody response to other vaccines and infections.

Using gene set enrichment, we obtain additional weight for RXRA as a candidate biomarker for vaccine immune response. Specifically, we use GOrilla, which is a web-based tool that identifies enriched GO terms in a ranked list of genes.21 We use the EC rank list to compare how the genes would be prioritized by GO enrichment versus the data-driven SNPrank approach. The most significantly enriched GO category has RXRA along with CTNNB1, GPX1 and KRAS annotated with the term, muscle cell differentiation (GO:0042692, P=6.5e-4). The role of muscle cell differentiation in smallpox vaccine immune response is unclear because of the intradermal route of smallpox vaccine administration. Nevertheless, multiple lines of evidence suggest a role for RXRA in vaccine antibody response.

In a conventional analysis alone, the RXRA variant found by SNPrank would have been neglected because of its marginally significant P-value of 0.03. In our primary analysis, we filtered the data according to EC, which maintains main effects and interactions.4 However, we also performed a SNPrank analysis with a logistic regression filter to determine whether SNPrank could recover important interaction effects. Indeed, with the main effect filter, the rank of the RXRA variant was boosted by SNPrank to the top position because of multiple network connections based on gene–gene interactions encoded in the GAIN. The initial ranking of the SNPs by logistic regression and EC differed considerably; however, the top list was nearly the same after SNPrank was applied to both filters. This suggests that SNPrank was able to boost the main effect and interaction effects regardless of filter for this data. Certainly, if an association represents a pure interaction with no marginal effect, then a filter based on main effects will lead to a loss in power. For example, NCF2 was ranked third in our final analysis, but it did not meet the logistic regression P-value threshold. Similarly, RXRB did not pass the main effect filter, whereas it was found to interact with RXRA through an intermediate gene in our primary analysis. Thus, SNPrank with a main effect filter is able to generate novel biological knowledge from genetic association studies through network interactions, suggesting it is a reasonable alternative to more computationally intense filters coupled with SNPrank.

The SNPrank algorithm uses a Markov recursion matrix that couples the individual importance of SNPs (main effects) and their interactions with other SNPs on the basis of the data-driven GAIN connectivity matrix. The GAIN matrix is inferred using information theory, but currently does not include covariate corrections. The SNPrank algorithm itself does not depend on the technique used to construct the network. Thus, SNPrank may use a covariate-corrected GAIN or a network based on other statistical weights, such as linear or logistic regression. Fitted coefficients for interaction and main effect terms in the regression model may be used to construct the Markov chain, and important covariates can be included in the model.

The main effects and interaction effects of GAIN are coupled in SNPrank by the probability γ in equation 5. When γ=1, the SNPrank score depends only on connectivity. In this limit, the RSS agent samples SNPs by following connections based on gene–gene interactions (I3) that influence the phenotype, but independent effects of SNPs have little influence on the RSS. When γ=0, the SNPrank scores only depend on the main effect of SNPs, which is based on the information gain (I2) between each SNP and the phenotype. In this limit, the SNPrank scores become correlated with a main effect score such as those obtained from logistic regression. Clearly, these two extremes place too much emphasis on main effects and gene–gene interactions, respectively, in the GAIN connection structure. The developers of PageRank prescribed the value of γ=0.85, which makes a simulated random web surfer more likely to follow the link structure of the web instead of navigating to a random web page. For SNPrank, we find that the rank order of the top SNPs is robust to values of γ that are not too extreme (close to 0 or 1). As our aim was to discover new biological knowledge, we selected γ=0.85 to enrich for weak higher-order network interactions without completely neglecting main effects. SNPrank gives very similar rankings for γ in the range (0.5, 0.85). The optimum value of γ is likely data dependent and may be estimated through context-dependent GO networks, previous data or other expert knowledge.