In 1995, the publication of the first complete genome sequence of a free-living organism generated huge excitement across many fields of research1. For microbiologists, as the organism in question was the bacterium Haemophilus influenzae, this milestone opened up the potential to address fundamental questions about bacterial pathogenesis. Since then, major advances in sequencing platforms, particularly the introduction of next-generation technologies, have resulted in a significant reduction in the cost of sequencing a bacterial genome (currently less than UK£50 per genome for Staphylococcus aureus (J. Parkhill, personal communication)), and some platforms now have a turnaround time of a day or less. Nevertheless, the ability to use the genome sequence alone to predict the potential for a bacterium to cause severe disease remains elusive.

The pathogenicity of a bacterium, or its ability to cause disease, is conferred by both the bacterium and the host, as it is a result of the interplay between the immune status of the host and the virulence factors encoded by the bacterium. Importantly, this interplay depends on how and when these bacterial factors are expressed. Defining the role of host immunity in disease outcome is crucial if tools to predict disease severity are to be built, but equally, we must be able to predict the virulence potential of a bacterial strain from its genome sequence. Although sequencing can list which virulence factor-encoding genes are present in a genome, without an understanding of the regulatory and epistatic processes that control their expression, the contribution of this list of genes to virulence cannot be quantified. With a more comprehensive understanding of the combinations of genetic backgrounds, regulatory networks and virulence factors that produce virulent strains, researchers might be better able to rapidly predict the propensity of a particular strain to cause severe and transmissible disease. In this Opinion article, we outline how a systems biology approach might just be the tool to help, using the important human pathogen S. aureus as a model.

Overcoming current limitations

Many specific definitions of systems biology exist. For the purposes of this article, systems biology is defined as an interdisciplinary approach that focuses on interactions in biological systems2. A typical systems biology approach is to describe the components of a biological system and how they interrelate by means of a mathematical model, which is then validated through iterative cycles of construction and then testing with experimental data from diverse sources, including the omics fields (such as genomics, transcriptomics, proteomics and metabolomics) and studies in classical genetics, biochemistry, molecular biology and structural biology. If the model holds up to scrutiny, then it can be applied to real-world situations to understand the emergent properties. The model can then also be used to predict how additional or external factors that affect individual components or groups of components within the system will affect the activity of particular parts of the system or of the system as a whole3.

The process of reducing a biological system from its rich natural complexity to a minimal set of interacting factors is a challenging concept, especially when experience in molecular biology tells us that the devil is often in the detail. In addition, to reduce complexity, assumptions must be made about the characteristics of the factors in the model, and this is again an uncomfortable concept for many molecular biologists, who are more used to building hypotheses on the basis of empirical data rather than assumptions. Systems biology is not an immediate or direct answer to the big questions faced by biologists, but rather an integrative and iterative approach that describes a biological system and then allows the gradual introduction of increasing amounts of complexity until the model reflects the system in the natural state. It is then that we can address the big questions, such as whether bacterial virulence can be predicted from genome sequence data.

Recent studies on important bacterial pathogens such as Pseudomonas aeruginosa4, S. aureus5 and Salmonella enterica subsp. enterica6 have identified important virulence genes by comparing the genetic makeup of virulent strains or serovars with that of either less virulent or avirulent strains or serovars. Such studies have greatly expanded our purview of virulence, generating vast amounts of data, but have also demonstrated that the presence or absence of individual virulence genes is not sufficient to predict the overall, or net, virulence of a strain. Examples of disease-specific toxins, such as toxic shock syndrome toxin of S. aureus, might seem exceptions to this rule, as genes encoding these toxins are always present in strains causing this type of infection. However, the presence of such a gene in itself is not indicative of disease outcome, as the same gene is found readily in asymptomatically carried strains. The effect of small genetic changes (for example, SNPs) in effector genes or in their regulators — changes that would be undetectable by PCR or microarray screens — must also be determined. Crucially, the role of epistasis (that is, the effect that mutations in one part of the genome have on the activity of genes elsewhere) must be considered. The effect of epistasis is well established for antibiotic resistance mechanisms7,8,9,10, but as a term it is less commonly associated with the expression of virulence genes in bacteria. However, the very existence of genes encoding global regulators of virulence genes demonstrates that epistasis is likely to have a significant effect on the net virulence of a strain.

To account for epistasis, any systems biology model of virulence must incorporate not only the virulence genes but also the regulators controlling their expression. Unfortunately, it is difficult to assemble gene-regulatory networks from omics data sets with a high level of accuracy because biological systems are often underdetermined. There is a growing number of studies that have constructed transcription-regulatory networks in microorganisms11,12,13,14,15,16,17,18,19,20,21,22,23, but even with large-scale omics data sets, there are usually more possible ways for genes to regulate one another than there are molecules with which to achieve such regulation. As a consequence, mathematical models can only characterize regulatory networks from omics data sets by making limiting assumptions (for example, that co-regulated genes must have similar functions). In addition, these studies typically involve one strain and/or one technique (for example, transcriptomics or proteomics), which also limits the ability of the model to be a general predictor of gene regulation. A good example of a study that begins to address some of these limitations is that of Yoon et al.23, who used both transcriptomic and proteomic data to identify novel proteins secreted by the single serovar S. enterica subsp. enterica serovar Typhimurium through the type III secretion system, and then used standard cellular and molecular biology approaches to verify the activity of these proteins. A good systems biology approach exploits multidisciplinary expertise and techniques to identify the minimum set of biological information needed to explain or define a system.

Although using systems biology methods to understand and predict microbial virulence may seem futuristic, this does not mean that such a goal is not possible. In this Opinion article, we argue that many of the necessary tools have already been developed and that, although the process would be labour intensive, the key to solving this problem lies in selecting more comprehensive scientific approaches that are designed to overcome limiting assumptions. If a model that predicts virulence from a genome sequence is to be built, then a broader perspective that extends from data collection to the construction of a predictive tool is needed. Here, we describe a framework to achieve this with currently available technology and resources, using S. aureus as a model organism.

Staphylococcus aureus as a model organism

S. aureus is an attractive organism with which to build a prototypical predictive model. This bacterium is a major human pathogen, and antibiotic-resistant strains, such as methicillin-resistant S. aureus (MRSA), are emerging worldwide24,25. Health care-associated MRSA (HA-MRSA) has caused problems in health care settings for many decades, but the recent emergence of strains referred to as community-associated MRSA (CA-MRSA)26,27, which cause infections in healthy individuals with no health care contact, is of increasing concern. If we are to develop and implement strategies to successfully treat infected individuals and block transmission to new hosts, we need tools to predict the virulence potential of emerging strains.

The virulence of S. aureus is well defined and is conferred by the activity of many effector molecules that interact directly with the host. These effectors can be grouped into three categories: adhesins28, which facilitate adherence to host tissues; toxins24,26, which cause specific tissue damage to the host; and evasins29,30, which interfere with host immune function. The phenotypes conferred by these factors are determined by the level of expression of the genes encoding them, which is in turn controlled by the activity of the virulence regulatory network. Virulence regulators can be either proteins31 or regulatory RNA molecules32. As more genetically diverse S. aureus strains are being studied, it is becoming increasingly clear that the regulatory networks are not uniform, and this illustrates the importance of understanding the epistatic interactions that occur between virulence regulators and virulence genes. For example, in many HA-MRSA strains, agr (the major regulatory system responsible for the density-dependent switch from the adhesive to the toxic phenotype) is inactive, making these strains more adhesive than toxic33,34. There are many other examples of genes encoding dysfunctional regulators in particular strains (such as sigB (encoding RNA polymerase σ-factor σB)35, saeRS36, sarT37 and sarU37), suggesting that the activity of each member of the regulatory network is likely to be a key factor in the virulence phenotype of an individual S. aureus strain.

The genome sequence databases are growing rapidly for S. aureus strains. Moreover, S. aureus effector molecules and their regulation are largely understood, and the organism is genetically tractable. Together with the general importance of S. aureus to human public health, and the ease with which the bacterium can generate new, successful clones, these factors make S. aureus an ideal model organism for developing a systems biology approach to virulence prediction, as described here.

The framework

The following is a description of a framework to generate a systems biology tool that predicts the virulence of an S. aureus strain from its genome sequence. Although the framework presented here is tailored to S. aureus, it could be applied to any culturable pathogen (Box 1).

Define the phenotypes that differentiate virulent and avirulent strains. The first step towards building a predictive tool is to identify the traits that differentiate virulent and avirulent strains. This can be done using currently available approaches such as omics, genetics, evolutionary genetics, biochemistry, molecular biology and structural biology. For S. aureus, there is a significant amount of data available concerning the different types of virulence phenotype that it displays (the toxic24,26, adhesive28 and evasive29,30 phenotypes outlined above), including the contribution of antibiotic resistance to these phenotypes27,28,29,35,36. There is also a wealth of data linking the expression and activity of these traits in vitro with their activity in vivo25,26,27,38,39,40. For S. aureus, many of these virulence traits can be quantified in multiwell plates, which means that phenotyping hundreds of individual S. aureus strains should be fairly straightforward. For example, adhesion to fibronectin — a trait that is known to contribute to the development of endocarditis and the formation of metastatic abscesses38,41 — can be assayed in 96-well plates in a couple of hours. The cytolytic activity of bacteria can be assayed using immortalized cell lines, also in multiwell formats34. These phenotypes can be clustered into classes that are sufficient to define virulence, and high-throughput assays can be used in this way to determine net adhesiveness, toxicity and evasiveness. These data can then be used to generate virulence indices for individual S. aureus strains, in which a strain could be, for example, highly adhesive, not toxic and moderately evasive.
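As a minimal sketch of how multiwell assay readings might be condensed into a virulence index of this kind, the following Python example averages replicate wells for each phenotype and bins the result into coarse categories. The cut-off values, the normalization to a reference strain and the readings themselves are illustrative assumptions, not values taken from any study.

```python
from statistics import mean

# Hypothetical cut-offs on a 0-1 scale (signal relative to a reference strain).
LOW_CUTOFF = 0.33
HIGH_CUTOFF = 0.66


def classify(score: float) -> str:
    """Map a normalized assay score to a coarse category."""
    if score < LOW_CUTOFF:
        return "low"
    if score < HIGH_CUTOFF:
        return "moderate"
    return "high"


def virulence_index(replicates: dict[str, list[float]]) -> dict[str, str]:
    """Average the replicate wells for each phenotype and classify the mean."""
    return {
        phenotype: classify(mean(wells))
        for phenotype, wells in replicates.items()
    }


# Made-up readings for one strain, already normalized to a reference strain.
strain_a = {
    "adhesiveness": [0.82, 0.78, 0.85],
    "toxicity": [0.10, 0.12, 0.08],
    "evasiveness": [0.45, 0.50, 0.48],
}

print(virulence_index(strain_a))
# {'adhesiveness': 'high', 'toxicity': 'low', 'evasiveness': 'moderate'}
```

In practice the categories could be replaced by continuous scores; the coarse bins are used here only because they mirror the "highly adhesive, not toxic and moderately evasive" style of description.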

The type of statistical analyses used in such a project will depend on the type of data generated (that is, it will be problem driven), but methods such as analysis of variance, principal component analysis and clustered permutation tests can be used to reveal associations between specific virulence indices and disease type and/or severity (details of which are available from the clinical data associated with each isolated strain). The virulence of subsets of these strains can also be measured in animal models that represent specific aspects of disease (for example, sepsis, wound infection or endocarditis) to test these associations. These approaches are well established, so their application to collections of clinical strains, rather than sets of isogenic mutant strains, is only a question of volume.
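To illustrate one of these analyses, the sketch below runs a simple two-group permutation test asking whether strains isolated from severe disease have higher toxicity scores than strains from mild disease. It is a plain (non-clustered) permutation test on the difference in means, chosen for brevity; the scores, group sizes and number of permutations are assumptions made for the example.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_permutations + 1)


# Made-up toxicity scores for strains grouped by clinical severity.
severe = [0.81, 0.77, 0.90, 0.68, 0.85, 0.79]
mild = [0.35, 0.42, 0.28, 0.51, 0.39, 0.44]

p_value = permutation_test(severe, mild)
print(f"estimated p-value: {p_value:.4f}")
```

A permutation test makes no assumption about the distribution of the assay scores, which may be useful when phenotype data from clinical collections are skewed or bounded.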

An illustration of the potential to use virulence phenotypes in vitro to explain disease outcomes in humans comes from two MRSA strains. The CA-MRSA USA300 strain, which corresponds to multilocus sequence type ST8, is known to be highly toxic and to cause a substantial burden of purulent disease in healthy individuals26,27,41. By contrast, an HA-MRSA ST8 clone that is dominant in the United Kingdom and Ireland causes chronic infections in susceptible hosts and has recently been shown to have traded off its toxicity for high levels of antibiotic resistance33,34. These examples demonstrate how differing phenotypes (high or low toxicity) can influence success in different environments (healthy or susceptible hosts) and could therefore be used as predictors of the disease potential, or pathogenicity, of individual strains.

Characterize how the relevant phenotypes are encoded. Gene surveillance studies in S. aureus have been used to make associations between combinations of genes encoding virulence effectors and specific disease capabilities5,39,40, but they have not yet proved robust enough to make predictions about the virulence potential of the strains. A more comprehensive approach, which builds on the previous step of the framework, is to determine the combinations of virulence effector and regulatory genes that contribute to particular virulence phenotypes (toxicity, adhesiveness and evasion) in different strains. Although the regulatory network in each strain is likely to be unique, this network will undoubtedly have elements which are part of a core regulatory network, common to all strains, and these elements can be revealed using advanced omics techniques such as differential network mapping42,43. This network can then be linked using statistical methods to the virulence index of the strain.

An extensive review of the literature has allowed a rudimentary depiction of the core virulence-regulatory network of S. aureus to be built (Fig. 1). The network consists of not only the 20 regulators that are known to have an effect on the virulence phenotype of S. aureus, but also the known effects of these regulators on the activity of other regulators in the network. A preliminary model of this regulatory network can be built using standard techniques. For example, by applying the network identification by multiple regression (NIR)44 method, the functional relationships of all known regulators are first expressed by a system of linear (or nonlinear) differential equations44,45, each describing the change in expression level of each regulator in response to individual perturbations (mutations). The system, or the underlying regulatory network, can then be inferred through multiple linear regression, or other iterative methods (such as MCMC19) that minimize the deviation between model prediction and experimentally determined expression levels.
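The regression step can be sketched as follows, under simplifying assumptions. The example assumes a linear model in which, near steady state, the expression changes x of the regulators satisfy A·x + u = 0 for an applied perturbation u, and it estimates each row of the interaction matrix A by ordinary least squares. The three-regulator network, the perturbation values and the function names are hypothetical, and the sketch omits the step in the published NIR method that restricts the number of inputs per gene.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting.

    Assumes the matrix is square and non-singular.
    """
    n = len(matrix)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def infer_network(expression, perturbation):
    """Least-squares estimate of A in the steady-state model A @ x + u = 0.

    expression[m][j]: expression change of regulator j in experiment m.
    perturbation[m][i]: perturbation applied to regulator i in experiment m.
    """
    n = len(expression[0])
    # Normal-equation matrix X^T X, shared by every regulator's regression.
    gram = [
        [sum(row[j] * row[k] for row in expression) for k in range(n)]
        for j in range(n)
    ]
    network = []
    for i in range(n):
        # Regress -u_i on the expression changes of all regulators.
        rhs = [
            sum(row[j] * -u[i] for row, u in zip(expression, perturbation))
            for j in range(n)
        ]
        network.append(solve(gram, rhs))
    return network


# Hypothetical three-regulator network used to simulate perturbation data.
true_network = [[-1.0, 0.0, 0.5], [0.8, -1.0, 0.0], [0.0, -0.6, -1.0]]
perturbations = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
# Simulated steady-state response to each perturbation: A @ x = -u.
expression = [solve(true_network, [-v for v in u]) for u in perturbations]

recovered = infer_network(expression, perturbations)
for row in recovered:
    print([round(value, 3) for value in row])
```

With noise-free simulated data the recovered matrix matches the one used to generate it; with real expression data, the number and diversity of perturbation experiments would determine how well the coefficients are constrained.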

Figure 1: The known virulence-regulatory network in Staphylococcus aureus.

Inside the circle are all the regulatory genes shown to have an effect on each other and on virulence66,67,68,69,70,71,72,73,74,75,76,77,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96. Outside the circle are the known effects of each regulator on adhesiveness (A), toxicity (T) and evasiveness (E). Much of the data used to generate this image is qualitative. A question mark indicates that there is either no information regarding the direct activity of the regulator or the available information is conflicting.

However, the regulatory network depicted in Fig. 1 is currently limited by the fact that much of the available data have been generated in different laboratories, using different media and different S. aureus strains, at different time points of growth and using different methods (including northern blots, reporter fusions and quantitative reverse-transcriptase PCR). It is therefore difficult to compare these data directly. The network in Fig. 1 is also skewed towards certain regulators, according to their perceived importance and how recently they have been characterized. The data set is also incomplete; the lack of a connecting line between two regulators implies not that there is no interaction between these regulators but rather that these experiments have yet to be carried out. Therefore, the picture of how the regulators interact with each other remains incomplete, and the combinations of regulators that determine the virulence phenotype of each strain have not yet been determined. The network also does not include newly identified regulatory RNA molecules or account for the effects of post-translational modification. Nevertheless, it serves as an illustration of how a robust definition of such a system can be used as a starting point to which additional details and features can be added when their role in virulence is established.

Existing molecular techniques could easily be used to define this system more robustly; for example, constructing a library of isogenic strains in which each regulator is mutated would take approximately 6 months. The effect of each mutation on the genome-wide expression profile of the strain could be determined using RNA sequencing (RNA-seq) technology in approximately 6 months, and using high-throughput assays, the virulence phenotypes of 20 isogenic mutant strains could be determined in less than a week. So, although much of this work would reproduce previous findings, and would therefore be less rewarding, in our opinion it is not beyond current technical capabilities. Network component analysis can then be applied to these data to build a model that represents all the interactions that occur in the system.

In addition to the different combinations of regulators found in different S. aureus strains, sequence variations and polymorphisms in the genes encoding individual regulators must also be considered. Such variability can substantially affect protein activity. For example, for a transcriptional regulator, a sequence alteration in the protein or the encoding gene could affect the abundance of the protein within the cell, the affinity of the target-binding sites, and the activity of the regulator when bound to a target. Bioinformatic analysis of the gene sequences of these 20 regulators (Fig. 1) in ten S. aureus subsp. aureus strains (MRSA252 (Ref. 46), Newman47, USA300 (Ref. 48), NCTC 8325 (Ref. 49), COL50, TW20 (Ref. 51), MSSA476 (Ref. 46), MW2 (Ref. 52), Mu50 (Ref. 53) and N315 (Ref. 53)) reveals a wide range of sequence variability between strains (Fig. 2). The most variable gene is agrD, which shows only 57% identity between strains N315 and MRSA252. At the other extreme, only sarA and sarR are 100% identical across all ten strains, suggesting that they are under extreme stabilizing selection. For all the other regulatory genes tested, the sequence identity is high across the ten strains (Fig. 2). SarS serves as a good illustration of how two nucleotide changes in the gene can significantly affect protein activity and how structural information can greatly inform this approach (detailed in Box 2). Other approaches, such as network component analysis and regulatory linkage analysis54,55,56, can be applied to characterize potential changes in protein activity as a result of SNPs in genes encoding transcription factors, as has been done previously in Saccharomyces cerevisiae56. These potential changes can be further verified by molecular techniques (such as expression of protein variants in null backgrounds followed by an assessment of protein activity) and fed into the mathematical description (that is, the model) of the regulatory network.

Figure 2: Staphylococcus aureus sequence variability.

The sequence variability within virulence-regulatory genes across ten Staphylococcus aureus strains. Pi is the probability that nucleotides in a gene differ between individuals.
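For readers unfamiliar with this statistic, the following sketch shows one common way to compute nucleotide diversity (Pi) as the average proportion of differing sites over all pairs of aligned sequences. The short sequences are invented for illustration and are not real S. aureus alleles; skipping gapped positions is one possible convention and is an assumption of this example.

```python
from itertools import combinations


def pairwise_difference(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned, ungapped sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not compared:
        raise ValueError("no ungapped sites to compare")
    return sum(a != b for a, b in compared) / len(compared)


def nucleotide_diversity(sequences: list[str]) -> float:
    """Average pairwise difference per site across all pairs of sequences."""
    pairs = list(combinations(sequences, 2))
    return sum(pairwise_difference(a, b) for a, b in pairs) / len(pairs)


# Invented aligned alleles of one gene from four hypothetical strains.
alleles = [
    "ATGACGTTAA",
    "ATGACGTTAA",
    "ATGTCGTTAA",
    "ATGTCGATAA",
]

print(f"Pi = {nucleotide_diversity(alleles):.4f}")
# Pi = 0.1167
```

A gene that is identical in every strain gives a Pi of zero, which is the pattern described above for sarA and sarR.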

To fully account for this variability, and for existing systems biology models to be developed further, the data sets need to be expanded to include full genomic coverage. For S. aureus, at least, this should be possible, as large global collections of S. aureus strains are currently being sequenced57,58. The quality of the sequencing and the clinical data associated with each strain will be crucial if we are to make robust genome-wide associations between genome, virulence and disease outcome. But as genome sequencing is becoming faster and cheaper, such studies should become more common, providing a wealth of sequence data from which the variability in the virulence-regulatory network can be determined and indexed. This will facilitate indexing of the regulatory network, or the specific combination of regulators and their variability, for individual strains, and this index can then be linked to the virulence index.

Model validation and testing. When the virulence phenotypes have been characterized and how they are genetically encoded is known, the causal relationship between gene sequence and virulence can be examined using statistical approaches such as structural equation modelling (SEM)59 or perturbed-signalling-network modelling60. SEM differs from traditional linear statistical approaches in that it can examine complex pathways, for example, the influence of variable A on variable C through its influence on variable B. In our case, the aim is to model the effect of gene sequence on virulence through its influence on virulence phenotypes. SEM allows for the estimation of latent (that is, unmeasured) variables, which can be used to determine whether all of the phenotypes that contribute to virulence have been identified. Provided the research team has the appropriate mathematical expertise, we estimate that building the preliminary model would take approximately 12 months.
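The A-to-B-to-C idea can be illustrated with a much simpler path analysis built from two ordinary regressions. This is only a sketch of the indirect-effect concept, not a full SEM with latent variables, which would normally require dedicated software. Here A is the presence of a hypothetical regulator variant, B is a toxicity score and C is a virulence measure; the data are constructed so that the effect of A on C runs entirely through B.

```python
from statistics import mean


def covariance(xs, ys):
    """Sample covariance of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def path_effects(a, b, c):
    """Estimate the paths A -> B, B -> C (given A) and A -> C (given B)."""
    var_a, var_b = covariance(a, a), covariance(b, b)
    cov_ab, cov_ac, cov_bc = covariance(a, b), covariance(a, c), covariance(b, c)
    # Regression of B on A.
    a_to_b = cov_ab / var_a
    # Regression of C on A and B together (two-predictor normal equations).
    denom = var_a * var_b - cov_ab ** 2
    b_to_c = (cov_bc * var_a - cov_ac * cov_ab) / denom
    direct = (cov_ac * var_b - cov_bc * cov_ab) / denom
    return {
        "a_to_b": a_to_b,
        "b_to_c": b_to_c,
        "direct": direct,
        "indirect": a_to_b * b_to_c,
    }


# Constructed data: variant present (1) or absent (0) in eight strains.
variant = [0, 0, 0, 0, 1, 1, 1, 1]
noise = [0.1, -0.1, 0.2, -0.2, 0.1, -0.1, 0.2, -0.2]
toxicity = [1 + 2 * v + e for v, e in zip(variant, noise)]
virulence = [0.5 + 3 * t for t in toxicity]

effects = path_effects(variant, toxicity, virulence)
for name, value in effects.items():
    print(f"{name}: {value:.3f}")
```

In this constructed case the direct path from variant to virulence is close to zero and the indirect path is close to 6, which is the kind of decomposition that distinguishes a mediated effect from a direct one.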

As mentioned above, the model of the regulatory network, which includes all the variability that exists, together with its effect on virulence, can be built from first principles with minimal complexity, that is, by initially including only known interactions. Importantly, however, robust validation of the resulting systems biology-based models is crucial. Although an initial model can be constructed using data from a set of 'starter' strains, this model must be validated through iterative cycles of testing against data from an independent set of 'tester' strains. To do this, the regulatory index of a strain must be determined from the genome sequence. The predicted virulence index can then be compared to the actual virulence index, as measured empirically using the same assays that were used to define the index of the starter strains (for example, toxicity, adhesion and evasion). In addition to testing the predictive power of the model, this step will also help identify previously uncharacterized factors. If the predictive power is found to be poor (for example, only accurate for 50% of the tester strains), the genome sequences of the strains that do not fit the model should be analysed to identify any common factors that may explain this deviance. These factors could include the presence or absence of regulatory genes or small RNA molecules that are not currently considered in the model; the presence of specific SNPs in regulatory loci; the presence or absence of dominant effector molecules (for example, phenol-soluble modulin (PSM)-mec61, a small secreted cytolytic molecule that is encoded by the psm-mec locus and is believed to contribute to the virulence of CA-MRSA); or the presence of small encoded peptides that can be missed with current bioinformatic algorithms.
When such common factors are identified, the effect of these factors on the regulatory network and on the virulence index can be determined empirically (that is, the gene can be mutated and the change in phenotype assayed) and then incorporated into the model. The refined model will then need to be verified with another independent set of 'tester' strains, followed by testing on new strains until the predictive power of the model is at a satisfactory level. The difference in the predictive success of the model for the first set of strains and for the final set can be used as a benchmark of progress.
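The comparison of predicted and measured indices for a tester set could be sketched as below. The strain names, the index values and the representation of an index as an (adhesiveness, toxicity, evasiveness) tuple are all hypothetical; the point is only to show how predictive accuracy and the list of poorly fitting strains might be obtained for each validation cycle.

```python
Index = tuple[str, str, str]  # (adhesiveness, toxicity, evasiveness)


def validate(predicted: dict[str, Index], measured: dict[str, Index]):
    """Return the fraction of strains predicted correctly and the misfits."""
    shared = sorted(predicted.keys() & measured.keys())
    if not shared:
        raise ValueError("no strains in common between the two data sets")
    misfits = [s for s in shared if predicted[s] != measured[s]]
    accuracy = 1 - len(misfits) / len(shared)
    return accuracy, misfits


# Hypothetical tester strains: model output versus assay results.
predicted = {
    "tester_1": ("high", "low", "moderate"),
    "tester_2": ("low", "high", "low"),
    "tester_3": ("moderate", "high", "high"),
    "tester_4": ("high", "low", "low"),
}
measured = {
    "tester_1": ("high", "low", "moderate"),
    "tester_2": ("low", "moderate", "low"),
    "tester_3": ("moderate", "high", "high"),
    "tester_4": ("high", "high", "low"),
}

accuracy, misfits = validate(predicted, measured)
print(f"accuracy: {accuracy:.0%}; strains to re-examine: {misfits}")
# accuracy: 50%; strains to re-examine: ['tester_2', 'tester_4']
```

Recording the accuracy for each successive tester set gives the benchmark of progress described above, and the misfit list identifies the genomes to examine for unmodelled factors.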

Summary. There is already a considerable amount of data concerning the different virulence phenotypes displayed by S. aureus24,25,26,27,62. We also have a good understanding of how these phenotypes are regulated, and we are aware of the large amount of variation among the regulators and that this has important effects on the virulence phenotype of a strain. What we do not yet have is a detailed, robust and cross-comparable model of this virulence-regulatory network. Although this network is currently underdefined and improvement will be labour intensive, a more predictive model is not beyond current technical capabilities. With genome-wide transposon libraries of S. aureus strains becoming readily available (see the Functional Genomics Explorer of the Center for Staphylococcal Research at the University of Nebraska Medical Centre (UNMC), USA; Further information), the construction of mutants for such studies is no longer a limiting factor. What is perhaps most exciting is that genome sequencing, which will provide the data to allow such a project to come to life, is already underway57,58.

Can this be applied to other bacteria?

Several recent reviews have described the application of systems biology methods that, in the absence of epistasis, should be sufficient to map gene sequence to virulence63,64. We believe that these models could be improved by incorporating an understanding of how the genes interact with each other. Recent evidence suggests that problems such as functional redundancy, as well as problems caused by diverse combinations of genes resulting in similar phenotypes, apply to many bacterial pathogens of humans, including Mycobacterium tuberculosis7, S. enterica8, Escherichia coli9 and Pseudomonas aeruginosa10. We propose that these problems can be overcome by applying systems biology methods to many isolates, carefully validating these methods for the relevant species, and then using the resulting models to identify and predict the gene combinations that lead to specific virulence phenotypes and to predict the traits of a strain from its genome sequence alone. Although this type of project is likely to be challenging and will require the efforts of teams of scientists, the framework we outline here should prove useful for any microbial pathogen. Similar programmes of work are already underway, such as the Systems Biology Program for Infectious Disease Research3 (funded by the National Institute of Allergy and Infectious Diseases, US National Institutes of Health), which is focusing on M. tuberculosis, influenza virus, severe acute respiratory syndrome coronavirus (SARS-CoV), Salmonella spp. and Yersinia spp., with the aim of shifting the paradigm of host–pathogen research and developing new ways to control these human pathogens3.


In the 17 years since the first bacterial genome was sequenced1 and the 12 years since systems biology was first launched as an experimental approach65, vast amounts of data have been generated that have provided a deeper insight into some biological systems. However, we do not yet have the ability to predict the virulence of a bacterial strain from its genome sequence. This limitation has many other contributory factors beyond those addressed in this article. Host susceptibility is a key factor in precipitating disease. Other factors such as intra- and interspecies competition during colonization and infection can also affect disease severity66,67,68,69,70,71,72,73,74,75,76,77. Nevertheless, despite the plethora of complicating factors, we believe that the approach outlined here provides a first step towards linking bacterial virulence to gene sequence using existing technologies. As it is rapidly becoming as cost effective to sequence the genome of an infecting strain as it is to send the strain to a routine diagnostics laboratory for identification and antibiotic resistance profiling, we need to find ways to interpret and make use of the sequence data obtained. Although sceptics might argue that the potential for systems biology to be used to predict virulence will not be reached for decades, in this Opinion article we have illustrated how this might be achieved for S. aureus using existing data and technology, and we believe that these tools can be built within the next 5–10 years. The framework presented here can be applied to any microorganism, but it will require multidisciplinary teams using large and diverse data sets and appropriate model validation. We think that the considered application of systems biology to understanding and predicting virulence could potentially revolutionize the way that existing and emerging global pathogens are investigated and controlled.