Main

The accurate classification of pathogens with epidemic potential can optimize communicable disease control and reduce associated costs1,2. Recognition of the usefulness of rapid genotyping for this purpose has led to a call for closer interplay between epidemiological surveillance and disease-management strategies3. The application and interpretation of genetic typing in clinical and epidemiological studies requires not only an understanding of the typing techniques involved, but also efficient integration of the results into clinical and public health decision-making4,5.

Clinical genomics and bioinformatics have been dominated by eukaryotic paradigms in which genomic rearrangements typically denote dysfunction. However, prokaryotic genomes, particularly those of bacteria, have a mosaic structure and can vary significantly, even within a species; it remains unclear, therefore, how microbial genomic data should be processed so that they are easy to interpret, accessible and easy to share. There is a growing mismatch between the volume of microbial genome data available and the ability to automate its systematic analysis and interpretation6,7. In this Perspective we outline selected approaches to the translation of pathogen genotyping and microbial genomics into formats that can be incorporated into communicable disease management, surveillance and control. Further, we introduce the concept of pathogen profiling as a tool for disease management in public health.

Moving beyond the phenotype

Pathogen profiles. Analysing the dynamics of infections that have epidemic potential relies on the accurate demarcation and identification of individual strains or epidemic clones, together with the identification of specific virulence factors and other validated markers. Together, this information can be consolidated into a pathogen profile, which comprises information derived from traditional phenotype-based methods, such as bacterial culture identification (often based on biochemical properties and antibiotic resistance), and other information, such as that derived from nucleic-acid-based techniques. Nucleic-acid-based techniques include various high-throughput epidemiological typing methods that have the capacity to simultaneously identify and analyse multiple selected regions within a given pathogen genome and are relatively new to mainstream clinical microbiology8,9.

The argument that a species-based description of pathogens has inherent limitations is not new. Many bacterial species contain different strains that are associated with distinct clinical features and epidemiology, and which cannot be distinguished by traditional means4,10. Strains of the same species can vary by as much as 35% in either the complement or number of unique genes present and sometimes have significant variation within individual genes. For example, the sizes of the Escherichia coli and Salmonella enterica chromosomes can vary by more than 1 Mb and 300 kb, respectively11, and most bacterial species are a mosaic of different subpopulations. In many bacteria the characteristics that determine pathogenicity for hosts are encoded on mobile genetic elements that are transferred between strains at different rates. Organizing bacterial strains into clonal complexes rather than traditional species groupings is therefore often more relevant to clinicians and is better suited to epidemiological analyses. For example, the diversity of hundreds of distinct Campylobacter jejuni strains, as defined by multilocus sequence typing (MLST), is represented by 17 clonal complexes, six of which comprise more than 60% of the strains isolated from human campylobacteriosis12.

The heterogeneity of pathogens, hosts and the environment means that no single characteristic can adequately reflect the clinical and epidemiological complexity of infection or reliably predict the outcome(s). The systematic construction of pathogen profiles from a combination of genomic or other 'omic' markers, in a manner that enables data to be integrated and shared, is essential for successful surveillance and disease management13. Consider, for example, an infection that is potentially caused by several different strains of the same species, each of which has a different set of virulence factors that can be distinguished by genotyping. If the optimal management strategies varied for infections caused by different subtypes, then rapid subtype identification would optimize disease management. For example, antibiotic-resistant strains of Mycobacterium tuberculosis, detection of which indicates potential therapeutic failure, can be identified using genetic markers2,8. Similarly, evidence from the monitoring of HIV or hepatitis C virus (HCV) infections supports this approach14,15 (Boxes 1,2).

Table 1 Classes of determinants for pathogen profiling

Profile attributes. A pathogen profile is a single, multivariate observation (or set of observations) that is composed of classes of specific attributes, for example, genome, transcriptome, proteome or metabolome data, which are designed to allow interrogation of existing (or future) databases (see Further information; Table 1), and integration with clinical observations and patient outcomes (Fig. 1). The profile can indicate the probability that a specific marker is associated with a clinically relevant phenotype, such as in vivo antimicrobial resistance or high transmissibility. This information would allow classification of strains into risk groups for either treatment failure or a propensity to cause outbreaks. It is often important to also capture quantitative information about a pathogen in vivo, for example, viral or bacterial loads and their units of measurement.

Figure 1: Interaction of the different 'omes' in a microbial cell.

Each 'ome' is a complex function of the other 'omes', and the amount of integration increases from the bottom to the top.

In contrast to traditional subtyping, which is based on phenotypic characteristics such as serotype, biotype, phage type or antimicrobial susceptibility, genetic profiling describes the phenotypic potential in the nucleic acid sequence. Genotyping systems that are based on comparison of sizes and numbers of different DNA fragments separated by gel electrophoresis, whether pulsed-field gel electrophoresis (PFGE) or nucleic acid amplification-based typing methods such as restriction fragment length polymorphism (RFLP) or random amplified polymorphic DNA (RAPD) analysis, have been less reliable than direct sequence-based methods, owing to a lack of precision and reproducibility16. Sequence-based typing is an example of a direct method of assessing nucleic acid sequence, whereas RAPD, plasmid fingerprinting and PFGE are examples of indirect methods. All of these methods provide both strain typing and phylogenetic data2,17,18 that can be processed using sequence alignment and clustering techniques and are amenable to standardization and database cataloguing. The derived information often correlates well with clinically relevant phenotypic characteristics, such as virulence19,20,21. Typing systems that use markers with specific or binary values, including MLST, are more reproducible and are therefore more appropriate for pathogen profiling19,20. Such typing systems enable classification of pathogens that are relevant to the investigation of chains of infection transmission and are useful tools for studies of global epidemiology18. Detailed descriptions of molecular typing techniques that are used for epidemiology studies can be found elsewhere20,21,22.

Selection of attributes. The choice of attributes used to construct a profile depends on the clonality of the species, the function, diversity and rates of change of chosen genes, and their clinical or public health relevance. As a rule, microbial profiles should include key molecular markers that are potentially associated with specific patient outcomes or risk factors, and antimicrobial resistance markers. Profiles of different types of viruses and bacteria can differ significantly as there is no unique or common template or genotyping method that can capture all of the attributes required to describe all types of microorganisms. Some genome profiling techniques are based on conserved genes — genes that are associated with metabolism or other 'housekeeping' functions — whereas others target variable genes that are often associated with virulence20. Virulence determinants are frequently present on transferable genetic material, such as plasmids, pathogenicity islands and bacteriophages, with genetic histories and dynamics distinct from those of the conserved genes of the host bacterial population.

The specific disease and the type of control measures influence both the clinical relevance and discriminatory power of the typing system that is used for profiling and the level of statistical significance that is required to identify clustering23. Microbial genotyping alone might not always be the correct classification method, as outbreaks are occasionally caused by several different agents rather than a single, virulent clone; for example, sewage contamination of water or food could cause an outbreak of diarrhoea. Therefore, a combination of genomic and phenotypic microbial characteristics, together with comparison of genotypic clusters with those identified by epidemiological investigations, is important in outbreak investigations. Using a combination of methods can enhance the discriminatory power and precision of microbial profiling24 and might be required to define genotypes that are composed of conserved and variable portions of the genome, but would increase the cost and the complexity of data interpretation and sharing.

The task of defining which information to include in the pathogen profile is non-trivial and is becoming even more complex as the number and scope of molecular typing methods increase and as these methods are linked with treatment and public health decisions25. The nature of clinical reports of antimicrobial resistance illustrates this problem26. Currently, clinical microbiologists usually report the pathogen name and antibiotic susceptibilities, but few, if any, other details. In future, routine reports could include predictive prognostic markers such as a calculated post-test probability based on the pre-test information. For example, interpretative reports of antiretroviral susceptibility testing might include information about mutations and cumulative sensitivity scores to rank the likely efficacy of individual drugs and combinations27.
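A post-test probability of the kind mentioned above can be derived from the pre-test probability and the likelihood ratio of the test result. The following is a minimal sketch in Python, assuming a hypothetical assay; the function name and the sensitivity and specificity values in the usage note are illustrative and are not taken from any study cited here.

```python
def post_test_probability(pre_test: float, sensitivity: float,
                          specificity: float, positive: bool = True) -> float:
    """Bayes' rule in odds form: post-test odds = pre-test odds x likelihood ratio."""
    for name, value in (("pre_test", pre_test), ("sensitivity", sensitivity),
                        ("specificity", specificity)):
        if not 0.0 < value < 1.0:
            raise ValueError(f"{name} must lie strictly between 0 and 1")
    if positive:
        # Likelihood ratio of a positive result
        lr = sensitivity / (1.0 - specificity)
    else:
        # Likelihood ratio of a negative result
        lr = (1.0 - sensitivity) / specificity
    pre_odds = pre_test / (1.0 - pre_test)
    post_odds = pre_odds * lr
    return post_odds / (1.0 + post_odds)
```

For instance, with an assumed pre-test probability of 0.1 and an assay that is 90% sensitive and 90% specific, `post_test_probability(0.1, 0.9, 0.9)` gives a post-test probability of about 0.5 after a positive result.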

A pathogen profile is a synthesis of different markers and clinical end-points that can be extracted from medical charts and that characterize an individual patient's clinical and public health outcomes. The profile can be heuristic, when only a single genetic marker is associated with a specific patient outcome; however, greater insight can be achieved when attributes from different levels of the biological hierarchy (that is, gene detection, gene expression, metabolite profiles and so on) corroborate and complement each other. Large-scale genotyping generates valuable information that can be translated into databases to search for strain-specific epidemiological markers or to construct an evolutionary history of strains for a particular epidemiological catchment area. This objective is greatly simplified if the genomic data are categorized, archived and electronically portable so as to facilitate access, retrieval and comparisons. The task of designing, capturing and correlating pathogen profiles can be assisted by the development of a standards-based representation of attributes and pathogen-specific ontologies.

The medical and cost benefits of highly integrated, comprehensive disease-control programmes that include routine microbial genotyping have been demonstrated28,29, yet incorporating multiple data sources remains a technical challenge16. The need for models that define data elements in communicable disease informatics, and the relationships between them, has been identified30,31. Microbial profiles provide data models with discrete elements amenable to standardization. Figure 2 illustrates such a data model by demonstrating the relationships between meticillin-resistant Staphylococcus aureus (MRSA) as a concept (object) and the determinants of its pathogen profile. However, the vocabulary of profiling data (the words or individual components), syntax (the 'sentence' structure) and messaging protocols are yet to be developed. Healthcare vocabularies such as the UMLS (Unified Medical Language System, National Library of Medicine), LOINC (Logical Observation Identifier Names and Codes, Regenstrief Institute) and SNOMED (Systematised Nomenclature of Medicine, College of American Pathologists)32,33 provide integration mechanisms for high-level terms used in medical charts (for example, tuberculosis) with the relatively low-level terms used in the clinical laboratory (for example, Mycobacterium tuberculosis Beijing Family spoligotype).

Figure 2: Relationships between MRSA as a concept (object) and determinants of the pathogen profile.

This data model defines major classes of attributes for an MRSA profile (for example, genotyping methods, virulence factors and clinical outcomes) and relationships between them. blaZ, β-lactamase gene; dfrA, trimethoprim resistance gene; Ent, enterotoxin; erm, macrolide resistance gene; Et, exfoliative toxin; femA, gene encoding a cytoplasmic protein necessary for the expression of meticillin resistance; Luk-PV, Panton-Valentine leukocidin; mecA, gene encoding PBP2a, the low-binding-affinity penicillin-binding protein that mediates meticillin resistance; MRSA, meticillin-resistant Staphylococcus aureus; SCCmec, staphylococcal cassette chromosome mec; spa, staphylococcal protein A gene type; ST, sequence type; tetK, tetracycline resistance gene; tst, staphylococcal toxic shock toxin gene; vanA, vanB, vanC, vancomycin resistance genes.
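A data model of this kind can be expressed as a typed record. The sketch below, in Python, is one hypothetical representation: the class name, field names and grouping are assumptions made for illustration and do not reproduce the structure of Figure 2 or of any published schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MRSAProfile:
    """Hypothetical container for the attribute classes of an MRSA profile."""
    isolate_id: str                        # local laboratory identifier (assumed)
    sequence_type: Optional[str] = None    # MLST sequence type (ST)
    spa_type: Optional[str] = None         # staphylococcal protein A gene type
    sccmec_type: Optional[str] = None      # SCCmec cassette type
    # Marker name -> detected or not, for example mecA, blaZ, erm, tetK
    resistance_genes: Dict[str, bool] = field(default_factory=dict)
    # Marker name -> detected or not, for example Luk-PV, tst
    virulence_genes: Dict[str, bool] = field(default_factory=dict)
    clinical_outcome: Optional[str] = None  # free text or a coded term

    def detected(self) -> List[str]:
        """Return the names of all markers recorded as present, sorted."""
        merged = {**self.resistance_genes, **self.virulence_genes}
        return sorted(name for name, present in merged.items() if present)
```

Keeping each class of attribute in its own field makes it straightforward to map the record onto a coded vocabulary or an exchange format later.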

Successful initiatives that have focused on common interchange standards in genomics and proteomics, such as minimum information about a microarray experiment (MIAME), minimum information requested in the annotation of biochemical models (MIRIAM)34 and minimum information to describe a proteomic experiment (MIAPE)35,36, should be informative in the push to integrate databases in the management of disease. These projects have introduced formats to enable the unambiguous interpretation of results and aim to ensure that experimental results in genomics, proteomics and metabolomics are deposited in public databases before publication, as has already been long established for nucleotide sequences. The Pathogen Information Markup Language (PIML) has also been recently introduced to enhance the interoperability of microbiology datasets for pathogens with epidemic potential31 by capturing the data elements that describe determinants of pathogen profiles.

Matching profiles. Once a profile has been constructed for a strain, it can be matched with those of others or with existing datasets using similarity measures and clustering techniques (see Supplementary information S1 (box) for a list of microbial databases). Sequence similarity or genotype matching of microorganisms implies a common lineage rather than a unique identity, in contrast to eukaryotic DNA matching. Different distance functions for phylogenetic assessments and clustering algorithms have been applied to reveal or compare microbial patterns in bacterial or viral fingerprints (for example, Euclidean distance or Pearson correlation, index of diversity, approximate matching heuristics and information theoretic similarity measures)37,38. For example, Simpson's index of diversity estimates the probability that two unrelated strains will be placed into two different typing groups38. The closer this index is to 0, the higher the chance that two unrelated isolates will share the same type; values close to 1 indicate a highly discriminatory typing method.
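Simpson's index of diversity, as commonly applied to typing systems, is D = 1 - [Σ n_j(n_j - 1)] / [N(N - 1)], where N is the number of isolates and n_j is the number of isolates assigned to the j-th type. A minimal Python sketch follows; the function name and the type labels used in the usage note are illustrative assumptions.

```python
from collections import Counter
from typing import Hashable, Iterable


def simpsons_index_of_diversity(types: Iterable[Hashable]) -> float:
    """Probability that two isolates drawn without replacement have different types."""
    counts = Counter(types)
    n = sum(counts.values())
    if n < 2:
        raise ValueError("at least two isolates are required")
    # Number of ordered pairs of isolates that share a type
    same_type_pairs = sum(c * (c - 1) for c in counts.values())
    return 1.0 - same_type_pairs / (n * (n - 1))
```

Four isolates split evenly between two types, `simpsons_index_of_diversity(["A", "A", "B", "B"])`, give an index of about 0.67, whereas a method that assigns every isolate to the same type gives 0.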

Alternatively, the level of reported similarity between sequences, which can indicate biological relationships, can be measured as E values (expect values), which range from 0 (100% identity), or close to 0, to larger numbers, which indicate lower similarity. The relatedness of isolates can be visualized using dendrograms that are based on the unweighted pair group method with arithmetic mean (UPGMA) for small numbers of isolates, or using clustering, for example with eBURST, for larger datasets39. The eBURST algorithm, which was developed for the interpretation of MLST results, first identifies mutually exclusive groups of related genotypes in the population, then identifies the founding genotype of each group, predicts the descent of other genotypes from the founder, and shows the output as a radial diagram centred on the predicted founding genotype. The computational power required and the confidence limits used depend on the number of markers and their diversity within and among species, and the number of representative samples. Computational pattern matching and validation techniques have received little attention in the biomedical literature so far40,41.
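The grouping and founder-prediction steps described above can be illustrated with a simplified sketch in Python. It is not the eBURST implementation: it assumes that groups are formed by linking sequence types that differ at exactly one locus (single-locus variants), that the founder is the member with the most single-locus variants, and that ties are broken by name; bootstrap support and the radial diagram are omitted. All function names and example profiles are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Profile = Tuple[int, ...]  # allele numbers at each MLST locus, equal length assumed


def hamming(a: Profile, b: Profile) -> int:
    """Number of loci at which two allelic profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def clonal_groups(profiles: Dict[str, Profile]) -> List[Dict[str, object]]:
    """Group sequence types linked by single-locus variants and predict founders."""
    sts = sorted(profiles)
    slv = {st: [o for o in sts
                if o != st and hamming(profiles[st], profiles[o]) == 1]
           for st in sts}
    seen = set()
    groups: List[Dict[str, object]] = []
    for st in sts:
        if st in seen:
            continue
        # Depth-first walk over single-locus-variant links
        stack, members = [st], []
        seen.add(st)
        while stack:
            current = stack.pop()
            members.append(current)
            for neighbour in slv[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        members.sort()
        founder: Optional[str] = None
        if len(members) > 1:
            # Founder: member with the most single-locus variants
            founder = max(members, key=lambda m: len(slv[m]))
        groups.append({"members": members, "founder": founder})
    return groups
```

Singletons, which have no single-locus variants in the dataset, are returned as groups of one with no founder.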

Uses of pathogen profiling

Knowledge discovery from databases. Although the number and range of data relevant to microbial profiles have increased, they do not characterize the entire phenotype of a pathogen in an environmental or experimental context. Linking systematically annotated profiles with clinical and research databases can identify previously unrecognized associations between phenotype, genotype, environment and host responses and, potentially, the specific genes that govern them42. Functionally linked genes or proteins have been identified by examining connections between them, using computational methods like the Rosetta Stone43,44, Phylogenetic Profile45 or Operon46. Networks, created by relationships among phenotype, disease expression, environment and experimental context and associated genes with differential expression, could provide new insights into microbial interactions and pathogenesis47,48,49. This approach has been fruitful in metagenomics50 and information management systems designed to assist with genotyping or functional genomics are now being developed51,52. For example, in silico analyses that combine molecular phylogeny and targeted sequencing have identified possible target genes for antimalarial treatment53 and predicted candidate antigens for vaccine development (reverse vaccinology54).

A great deal of data that are relevant to microbial profiling already exist. Public electronic bacterial typing databases such as MLSTNet, PulseNet, the BioPortal and SPOTCLUST, among others, use web-based formats that allow universal access and matching of bacterial or viral isolates to each other and to those represented in databases. More recently, structured polymorphism databases have been built, yet data sharing and integration remain difficult, due to the lack of common structures47,55. Several hundred public domain molecular biology databases are currently online but few contain raw data. Most represent the efforts of individuals to organize, annotate and interpret data from other sources. These databases are highly valued and are increasingly expected to replace paper publication as the medium of communication46. Some are classification databases (for example, the Staphylococcus aureus spa typing tool or the SPOTCLUST database for Mycobacterium tuberculosis genotyping). Critical factors that distinguish the best databases include networks of subscribers willing to share data, the availability of statistical algorithms to analyse these data and the quality of the curation process.

MLST and PulseNet are good examples of advanced databases. At the core of the MLST concept is the provision of freely accessible nucleotide-sequence databases, which function as a common dictionary to enable direct comparison of bacterial isolates without requiring the physical exchange of cultures. In this sense they provide the basis of a common language for bacterial typing45. In contrast to archival databases such as GenBank, MLST databases are curated for accuracy. To overcome some limitations of the first MLST stand-alone web sites, a new network-based database (MLSTdB-Net) has been implemented with more than 30 MLST schemes, for different bacterial species. It is hosted at 33 websites to ensure greater computational power and better analytical performance. Some of the MLST websites allow researchers to run and curate their own schemes remotely. The PulseNet system, which is based on PFGE patterns, is the most developed system for the characterization of bacterial isolates with a fingerprinting approach. It is one of the few networks that integrate epidemiological and typing data over wide geographical regions45,50.

Antimicrobial therapy optimization. The great diversity of mutational patterns contributing to antimicrobial resistance complicates the choice of optimal therapies. A range of bioinformatics tools, which are designed to predict drug resistance or response to therapy from genotype, have been developed to provide clinical support. These tools use either a statistical approach, in which the inferred model and prediction are treated as regression problems, or machine learning algorithms, in which the model is treated as a classification problem17. A statistical learning approach to ranking of therapeutic choices often relies on a direct correlation between baseline microbial profile, the therapeutic decision and response to treatment, for example, expected reduction in viral load resulting from anti-HIV combination therapy (Box 1). Several susceptibility scores have been used for combination antiretroviral therapy that take into account specific resistance mutations and add up the activities of individual drugs in the regimen27,56,57. Computer-assisted therapy is an attractive way to reduce the complexity of prescribing antimicrobial combinations. It highlights the need for databases that can be widely shared, and that allow correlation of quality-controlled data from genotypic resistance assays and treatment regimens with short- and long-term clinical outcomes. Differences in antimicrobial sensitivities reflect variation in amino-acid composition of resistant microorganisms, but simply counting mutations is not enough to detect most of the functional differences that affect treatment outcomes. The data links between laboratory and clinical databases will unlock the full utility of microbial profiles.
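An additive susceptibility score of the kind described above can be sketched as follows in Python. The rule table, drug names, mutation names and penalty values are entirely hypothetical placeholders; real interpretation systems use curated, drug-specific rules that are not reproduced here.

```python
from typing import Dict, Iterable, Set

# Hypothetical rule table: drug -> {mutation: penalty subtracted from full activity 1.0}
RULES: Dict[str, Dict[str, float]] = {
    "drug_A": {"mut_1": 1.0, "mut_2": 0.5},
    "drug_B": {"mut_3": 0.5},
    "drug_C": {},
}


def drug_activity(drug: str, mutations: Set[str],
                  rules: Dict[str, Dict[str, float]] = RULES) -> float:
    """Predicted activity of one drug, between 0 (inactive) and 1 (fully active)."""
    penalty = sum(p for mutation, p in rules[drug].items() if mutation in mutations)
    return max(0.0, 1.0 - penalty)


def regimen_score(regimen: Iterable[str], mutations: Set[str],
                  rules: Dict[str, Dict[str, float]] = RULES) -> float:
    """Cumulative score: the sum of the predicted activities of the drugs in a regimen."""
    return sum(drug_activity(drug, mutations, rules) for drug in regimen)
```

Candidate regimens can then be ranked by `regimen_score`. As noted above, a purely additive score does not capture interactions between mutations, so it is best read as a first approximation.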

Efficiency in outbreak investigation and disease monitoring. The genetic signatures of pathogens enrich the accuracy and predictive power of laboratory experiments2,3. Microbial typing can confirm or refute putative epidemiological links among and between cases and potential environmental sources, and therefore might trigger public health investigations. Alternatively, typing studies can demonstrate that putative clusters are unrelated and so rule out the need for further action. However, the usefulness of pathogen profiling goes beyond specific questions related to the investigation of possible outbreaks. It can also be used for disease monitoring, by identifying transmission and associations between microbial types and clinical outcomes41. Molecular profiling can assist in the assessment of the reproductive number (R0) of an infectious organism during epidemics, in making infection control policies more organism-specific41 and in predicting clinical outcomes. For example, multiple isolates of the same pathogen that have indistinguishable profiles, which are highly clustered in time and space, would suggest an outbreak and trigger an epidemiological investigation supplemented by a social network analysis of patients involved. This could potentially identify a 'superspreader' — an individual who is responsible for 80% of transmission events58. Evidence suggests that, for some infections such as severe acute respiratory syndrome (SARS) that have epidemic potential, public health control strategies that are focused on 'superspreaders' would be three times more effective than the random interventions currently used58.

Molecular typing also facilitates the detection of chains and patterns of infection transmission and the construction of epidemic trees3. For example, by distinguishing tuberculosis (TB) due to recent infection from reactivation, typing allows the assessment of current rates of active transmission in a community and hence guides appropriate control efforts. Molecular typing has led to a reassessment of the role of casual contacts in the transmission of TB59. Specifically, a two-stage TB contact tracing strategy, based on clustering of genetically related M. tuberculosis isolates, can improve the identification of epidemiological links and prevent more cases of secondary infections in low prevalence settings, and therefore augment traditional contact tracing59,60. This capacity of pathogen profiling is especially important as changes in contact patterns often underlie the re-emergence of disease.

Early warning for population health and infection control. A particularly exciting prospect is the integration of typing databases with epidemiological information, potentially producing global real-time epidemiological surveillance of pathogens that have epidemic potential61,62. There is increasing evidence of the value of rapid molecular profiling in assisting outbreak detection and hospital infection control26,28,29,63. For example, rapid outbreak detection by routine MRSA spa typing is a potential alternative to traditional approaches to hospital-acquired infection control28,63. In a prospective study, automated clonal alerts, which were based on real-time spa typing of hospital MRSA isolates and temporal-scan test statistics, were 100% and 95.2% sensitive and specific, respectively, in identifying outbreaks and were more sensitive and timely than routine surveillance by infection control nurses63.
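The idea of an automated clonal alert can be illustrated with a deliberately simplified Python sketch. It assumes a fixed count threshold within a sliding time window, which stands in for the temporal-scan test statistic used in the study cited above and is not equivalent to it; the function name, window length, threshold and spa type labels are all hypothetical.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, List, Tuple


def clonal_alerts(isolates: Iterable[Tuple[date, str]],
                  window_days: int = 14,
                  threshold: int = 3) -> Dict[str, List[date]]:
    """Return, per spa type, the dates on which the windowed count reaches the threshold."""
    by_type: Dict[str, List[date]] = defaultdict(list)
    for day, spa_type in isolates:
        by_type[spa_type].append(day)

    alerts: Dict[str, List[date]] = {}
    for spa_type, days in by_type.items():
        days.sort()
        start = 0
        hits: List[date] = []
        for end, day in enumerate(days):
            # Drop isolates that fall outside the window ending on this day
            while (day - days[start]).days >= window_days:
                start += 1
            if end - start + 1 >= threshold:
                hits.append(day)
        if hits:
            alerts[spa_type] = hits
    return alerts
```

In practice the threshold would be replaced by a statistical test against the expected background rate of each type, so that common endemic types do not raise constant alerts.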

In such an 'on-line' surveillance system, novel and previously characterized strains can be compared, grouped by cluster analysis and depicted as dendrograms or multidimensional graphs to simplify the presentation of complex time–space relationships. Spatial surveillance, using emerging geographical information systems, will enhance the ability to measure the extent and variables of an outbreak in space and time and the power to detect localized events64. The output from these systems ultimately needs to be integrated into clinical and diagnostic processes. Real-time data sharing, especially of genotypes of microbial isolates from different animal species as well as humans (for zoonotic infections) and from different jurisdictions or countries, could enhance rapid response using input and action triggers provided by multiple diagnostic, veterinary and public health laboratories and other partner organizations.

Concluding remarks

In this Opinion we have identified some of the major steps that are needed to generate and translate accessible genomic information about pathogens of clinical and public health importance. The synergistic use of high-throughput molecular testing, with advanced machine-learning approaches, has already redefined several traditional classifications of cancer65. A similar approach has started to affect communicable disease control. The concept of pathogen profiling described here provides a framework for data integration and sharing to ensure that the flood of data from new molecular technologies will be used effectively in public health surveillance and disease management.

We argue that diagnostic pathogen profiling will help to predict patient outcomes and identify markers that can be used for early diagnosis and to predict and monitor treatment responses. Pathogen profiling to identify individual genetic variation, along with a detailed knowledge of polymorphisms, will allow tailored interventions, a process commonly referred to as 'personalized medicine'. The potential value of pathogen profiles can be shown by, for example, the use of HIV and HCV genotyping to direct the choice of antiviral therapy, or specific genetic signatures in cancer tissue or host immune responses to predict outcomes27,31,57.

There are, however, many challenges in producing useful pathogen profiles. The methods used to generate input data and standards for sharing data are still evolving. A shift of emphasis towards integrative data analysis and sharing is difficult, but might prove to be the key to the successful translation and integration of laboratory diagnostics into improving clinical and public health outcomes in medicine.