Main

Increasingly, public-health officials are using pathogen genomic sequence data to support surveillance, outbreak response, pathogen detection and diagnostics1. Sequencing cuts across traditional pathogen boundaries; for example, it can be used to distinguish cases of ‘wild’ polio from vaccine-derived polio2, to predict the susceptibility of a tuberculosis infection to antibiotics3 or to trace the source of a foodborne infection1. Most recently, scientists and public-health agencies are using sequence data of the coronavirus SARS-CoV-2 to investigate the origin of this virus4,5, the global expansion of the epidemic6 and community transmission of COVID-19 in various localities7,8,9,10.

Because of its utility, public-health agencies throughout the world are developing their capacity to perform genomic sequencing. Almost every infectious disease program within the US Centers for Disease Control and Prevention generates and analyzes pathogen sequence data11. Many international public-health agencies, such as Public Health England, the Public Health Agency of Canada and the European Centre for Disease Prevention and Control, also have large sequencing programs. Capacity for pathogen genome sequencing has also grown within agencies at the state and local level. Indeed, every state public-health lab in the USA, as well as public-health labs in most major counties, conducts pathogen genome sequencing for foodborne surveillance, if not for other diseases as well.

Although laboratory capacity to generate sequence data has increased greatly, the capacity to assemble, analyze and interpret genomic data has been harder to develop. Possibly this is because many of the tools developed for sequence assembly and analysis either are expensive to license or require a high level of computational proficiency to use. Individuals with specialized training in bioinformatics and genomic epidemiology are relatively new to applied public health, and this workforce has not been distributed evenly across agencies. To grow analytic capacity, we believe that researchers must work from both ends: make bioinformatics and genomic analysis more accessible to non-specialists, and also build a larger workforce with experience in bioinformatics and genomic epidemiology within public health.

In this Perspective, we make recommendations for building a sustainable informatic infrastructure for pathogen genomics that can be used across public-health programs. We have centered our recommendations around what we feel are the fundamental characteristics of an open ecosystem for pathogen genomic analysis: reproducibility, such that genomic analysis is standardized and repeatable across agencies and through time; accessibility, at varying levels of both economic resources and technical knowledge; flexibility, providing a set of modular tools to analyze, explore and visualize genomic data across a range of public-health applications; and auditability, ensuring that genomic assembly and analysis can be validated according to strict public-health standards.

Our recommendations are not a checklist; rather, we aim to provide a structured view of what a public-health informatics ecosystem might look like (Fig. 1). We hope this work provides a starting point from which the community can come together to design and develop this ecosystem.

Fig. 1: Data processing, analysis and visualization.

Left, data processing: First, bioinformaticians must process raw reads and assemble genomes. Within the data-processing side of this ecosystem, we envisage three types of databases: one for archiving or holding raw sequencing reads, one for archiving assembled data and one for holding metadata about the samples. Various current databases could fill these positions, or new databases could be developed if public-health programs require additional utility. We imagine that the Sequence Read Archive would continue to serve as the primary raw reads database. But, for instance, for a metagenomic sample containing both pathogen and human reads, the reads could be held in Illumina’s BaseSpace platform instead. From here, bioinformaticians could assemble genomes using open-access pipelines available from a cloud-based deployment platform; pipeline choice would be based around what type of assembly the user needed. Within each assembly pipeline, the final step should be automatic submission of the genome to the relevant database for that assembly type. This database could be an NCBI database (e.g., NCBI Nucleotide, NCBI Pathogen Detection), a member of the International Nucleotide Sequence Database Collaboration (e.g., DDBJ, ENA) or a pathogen-specific assembly database (e.g., GISAID, ViPR). Alternatively, if the genome sequence itself represents highly sensitive data, this database could also be a private repository available only to individuals within the public-health institution and vetted partners. The sequence data accession identifier should be deposited into a third database, the metadata database. In our design, the metadata database would be an in-house relational database that facilitates sample tracking and houses all relevant clinical and laboratory data according to a well-defined schema that can also accommodate long-form entries. Likely, it would be easier to license databasing software for the metadata database than to build it from scratch.

Importantly, metadata databases could also be secured, and could house relevant personally identifiable information (PII) collected during epidemiologic investigations. Keeping these data separate from the genomic data will ensure that PII can be kept private when necessary. Data linking would occur via API calls: calls to the metadata database would pull relevant sample information and the assembly accession number, which an API would then use to source the genome assembly from the genomic database. Various metadata and genomic data combinations could be sourced depending on what data fields were necessary for the desired analytic or visualization pipeline.

Right, data analysis and visualization: Once genomic assemblies and relevant metadata were combined, they could be piped to various analytic workflows, for example for predicting antimicrobial resistance, making specific data structures such as phylogenetic trees, or preparing datasets or data objects to serve as interactive data visualization platforms. We imagine that a wide array of different visualization and analytic pipelines will be in use; good APIs and complete, standardized metadata are necessary to support that breadth. Some analytic pipelines may be completely containerized, end-to-end workflows that produce visualizations or reports. Others could make data objects, such as phylogenetic trees, and submit them to an additional database for use in subsequent analyses. Additionally, analytic pipelines could make API calls to external databases, such as antimicrobial resistance gene databases, facilitating the integration of these new pipelines with existing software packages and databases.
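
The data-linking step described in this caption can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the two dictionaries stand in for the metadata and genomic databases, and the function names, field names and accession values are invented for illustration rather than taken from any existing system. A real deployment would place these calls behind authenticated API endpoints.

```python
# Hypothetical in-memory stand-ins for the two databases.
METADATA_DB = {
    "sample-001": {
        "collection_date": "2020-03-01",
        "county": "Example County",
        "assembly_accession": "ACC0001",
    },
    "sample-002": {
        "collection_date": "2020-03-04",
        "county": "Example County",
        "assembly_accession": "ACC0002",
    },
}
GENOME_DB = {"ACC0001": "ATGCGT", "ACC0002": "ATGCGA"}


def get_metadata(sample_id, fields):
    """Stand-in for a metadata API call: return only the requested fields."""
    record = METADATA_DB[sample_id]
    return {name: record[name] for name in fields}


def get_assembly(accession):
    """Stand-in for a genomic database API call."""
    return GENOME_DB[accession]


def link_sample(sample_id, fields):
    """Join one sample's requested metadata to its assembly via the accession."""
    accession = get_metadata(sample_id, ["assembly_accession"])["assembly_accession"]
    linked = get_metadata(sample_id, fields)
    linked["sequence"] = get_assembly(accession)
    return linked
```

Because the caller names the fields it needs, a downstream pipeline receives only the metadata relevant to its analysis, and sensitive fields never leave the metadata database unless explicitly requested.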

Methodology for achieving a consensus

To investigate the current landscape of bioinformatics and genomic epidemiology in public-health agencies, we conducted a series of long-form, semi-structured interviews with bioinformaticians, laboratory microbiologists performing sequencing, software engineers developing pipelines and workflow-management software for public health, and epidemiologists acting upon inferences from genomic data. We aimed to get a broad perspective, interviewing individuals from different countries, working on a wide array of pathogens and working in agencies with varied capacity for performing genomic analysis. A full list of sources of interviewees is in Table 1.

Table 1 Agencies, programs and development teams participating in long-form interviews in the generation of this consensus statement

The interviews focused on the following topics: technical components of genomic analysis, considerations for genomic analysis specific to public-health settings and social issues surrounding genomic data. The interviews revealed various themes and consistent challenges related to supporting pathogen genomic analysis in public-health agencies. Our recommendations seek to address those challenges, and describe strategies for building a sustainable, efficient and effective bioinformatics infrastructure for the growing need in public health.

Although we conducted interviews primarily with the staff of public-health programs within the USA, our colleagues at the Africa Centres for Disease Control and Prevention led a concurrent effort to assess sequencing and bioinformatics capacity within African public-health agencies. We reviewed each other’s landscape analyses, finding many challenges within small public-health institutions in the USA that were similar to those that exist in Africa. To ensure that our recommendations would be relevant across income settings, our colleagues at the Africa Centres for Disease Control and Prevention reviewed the recommendations outlined here for their appropriateness to public-health settings in low- and middle-income countries.

Recommendations

In this section we describe our ten recommendations for supporting open pathogen genomic analysis in public-health settings.

Support data hygiene and interoperability by developing and adopting a consistent data model

Pathogen isolates need context. Who was the sample collected from? When was it collected? How was it collected? Without this information, often referred to as metadata, much of the value of the sample is lost, both from a clinical reporting and a data analysis standpoint. Despite the value of metadata, in current practice sequences are frequently decoupled from the full constellation of epidemiologic data and sample data that describe them (Box 1).

To ameliorate this problem, widely used genomic databases such as the Sequence Read Archive have standards and formatting requirements for submissions. However, we continue to face challenges of data incompleteness and lack of consistency in data reporting. Data incompleteness compromises the analytic utility of the data, and data inconsistency impacts users’ ability to interact with the data through computer programs. As the increasing amounts of data reduce our ability to manually interact with those data, both complete and structured data will be fundamental to an informatic ecosystem that works effectively and efficiently at scale.

To improve this situation, we recommend adopting a data model (Box 1) that specifies necessary data elements and provides an appropriate structure for linking sequence data, clinical data and epidemiologic information. The required data elements specified by the data model should be sufficiently flexible that they are applicable across a wide array of pathogens. To structure the data recorded within the data model, public-health programs will also need to adopt and/or develop ontologies: controlled vocabularies that standardize free-form epidemiologic information about cases and their exposures. Standardizing how data are recorded facilitates programmatic interaction with databases, enabling users to automate quality control and analytic procedures. Two good examples of epidemiologic ontologies are the Integrated Rapid Infectious Disease Analysis genomic epidemiology ontology, GenEpiO (https://genepio.org/) and FoodOn12.
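
To make the idea concrete, the following Python sketch shows how a data model with a controlled vocabulary enables automated quality control. It is an illustration under stated assumptions, not a proposed standard: the `SampleRecord` fields, the `SPECIMEN_SOURCES` vocabulary and the `validate_rows` helper are all hypothetical, and a real vocabulary would be drawn from an ontology such as GenEpiO or FoodOn rather than hard-coded.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical controlled vocabulary, hard-coded only for illustration.
SPECIMEN_SOURCES = {"stool", "blood", "food", "environmental swab"}


@dataclass(frozen=True)
class SampleRecord:
    """Hypothetical minimal data model linking a sample to its sequence."""
    sample_id: str
    collection_date: date
    specimen_source: str
    sequence_accession: str  # link to the genomic database

    def __post_init__(self):
        # Reject free-text values that are not in the controlled vocabulary.
        if self.specimen_source not in SPECIMEN_SOURCES:
            raise ValueError(f"unknown specimen_source: {self.specimen_source!r}")


def validate_rows(rows):
    """Split raw dict rows into valid records and (row, error) rejects."""
    valid, rejected = [], []
    for row in rows:
        try:
            valid.append(SampleRecord(
                sample_id=row["sample_id"],
                collection_date=date.fromisoformat(row["collection_date"]),
                specimen_source=row["specimen_source"].strip().lower(),
                sequence_accession=row["sequence_accession"],
            ))
        except (KeyError, ValueError) as err:
            # Missing required elements and out-of-vocabulary terms are
            # reported rather than silently stored.
            rejected.append((row, str(err)))
    return valid, rejected
```

Rejected rows are returned with a reason instead of being discarded, so that incomplete or inconsistent submissions can be corrected at the source.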

Strengthen application programming interfaces

Application programming interfaces, or APIs, are the mechanism by which users communicate with computers, code and databases in an automated way. They are critically important for programmatic querying of databases, collation of disparate data sources and communication between pieces of software within a greater ecosystem.

The relative paucity of consistent and well-documented APIs for software tools and databases used by public-health bioinformaticians has at least two effects. First, the lack of APIs limits the scalability of bioinformatics analyses. Currently, querying genomic databases frequently requires human interaction via a web-based graphical user interface. However, with ever-increasing amounts of data, the ability to manually explore, source and distribute data will decline. Bioinformaticians at public-health agencies will need to automate querying and analysis; the quality of APIs will directly affect their ability to do this reproducibly and efficiently. Second, the lack of APIs leads to inefficient use of bioinformatician effort. When basic pipelines do not run automatically, or linking programs together requires considerable effort, bioinformaticians spend large amounts of time writing interstitial code and managing file format conversions. This takes up time that bioinformaticians and genomic epidemiologists could otherwise spend analyzing the data, probably with greater public-health impact.

The development and use of well-documented APIs will underlie the success of a software ecosystem within public health, and cannot be an afterthought. We recommend that public-health institutions adopt common API standards and carry out API development in tandem with database or software development. For the many software programs and databases that already exist, specific funding sources should be allocated to build or extend current APIs to function with the agreed-upon data models and adhere to adopted API standards (Box 2). Notably, the US General Services Administration has developed standards for APIs, which provide a concrete starting point in the development of APIs for genomic and epidemiologic databases. These standards are described in detail at https://github.com/GSA/api-standards.
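
As an illustration of why consistent APIs matter for automation, the Python sketch below collects every record from a paginated endpoint without human interaction. The response shape (a `results` list plus a `total` count, addressed by `offset` and `limit`) is an assumption made for this example and is not quoted from any particular standard or database; `make_fake_endpoint` is an in-memory stand-in for a remote service.

```python
def fetch_all(fetch_page, page_size=100):
    """Collect every record from a paginated endpoint.

    fetch_page(offset, limit) is a caller-supplied function (hypothetical
    here) returning a dict with a "results" list and a "total" count.
    """
    records = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        records.extend(page["results"])
        offset += len(page["results"])
        # Stop once everything is collected or the server sends an empty page.
        if offset >= page["total"] or not page["results"]:
            break
    return records


def make_fake_endpoint(items):
    """In-memory stand-in for a remote API, used only for illustration."""
    def fetch_page(offset, limit):
        return {"results": items[offset:offset + limit], "total": len(items)}
    return fetch_page
```

If every database in the ecosystem paginated in the same documented way, a helper of this kind would need to be written once rather than once per database, which is the kind of interstitial code that common API standards are meant to eliminate.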

Develop guidelines for management and stewardship of genomic data

The increasing abundance of longitudinally collected pathogen genomic sequence data is a valuable resource for public health. To fully realize the value of these data, however, programs will need to manage and care for the data in a unified manner. To this end, public-health institutions should develop and adopt guidelines and standards for data collection, annotation, archiving, and reuse. The community should design these guidelines to ensure that data adhere to FAIR principles: that is, they are findable, accessible, interoperable and reusable13. Following these principles ensures that once generated, data can be reused in the future.

We recommend that guidelines describe which data to archive, including both raw and assembled data such as consensus genome sequences; the duration of archiving; systems for long-term archiving; and the intended use and long-term value of the data and appropriate metadata standards. Archiving practices must be responsive to the requirements and priorities of individual health jurisdictions, but we recommend that wherever possible, agencies prioritize keeping data easily searchable and shareable.

Make bioinformatics pipelines fully open-source and broadly accessible

Currently, commercial software can provide off-the-shelf bioinformatics capabilities to laboratories with limited in-house capacity. However, licensing proprietary software can be prohibitively expensive, especially in low- and middle-income settings. Proprietary software also has limitations in high-income settings, though these are perhaps less obvious: although it may be economically accessible, using proprietary software may reduce transparency about how data are processed, and it limits customization of bioinformatics pipelines.

To facilitate broad access to standardized bioinformatics across income settings, we recommend developing and maintaining a deployment platform of open-source pipelines for bioinformatics assembly and genomic analysis. We describe a model of this deployment platform in greater technical detail in Box 3. Within this ecosystem, we recommend that bioinformaticians use open-source software packages within pipelines, and that they deploy full pipelines openly. Pipelines should output to common, non-proprietary file formats, and bioinformaticians should build them transparently in an environment that supports user feedback and issue tracking. To ensure that limited informatic training is not a barrier to use, frequently used reference pipelines, such as those used for molecular surveillance of foodborne pathogens, should be accessible via web-based entry portals with graphical user interfaces.

We note that access to standardized bioinformatics does not mean limiting the number of workflows available. Rather, it means ensuring that we build software and workflows upon widely accepted standards in a way that is transparent and auditable by the community. If interoperability between tools and openness of the entire system to sharing and review are prioritized, we believe that a balance point will be reached at which there are sufficient tools and workflows to support the analyses public-health agencies want to perform, without leading to a proliferation of redundant tools and workflows (Box 3).

Develop modular pipelines for data visualization and exploration

A large proportion of genomic data interpretation relies on data visualization, such as the creation of phylogenetic trees. However, the current process for making and refining these visualizations is inefficient. Genomic data are frequently separated from epidemiologic data, and most public-health bioinformaticians will not have access to demographic and exposure information for the individual who is the source of the data. This means that bioinformaticians cannot easily analyze epidemiologic and genomic data jointly to create integrated visualizations.

Additionally, visualization pipelines typically run as a monolithic series of computations that start from raw sequencing reads and end with a single image, not a genomic data object that can be visualized and explored in multiple ways. This image is generally shared over email, and manually annotated with epidemiologic data. Highly collaborative teams seeking to integrate their genomic and epidemiologic interpretations may repeat this cycle of generating images and then annotating images many times over; this is time consuming and potentially error prone, and it may limit which analyses can be performed, reducing public-health utility.

We recommend taking a more functional, modular approach. Firstly, analytic and visualization pipelines should be separated from assembly pipelines. This separation allows genome assembly to occur on lower-security scientific computing servers in the absence of epidemiologic data. The separation also provides an added benefit that if a bioinformatician wishes to rerun an analysis, they do not need to redo the genome assembly. After assembly, bioinformaticians could join epidemiologic data housed on secure servers with the assembled genomes. If APIs are used to source data, different levels of security authorization could be required to access different components of epidemiologic data. Notably, for data joining to work, structure and consistency provided by data models and ontologies will be needed. Finally, rather than exporting a single image, analytic pipelines could export data objects for interactive visualization in browser-based portals, increasing epidemiologists’ and bioinformaticians’ capacity to explore the data together.
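
The final step, exporting a data object rather than a single image, might look like the following Python sketch. The object layout, the function names and the example field names are hypothetical; the tree is carried as a Newick string, a common plain-text tree format, and is not parsed here.

```python
import json


def build_tree_object(newick, tip_metadata, pipeline_version):
    """Bundle a tree and per-tip annotations into one JSON-serializable object.

    tip_metadata maps each tip name to a dict of epidemiologic fields that
    the requester is authorized to see (hypothetical structure).
    """
    return {
        "pipeline_version": pipeline_version,
        "tree_newick": newick,
        "tips": [
            {"name": name, **fields}
            for name, fields in sorted(tip_metadata.items())
        ],
    }


def to_json(tree_object):
    """Serialize the object to a common, non-proprietary text format."""
    return json.dumps(tree_object, indent=2, sort_keys=True)
```

A browser-based viewer could load such an object and recolor or filter the same tree by any annotated field, so changing the epidemiologic annotation does not require rerunning the analysis or regenerating an image.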

Interpreting genomic data is not always intuitive, which can make communicating findings to multidisciplinary public-health teams challenging. To improve data interpretation, we recommend developing analytic tools further, so that they properly account for uncertainty in the sampling process, and developing new ways to convey uncertainty within genomic data visualizations. The widespread use of genomic data in public health is relatively new, and many public-health practitioners do not have a background in genomic epidemiology. Thus, researchers must ensure that data exploration and visualization tools effectively capture and convey uncertainty to experts and non-experts alike.

Improve the reproducibility of bioinformatics analyses

As often occurs in academic settings, public-health programs routinely use similar, but distinct, pipelines for bioinformatics analysis. Although most pipelines use a relatively narrow suite of open-source software programs, the lack of standardization across bioinformatics pipelines affects the comparability of data and results across agencies.

Sequencing assays in public health must be sufficiently robust and reproducible to meet government-regulated standards. This need for stable software and reproducible analyses should drive how bioinformatics pipelines are developed, maintained, hosted and tested. To meet this need, we recommend using version control to manage datasets and pipelines, containerizing code and requirements, auditing pipelines, using workflow management software and developing validation criteria for assessing bioinformatics assembly against known standard datasets (Box 4).
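
One of these practices, validating assembly output against a known standard dataset, can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the sequences are assumed to be already aligned to equal length, and the 99.9% default threshold is a placeholder rather than a regulatory criterion.

```python
import hashlib


def percent_identity(assembled, reference):
    """Percentage of matching positions between two equal-length sequences."""
    if len(assembled) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    if not reference:
        raise ValueError("reference sequence is empty")
    matches = sum(a == b for a, b in zip(assembled.upper(), reference.upper()))
    return 100.0 * matches / len(reference)


def validate_assembly(assembled, reference, min_identity=99.9):
    """Compare an assembly with a known standard and record a checksum.

    min_identity is a placeholder threshold chosen for illustration.
    """
    identity = percent_identity(assembled, reference)
    return {
        "identity": identity,
        "passed": identity >= min_identity,
        # The checksum lets auditors confirm later that the archived output
        # is byte-for-byte the output that was validated.
        "sha256": hashlib.sha256(assembled.upper().encode("ascii")).hexdigest(),
    }
```

Running such a check automatically whenever a pipeline or one of its dependencies changes version would give agencies an auditable record that results remain comparable through time.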

Utilize cloud computing to improve the scalability and accessibility of bioinformatics analyses

As the scope of genomic surveillance grows, so too will the volume and complexity of data generated during routine public-health laboratory operations. For many public-health institutions, the assembly and analysis of next-generation sequence data already depends on advanced computing infrastructure for data capture, analysis and storage. To better support the current needs, as well as to plan for the future, we recommend developing the public-health bioinformatics ecosystem as a cloud-based system. We imagine that the cloud-based system would be hosted centrally, probably by a federal public-health agency, which would reduce the number of high-performance computing environments needed to support broad access to bioinformatics. That way, not every institution would have to purchase server hardware or pay the highly remunerated workforce necessary to maintain a high-performance computing cluster. Instead, smaller agencies could pay only for their usage of the cloud-based ecosystem, and even this could be reduced if computing were entirely centrally funded. Agencies could manage costs more efficiently due to the inherent elasticity of cloud computing. Computing power could be scaled up in times of high demand, such as outbreaks, and scaled down when demand is low to reduce costs.

Centralized management of a broadly accessible resource would also allow agencies in smaller jurisdictions, or in low- and middle-income countries, to support sophisticated bioinformatics capabilities without incurring substantial capital or operational expenditures. Broadening the access to bioinformatics could help build capacity within small frontline public-health agencies, thereby reducing lag times during outbreak response. Broader access would also enable smaller agencies to investigate priority diseases at the local level.

In addition to scalability, accessibility and potential economic benefits, a cloud-based bioinformatics ecosystem could also improve the reproducibility of bioinformatics analysis. To run on the cloud, code should be containerized (Box 4); a container packages a pipeline together with its software dependencies, so the same analysis runs in the same way for every user. If most agencies and programs use the same pipelines, their results will be more comparable.

Although cloud-based computing holds great potential for public health, we would be remiss if we did not mention one formidable obstacle: connectivity issues in low- and middle-income countries (Box 5).

Support new infrastructure and software development demands with an expanded technical workforce

Tomorrow’s public-health workforce should include new technical specialties. To support the computational infrastructure necessary for broad-scale public-health genomics, programs could benefit from personnel with expertise in managing high-performance computing infrastructures, storage engineers who manage databases and networks, and software developers. To support the analysis of growing amounts of complex data, public-health agencies could benefit from additional bioinformaticians, genomic epidemiologists and data scientists.

Attracting this workforce may be challenging. Lower compensation than in the private sector, lack of access to newer technologies and the different culture of working within a government agency could prevent computationally oriented personnel from pursuing careers in public health. Emphasizing the ability to improve lives may increase recruitment, however.

Beyond recruitment, public-health programs should consider retraining as a way to build this workforce. Increasingly, laboratory microbiologists are pivoting towards more bioinformatics-heavy roles, often by learning these new skills on their own. With their incredible wealth of knowledge about the upstream sequencing process, former microbiologists have a unique perspective that could improve troubleshooting and evaluation of bioinformatics analyses. Bioinformatics training programs now exist across resource settings—for example, H3ABioNet (https://www.h3abionet.org/), ELIXIR-Tess (https://tess.elixir-europe.org/), GOBLET (https://www.mygoblet.org/) and Australian BioCommons (https://www.biocommons.org.au/training)—and these could serve as models for public-health bioinformatics training. Although public-health programs should design multiple courses tailored to different skill levels, possible topics could include command-line interfaces and common platforms for bioinformatics analysis, interpreting quality control metrics for whole-genome sequence data, bioinformatics methods for genome assembly, and the theory and practice of comparative genomic analysis.

Once such a workforce is developed, public-health agencies will also need to retain its members. Currently, many agencies lack formal job descriptions specific to computational disciplines, competency and assessment criteria, and mechanisms for computational personnel to advance into leadership roles. To sustain a computational workforce, public-health agencies should create clear descriptions of the disciplines and job series for bioinformaticians, data scientists and software engineers within public health.

Retaining computational personnel will probably be more challenging in low- and middle-income countries, and we expect that the recommendations above will be insufficient. As discussed by Folarin and colleagues14 in relation to retaining African scientists in genomic research, retention will likely require coordinated governmental support, sustained funding and infrastructure development. Public-health practitioners in low- and middle-income countries will know best how to approach building and retaining capacity, and we defer to their knowledge and experience. We simply note that understanding how to retain computational personnel in low- and middle-income settings is critically important to developing an informatic infrastructure that can work across all income settings.

Improve the integration of genomic epidemiology with traditional epidemiology

Neither epidemiologic case data nor pathogen genomic data are as powerful on their own as they are when integrated and analyzed together in a timely and actionable manner. From a technical perspective, this integration will require more sophisticated databasing approaches, including programmatic data sourcing and merging that respect security levels, use of ontologies to standardize data reporting formats for both surveillance data and genomic data, and machine-learning methods for data classification, tagging and cleaning.

Even with the necessary technical requirements in place, however, effective integration of epidemiologic and laboratory data will also require frequent and open communication between surveillance epidemiologists and bioinformaticians. We believe that this communication could be improved if bioinformaticians and epidemiologists could, to a certain degree, speak each other’s languages. We recommend that public-health agencies train surveillance epidemiologists in the basics of interpreting genomic data and, likewise, teach bioinformaticians the basic concepts of epidemiology. For example, a course for surveillance epidemiologists could clarify the applicability of genomics to their work, describe how genomic data are generated and discuss possible epidemiologic interpretations from comparative genomic analyses. Similarly, bioinformaticians may not always understand how epidemiologic questions and study design shape sample selection for sequencing. Thus bioinformaticians may benefit from courses that describe epidemiologic study designs, common analytic techniques in epidemiology and principles of public-health surveillance.

We also recommend creating integrated teams that include genomic epidemiologists, bioinformaticians and surveillance epidemiologists. We imagine that harnessing this expertise across disciplines will strengthen epidemiologic interpretations, as public-health officials will make inferences from multiple data sources, and the strengths and weaknesses of each data source will become clearer.

Develop best practices to support open data sharing

In an interconnected world where disease transmission occurs across borders, environments and species, the best surveillance system would support data sharing across institutions and agencies, both within and between countries. Ideally, all genomic data and non-identifiable metadata would be shared openly between all agencies, with data release occurring rapidly after data generation, once data are in a reasonably reliable draft form. Although harder to share, personally identifiable information (PII) can be critically important to understanding an outbreak. We recommend sharing PII along secure and trusted channels to the extent that it is important for understanding disease dynamics and guiding public-health responses.

Although we advocate for the described degree of openness as a best practice, we recognize that these standards would be nearly impossible to implement within our current public-health system. Some diseases, such as HIV, will require stricter constraints on data sharing for as long as they remain stigmatizing. Other infections are rare, allowing one to rapidly triangulate from non-identifiable data to PII. Finally, although public-health programs rightfully must follow rules that govern how PII is shared, these regulations often make data sharing convoluted, because definitions of PII vary by disease incidence and geography, and laws governing the use, storage and transmission of PII vary by jurisdiction.

In order to develop a data sharing system that functions well for public health, we think that data sharing needs to be easy to do, so that it is not a burden; occur along trusted channels; and be granular, so that access to different levels of data can be filtered based on security and legal constraints. We emphasize that the development of increased data openness in public health cannot be all or nothing; if it is, we will simply end up with a system in which sharing is limited. Instead, we should identify consistent small steps that programs can take to improve the openness of data, with the hope that open data and integrated databases improve surveillance and outbreak response sufficiently to warrant their continued development and maintenance (Boxes 6 and 7).
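
Granular access of this kind can be sketched as a field-level filter, shown below in Python. The tier names, the assignment of fields to tiers and the `filter_record` function are hypothetical; in practice, these assignments would be set by each jurisdiction's legal and security requirements.

```python
# Hypothetical sensitivity tiers, from least to most privileged.
TIERS = {"public": 0, "partner": 1, "restricted": 2}

# Hypothetical minimum tier required to see each metadata field.
FIELD_TIERS = {
    "sample_id": "public",
    "collection_month": "public",
    "county": "partner",
    "patient_name": "restricted",
}


def filter_record(record, requester_tier):
    """Return only the fields the requester's tier is allowed to see.

    Fields with no declared tier are withheld, so that a newly added field
    is private until someone explicitly classifies it.
    """
    level = TIERS[requester_tier]
    return {
        field: value
        for field, value in record.items()
        if field in FIELD_TIERS and TIERS[FIELD_TIERS[field]] <= level
    }
```

Withholding unclassified fields by default is a deliberate design choice in this sketch: it lets programs open up data in small, explicit steps instead of facing an all-or-nothing decision.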

Our vision of a potential software ecosystem

Given our proposals, and the software tools that currently exist, we imagine that our proposed system would be highly modular, with genomic assembly and data processing separated from the genomic analysis and visualization processes. Splitting these processes will maintain efficiency while allowing flexibility, enabling many different analyses to be performed without the need to rerun assembly pipelines. Importantly, separating the assembly and the analytic processes will also ensure that output from the assembly pipelines is archived, an important extension to current archival practices, which focus primarily on storing raw sequencing reads. The primary pieces of this ecosystem would be databases, APIs, pipelines and scripts that move data around (Fig. 1).

Conclusion

The shift toward extensive use of pathogen whole-genome sequencing represents a turning point for public-health agencies; programs must pivot to accommodate a new data source that provides increased resolution for understanding disease dynamics, but requires different tools and a changing workforce to support. Now is the time to build community and consensus, to invest in developing a system that is broadly accessible and that will work for years to come.

Our recommendations provide a starting point for these discussions. The efforts to realize an open ecosystem for public-health bioinformatics will be guided and supported by the Public Health Alliance for Genomic Epidemiology (PHA4GE), an organization that we, along with many others from the public-health bioinformatics community, launched in 2019. PHA4GE is a global coalition that is actively working to establish consensus standards, document and share best practices, and improve the availability of critical bioinformatics tools and resources. Through its work, we hope to see greater openness, interoperability, accessibility and reproducibility in public-health bioinformatics.