Biomedical and clinical research have become increasingly data driven. Transforming large amounts of data into new discoveries requires cutting-edge analytical approaches, as well as new infrastructure to provide a foundation upon which algorithmic advances can build. Greater collaboration with outside fields such as software engineering and computer science has driven new advances in computational biology, with experts in these domains working alongside biomedical researchers and clinicians to acquire cross-domain expertise.

Published papers are increasingly dependent on algorithms and software that underpin the reported research. With the increasingly foundational role of computational approaches in biomedical science come challenges associated with reproducibility of results and robustness of underlying code. A 2016 survey of 1,500 scientists found that over 70% had tried and failed to reproduce another scientist’s experiments (ref. 1). That same year, the FAIR (findability, accessibility, interoperability and reusability) guiding principles were published (ref. 2), aimed at enhancing the reusability of scientific data. Transparency of software code is a prerequisite for reproducibility and is necessary for understanding the provenance of research data and insights. Improving the transparency of methods and interoperability of data will engender a rapidly growing need for well-engineered solutions that transcend a single lab and can be used by a large number of scientists.

Research software engineering is an emerging field focused on addressing these core challenges through a unique skill set that enhances the value and usage of scientific data. Research software engineering can facilitate interdisciplinary science and accelerate translational research through efficient data management and equitable data provision.

An emerging role

Research software engineering combines professional software engineering expertise with an intimate understanding of research. The focus is on delivering best practices through the application of foundational software engineering practices such as version control, testing and automation, while at the same time ensuring the data output remains scientifically valid and accurate. The research software engineer (RSE) speaks the language of professional engineering and understands fundamental research methods. From this unique position, RSEs can think differently about research questions and spur innovative solutions that scientists and data analysts alone might not reach.
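
As a minimal sketch of what testing for scientific validity can look like, the Python example below checks a small normalization routine against properties that must hold for the output to be meaningful. The function counts_per_million and the example counts are hypothetical and chosen only for illustration; they are not taken from any tool named in this article. The test functions follow the test_ naming convention that runners such as pytest collect automatically.

```python
import math


def counts_per_million(counts):
    """Scale raw read counts so that they sum to one million (hypothetical helper)."""
    total = sum(counts)
    if total <= 0:
        # An all-zero sample cannot be normalized; fail loudly instead of
        # silently returning meaningless values.
        raise ValueError("total count must be positive")
    return [c * 1_000_000 / total for c in counts]


def test_counts_per_million_sums_to_one_million():
    cpm = counts_per_million([10, 30, 60])
    assert math.isclose(sum(cpm), 1_000_000)


def test_counts_per_million_preserves_proportions():
    # Normalization must not change the ratio between two genes in a sample.
    cpm = counts_per_million([10, 30, 60])
    assert math.isclose(cpm[1] / cpm[0], 3.0)


def test_counts_per_million_rejects_empty_sample():
    try:
        counts_per_million([0, 0, 0])
    except ValueError:
        return
    raise AssertionError("expected ValueError for an all-zero sample")
```

Kept under version control and run automatically on every change, checks of this kind flag a refactoring that silently alters results before those results reach a paper.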

The application of professional software engineering is critical for the scaling and reproducibility of scientific output, especially as researchers grapple with the sheer size and volume of datasets, as well as an abundance of different analytical methods. The impact of research software across science is huge; consider, for example, the industrial scaling of centralized genome browser resources such as Ensembl that revolutionized the biosciences with massive infrastructure and engineering projects.

The concept of the RSE has existed for only a decade and has grown rapidly, establishing the importance of the discipline across various scientific domains (ref. 3). Since the idea was first proposed at an event at The Queen’s College, Oxford, in March 2012 (ref. 4), the movement has spread to a substantial international community, with ten established associations in the United Kingdom, mainland Europe, Africa, Asia, Australia and North America. In the United Kingdom, at least 38 universities have their own centralized RSE groups that researchers can use to access skilled software professionals to develop the software tools they need for their research.

Through support from organizations such as the Software Sustainability Institute, the RSE community has helped develop several initiatives that champion open science and reproducibility in the life sciences. Any researcher who writes code, such as bioinformaticians, can align with RSE communities and benefit from exposure, training and peer support. Other resources such as The Turing Way (ref. 5) provide handbooks for reproducible, ethical and collaborative data science. As awareness improves, RSEs are being increasingly embedded within research teams, and this in turn increases accountability and enhances trust in the scientific results delivered by software. Ways to engage, receive training from and work effectively with RSEs are in Table 1.

Table 1 How researchers can engage and work effectively with research software engineers

Data-driven science

Emerging technologies and big data open up exciting new opportunities for scientific discovery. Artificial intelligence has the potential to extract new actionable insights from the complexity of human health and disease, with prospective applications in biomedicine, including image-based diagnostics and the discovery of new, more effective treatments. Emerging computational approaches with the potential to transform biomedicine must be underpinned by robust and scalable software, ideally from professionals who sit between research and technology, as exemplified by AlphaFold, an artificial intelligence system developed by the DeepMind laboratory and the European Molecular Biology Laboratory’s European Bioinformatics Institute to provide open access to over 200 million protein structure predictions.

Although research software engineering can play a crucial part in the research lifecycle, the recognition of its importance does not yet match that of data generation and analysis. That being said, RSEs are a key driver of research success, dissemination and impact. By investing in the adoption of FAIR principles throughout the data pipeline and extending those principles to software (ref. 6), RSEs can transform research data output from being seen as a final resting place into a dynamic, collaborative resource in an active ecosystem of tools and infrastructure.

Team science

The research landscape is seeing increasingly large interdisciplinary collaborations across institutions, which often generate high-impact research (ref. 7). Approaches that integrate biological and clinical knowledge lead to innovations for improving health outcomes. Modern science relies on many people with many different skills to conduct research, from community managers to people who produce training materials.

One example of global collaborative science is the Human Cell Atlas initiative, which aims to characterize and map every cell type in the human body. This international consortium has over 2,000 members in over 80 countries and invests in building capacity through multidisciplinary teams that champion open science, including software engineers focused on data storage, sharing, browsing and dissemination. Their data and findings are shared openly with the broader scientific community, which accelerates discoveries and deepens collaboration among researchers around the world. RSEs played a fundamental role in the rapid coordination and deployment of the consortium’s centralized COVID-19 data portal.

Although many funders support software development, less money has historically been available for the critical work of software maintenance. Fortunately, a growing number of funders are seeking to address this problem. Schmidt Futures recently announced the creation of the US $40 million Virtual Institute of Scientific Software to fund the maintenance of researcher-written code (ref. 8). The Chan Zuckerberg Initiative, a philanthropic organization that is dedicated to building the future of science by funding efforts such as the Human Cell Atlas, has also pledged US $40 million through its Essential Open Source Software for Science program. This provides support for ongoing maintenance of widely used open-source scientific software that is critical for maintaining the ecosystem, which is often overlooked by discovery-science-funding mechanisms.

Driving translation

RSEs drive clinical translation of research findings. By delivering data through web applications, for example, RSEs remove the technical burden from clinicians, students, investigators and industry partners. As the only requirement is internet access and a web browser, this substantially improves global, equitable access to research data. Similarly, if data visualization and analysis tools are more readily available through intuitive point-and-click interfaces, research teams around the world can collaborate more easily. The development of open-source scientific tools and resources for single-cell biology data, such as the Chan Zuckerberg cell-by-gene platform, the Human Developmental Cell Atlas and the Cambridge Cell Atlas, helps scientists explore and visualize high-dimensional single-cell datasets from which to derive scientific insights. Tools and resources such as these empower researchers to access data when it suits them, and facilitate collaborative research to improve data generation, analysis, biological interpretation and the clinical application of research findings. Essential design considerations for the success and development of these web applications are in Box 1.

Although the scientific landscape has changed and new roles and expertise have become increasingly important, the judging of research excellence has not kept pace. Assessment of research success tends to focus on individual achievements, such as a published article or a successful grant, instead of cumulative progress that can lead to breakthroughs. Mechanisms that reward individual researchers inevitably undervalue those in roles essential to collective research projects, such as RSEs, lab managers and research technicians. That being said, there are new awards and accolades that are focused on recognizing the contributions of all roles within research. The Hidden REF campaign, for example, celebrates all research output and recognizes everyone who contributes to its creation (ref. 9). However, these initiatives are frequently organized by volunteers, so their impact will remain limited until they can attract greater awareness and funding.

There is a need for greater acknowledgment and support for emerging roles (such as research software engineering) from all stakeholders, including funders, research organizations, learned societies and researchers with traditional scientific backgrounds. An important first step is the regular citation and acknowledgment of software and the contributions made by research software engineering to scientific papers (ref. 10). Many researchers do not know that software is citable; frameworks such as the CZ Software Mentions Dataset (ref. 11) elevate software to a research output. Proper credit for software tools and their utility is key to ensuring the role of RSEs is fully understood and recognized by the broader scientific community. If researchers cite software they use in publications, and encourage training and peer support for software development, they might then build the skills and confidence to publish their own software. This supports collaboration and allows others to cite the software, leading to a cultural cycle of valuing software in research (Fig. 1).

Fig. 1: A community cycle of valuing software in the research process.

Researchers should cite the software that they use in publications, encourage training and peer support, and eventually have the skills and confidence to publish their own software.

A deluge of data

A critically undervalued part of research software engineering is the creation of new data models, data infrastructure, and file standards for emerging technologies. This work is essential, because it provides the underlying framework for labs to create, store, share and collaborate on research data at scale. Technical frameworks for new innovative projects have not yet been created, and only trained engineers with a solid understanding of research can deliver products at the required robustness and scale. Individual labs are infrequently motivated to take this work on, yet a rapidly growing cross-section of science benefits from, and indeed is reliant upon, the efforts of RSEs. For these projects, RSEs need to understand the structure of the data being generated, assess how the data will be consumed and anticipate future challenges and innovations.

The Open Problems in Single-Cell Analysis project provides an open-source, community-driven platform for continuously updated benchmarking of formalized tasks in single-cell analysis. Algorithms for tasks such as batch integration, or comparison of data-denoising methods, can now be easily benchmarked through the use of this platform. With the Open Problems platform driving community convergence on new standards, data can be stored with integrity, ported between labs and easily interrogated. Recently, the Open Microscopy Environment consortium, which has maintained a common data model for bioimaging for the past 20 years, described their efforts to create a next-generation file format for bioimaging (ref. 12), driven by the need to share large imaging data in the cloud. The adoption of this format was achieved only through considerable efforts by RSEs to update existing tools, but, critically, also required coordination efforts in the community, organizing events and gentle building of consensus.
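
The idea of benchmarking a formalized task can be illustrated with a toy Python sketch: each method receives the same input and is scored against the same ground truth with the same metric. Everything here is an assumption made for illustration, including the synthetic constant signal, the two denoising methods and the mean-squared-error metric; it does not reproduce the Open Problems platform, its tasks or its interface.

```python
import random


def identity(signal):
    """Baseline 'method' that returns the noisy input unchanged."""
    return list(signal)


def moving_average(signal, window=3):
    """Denoise by averaging each point with its immediate neighbours."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed


def mean_squared_error(estimate, truth):
    """Shared metric: lower values mean the estimate is closer to the truth."""
    return sum((e - t) ** 2 for e, t in zip(estimate, truth)) / len(truth)


def benchmark(methods, noisy, truth):
    """Score every method on the same data and return results, best first."""
    scores = {name: mean_squared_error(fn(noisy), truth) for name, fn in methods.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical task: recover a constant signal from noisy measurements.
rng = random.Random(0)  # fixed seed so the comparison is repeatable
truth = [5.0] * 200
noisy = [t + rng.gauss(0, 1) for t in truth]

leaderboard = benchmark({"identity": identity, "moving_average": moving_average}, noisy, truth)
for name, score in leaderboard:
    print(f"{name}: MSE = {score:.3f}")
```

Fixing the data, the metric and the random seed is what makes the comparison repeatable; a community platform applies the same principle across many methods and datasets.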

Software engineering is a discipline rooted in identifying major challenges and then constructing solutions to them, for the benefit of many. In the case of modern biomedicine, the importance of this mindset and skill is growing rapidly. The deluge of data, potential of advanced computational approaches, and increasing impact of team science together create a research environment with RSEs as a critical central component. It is increasingly clear that sharing data openly at the scale at which it is being generated is not reaching its full potential. Promoting data utility requires not just storage solutions that scale, but also performant software and infrastructure solutions by which data can be made interoperable, visualized and leveraged by experts and non-experts alike. These important contributions need to be recognized and rewarded as biomedical science advances. Research software engineering is poised to revolutionize how the scientific community can democratize not just the data, but also the technical infrastructure and mechanism for interacting with it, providing an opportunity to modernize how scientists and the public engage with research narratives.