Water is the basis of life on Earth. It covers approximately 71% of the Earth’s surface, but only about 2.5% of this water is fresh water1,2. Society relies on the availability of adequate quantity and quality of water for drinking, hygiene, growing food, transport, regulating microclimate and maintaining an enjoyable environment. At the same time, the water cycle (evapo–transpiration, drainage, evaporation from water bodies, cloud formation and precipitation) makes water a global, common good, as the water availability in one place is strongly influenced by the land use in another. The effect of fossil fuel burning on climate and the global re-distribution of water is a prominent example of this influence3. Global pollution of water with persistent (and non-persistent) chemicals is also becoming increasingly problematic; for example, per- and polyfluoroalkyl substances (PFAS) are now detected in rainwater above safe limits4, leading to increasing calls for action as ‘there is no effective dilution for persistent global pollution’5.

The importance of water for society and the global relevance of water research means that such research needs to be freely accessible and re-usable globally for everybody, i.e., without the need for paid licenses to either view and re-use the publications or to use the data and related code employed in the research. Due to its global societal relevance, there is a particular onus on water research to be easily traceable and reproducible by a wide range of stakeholders.

Water research should be accessible to everybody

Opening science opens worlds of opportunities for greater societal gain6, especially in the dissemination of research and knowledge to those communities most affected by changes in water quality, quantity, and accessibility. One prominent example in environmental chemistry includes the recent discovery of 6PPD-quinone, a transformation product of tire rubber particles responsible for the death of coho salmon as a result of road runoff in storm events7 that has since triggered extensive research into the influence of tire wear on the environment. A second example is the identification of the cyanotoxin responsible for eagle deaths8, a mystery which took 25 years to solve. There are already countless examples of extreme flood events, droughts, extensive fish kills, and surface waters being declared unfit for human consumption due to various combinations of natural phenomena and complex contamination events, exemplified in the recent event in the Oder River9, which is still not clarified.

The need for rapid, open dissemination of findings is ever increasing to allow for large collaborative efforts such as the development of Early Warning Systems (EWS) for the preservation of wildlife, the human population and water resources. EWS are being developed in several areas, examples including the NormanEWS initiative10, one of the stimulating initiatives for chemical EWS developments within the European Partnership for the Assessment of Risks from Chemicals (PARC)11,12. The Environment Agency in England has also set up a national-scale Prioritisation and Early Warning System (PEWS) for contaminants of emerging concern (CECs)13. Similarly, flood awareness systems (FAS) are being developed at the European (EFAS)14 and global (GloFAS) level15. However, reliable flood forecasting relies heavily on real-time sharing of highly resolved meteorological and satellite data (see e.g. ref. 16). The Climate Risk and Early Warning Systems (CREWS) initiative of the United Nations (UN) is operating in 19 countries in Africa and the Pacific most prone to tropical cyclones and floods, including Least Developed Countries (LDCs) and Small Island Developing States (SIDS), with rollouts planned into further countries in Africa and Asia17. Besides immediate catastrophic events, water resources are subject to slow and persistent trends, whose discovery is only possible through free access to long time series of hydrological data across the globe. For example, groundwater recharge time scales vary globally between centuries and millennia, with the longest time scales found in arid systems18, meaning that over-exploitation of groundwater may be both hardest to detect and most difficult to undo in the systems that rely on it most heavily. Thus, large scale problems require global efforts, and all these collaborative efforts will rely on more Open Science.

Open Science goes beyond Open Access publishing and FAIR data

While much focus in recent years has been put on Open Access publishing, this is only a small part of Open Science. According to the 2015 FOSTER taxonomy19, Open Science integrates Open Access, Open Data, Open Source, and Open Reproducible Research (all of which we will touch on here, see Fig. 1a), while UNESCO and others have extended this further (e.g.,6). Open Data is commonly associated with the ‘FAIR Principles’20, which describe how to make data findable, accessible, interoperable, and reusable. The FAIR principles were introduced in 2016 and provide vital guidance that can be applied irrespective of whether the data itself is strictly open or not21. Note that the FAIR principles do not enforce Open Access, i.e., FAIR data is not automatically Open Data. Conversely, Open Data that is neither FAIR nor managed (see Fig. 1b) can easily be useless data. Thus, the combination of Open and FAIR data is extremely important. However, even Open Access publishing combined with Open and FAIR data does not necessarily make the research reproducible and re-usable, as discussed further below.

Fig. 1: The many elements of Open Science.

a, Open Science (centre, blue) and the four elements of Open Science pointed out by UNESCO most pertinent to this article (orange-ish circles with text). The remaining elements of Open Science described by UNESCO were removed for space reasons. They are represented in the ‘…’ circle along with the smaller decorative bubbles to show that Open Science covers many facets, big and small6. b, FAIR data vs. Open Data vs. Managed Data. Image modified from ref. 36. Managed data means that the data has in some way been collected, stored, organized and maintained. There is a large proportion of managed data that is neither FAIR nor Open, along with a large proportion of unmanaged Open Data. Since both cases are difficult to include in reproducible workflows, scientists and journals alike should be working on expanding the intersection between FAIR and Open Data.

Open Access is the subset of Open Science that includes principles and practices for distributing research outputs online, free of cost or other access barriers22,23. This includes for instance Open Access publications (e.g., the dissemination of research as so-called Green, Gold or Diamond Open Access) or the use of preprint servers to access earlier versions of research articles.

Open Data refers to the availability of the data behind the published research, typically hosted in either institutional or domain-specific data repositories (e.g., HydroShare for hydrological data24), or generic repositories such as Zenodo or FigShare. For Open Access publications and Open Data, appropriate license conditions should be stipulated, so that the conditions of re-use are clear. Creative Commons (CC) licenses are commonly used, with CC0 (public domain) and CC-BY (re-use with attribution) being the most permissive. Other restrictions on CC licenses can cause problems for downstream use. For instance, the ‘ND’ (no derivatives) clause forbids re-use for derivative works, i.e., any actual re-use other than re-distribution of the original work, while ‘NC’ (non-commercial use only) can prevent commercial companies (e.g., instrument vendors) from integrating Open Data into vendor-provided instrument libraries that could be used by researchers. The ‘SA’ (share-alike) clause can enforce a license on downstream users that they may not be able to comply with, thus preventing integration of Open Data in other open projects (due to incompatible licenses). While Open Data is an important starting point, without the availability of appropriate metadata and sufficient FAIRness to make the data findable, accessible, interoperable and re-usable, Open Data alone is only of limited use. In the era of ‘big data’, it is now relatively easy to create a quick dump of data, but curation and FAIRification of data require a concerted effort, which may necessitate either incentives (carrot) or mandates (stick). The Global Natural Products Social Molecular Networking (GNPS) ecosystem25 is a prime example of incentivising Open Data sharing. Starting primarily as a mass spectral data repository for metabolomics, the developers have consistently added features and functionality over the years to add value to the repository and increase motivation for deposition. For example, MASST26 has enabled discovery of the neurotoxin domoic acid and analogues within marine samples and food such as ocean-caught mackerel.
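The effect of the individual CC clauses on re-use, as described above, can be made explicit in a few lines of code. The following Python sketch is purely illustrative and is not part of any official Creative Commons tooling: the names `ReuseTerms` and `parse_cc_license` are hypothetical, and the function only interprets short identifiers such as ‘CC0’ or ‘CC-BY-NC-SA’, ignoring version numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReuseTerms:
    """Re-use conditions implied by a short CC identifier (illustrative only)."""
    attribution_required: bool
    derivatives_allowed: bool
    commercial_use_allowed: bool
    share_alike_required: bool


KNOWN_CLAUSES = {"BY", "NC", "ND", "SA"}


def parse_cc_license(code: str) -> ReuseTerms:
    """Interpret identifiers such as 'CC0', 'CC-BY' or 'CC-BY-NC-SA-4.0'.

    Hypothetical helper: it reads the clause abbreviations only and does
    not replace reading the actual license text.
    """
    parts = code.strip().upper().split("-")
    if parts[0] == "CC0":
        # Public domain dedication: no conditions on re-use.
        return ReuseTerms(False, True, True, False)
    if parts[0] != "CC":
        raise ValueError(f"not a CC identifier: {code!r}")
    # Drop empty fragments and version numbers such as '4.0'.
    clauses = {p for p in parts[1:] if p and not p[0].isdigit()}
    unknown = clauses - KNOWN_CLAUSES
    if unknown:
        raise ValueError(f"unknown clause(s) in {code!r}: {sorted(unknown)}")
    return ReuseTerms(
        attribution_required="BY" in clauses,
        derivatives_allowed="ND" not in clauses,
        commercial_use_allowed="NC" not in clauses,
        share_alike_required="SA" in clauses,
    )
```

A data repository or a submission checklist could use such a check to flag datasets whose license would block derivative or commercial re-use before they are integrated into a larger open project.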

Open Source software and code refer to the public availability of source code27, i.e., sets of computer instructions ranging from data processing scripts and algorithms to fully blown numerical models, desktop applications, or even operating systems. The purpose of open source is to provide transparency, and most importantly, re-usability and adaptability of the code, with a common aim of collaborative development. Licenses for Open Source works are generally designed to explicitly cover code sharing, thus Open Source licenses are generally preferred over CC, with common examples including GPL, Apache and MIT27. Suitable code repositories with version control and issue tracking are indispensable for collaborative open source developments, with common platforms including GitHub, GitLab, Bitbucket and more. For all three above-mentioned aspects of Open Science, i.e., Open Access, Open Data and Open Source, the generation of permanent identifiers such as a Digital Object Identifier (DOI)28 is an integral aspect of FAIR and vital to preserve the discoverability and lifetime of such projects.

Finally, Open Reproducible Research is a culmination of all three aspects above. With systems such as RMarkdown and Jupyter Notebooks, it is now possible to have fully compilable research outputs and reproducible manuscripts. The Journal of Open Source Software even accepts submissions as GitHub pull requests and compiles the entire submission on their system; one example relevant to water research is patRoon 2.0 (ref. 29). The ‘open-source knowledge infrastructure for collaborative and reproducible data science’ Renku facilitates traceability and reproducibility of complex workflows involving networks of interconnected code, data and figure files. It does so by automatic provenance tracking of output files and the creation of a version-controlled git repository containing all information, including the computational environment.

How scientists and publishers can strengthen Open Science in water research

While the facilities and infrastructure needed to perform Open Science are increasingly available, fully Open Science requires a substantial additional effort beyond the generation of manuscripts and data visualisations. A study in 2019 revealed that out of 360 randomly sampled hydrology papers published in 2017, only 4 (i.e., 1%) were fully reproducible. Articles were considered fully reproducible if the results published in the paper could be reproduced by readers based on the accompanying directions, code and data accessible online. Half of the articles already failed at the general data availability check, whereas most of the others had supporting information too incomplete to enable reproduction of results30. To improve on this dire situation, the authors created a survey template to facilitate reproducibility assessments of studies for authors, journals and funders/institutions. More recently, a group of early to mid-career researchers published a practical guideline to Open Science for hydrologists, including approaches for sharing code and documentation and choosing appropriate licenses for facilitating re-usability of research artifacts31.

Several simple steps can be taken to support Open Science using existing systems which, over time, will set the basis for successful Open Science efforts to become the ‘new normal’. The setting of open, community endorsed standards is a key step for every field, with examples including the open mzML standard for raw mass spectrometry data32, NetCDF as an open standard for complex data in hydrology33, the International Chemical Identifier (InChI) in chemistry34, or even the simplicity of the comma or tab separated values (CSV, TSV) formats for exchanging data rather than proprietary Excel (XLS, XLSX) formats. The provision of templates can also help guide researchers in data sharing in specific domains, as recently discussed for chemistry35 and exposomics23, since the use of standardized headers and simple, interoperable formats can greatly enhance re-use of the data and integration into large knowledge bases. Finally, clear article quality criteria focusing on easily verifiable reproducibility and re-usability of research, and associated highlights of articles meeting such criteria, could provide the right combination of facilitating and incentivising Open Science.
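As a minimal illustration of the point about standardized headers and simple formats, the following Python sketch writes a small table to TSV using only the standard library. The column names in `FIELDS`, the helper `to_tsv` and the record shown in the usage note are hypothetical and chosen for illustration; they are not taken from any community template cited here.

```python
import csv
import io

# Hypothetical standardized header; a real project would take this
# from a community-endorsed template instead.
FIELDS = ["site_id", "timestamp_utc", "discharge_m3_per_s"]


def to_tsv(records, fields=FIELDS) -> str:
    """Serialise a list of dict records as tab-separated text.

    Every record must provide exactly the agreed columns, so that files
    from different contributors can be merged without manual clean-up.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        missing = [name for name in fields if name not in record]
        if missing:
            raise ValueError(f"record is missing column(s): {missing}")
        # DictWriter itself raises ValueError for unexpected extra columns.
        writer.writerow(record)
    return buffer.getvalue()
```

Calling `to_tsv([{"site_id": "A", "timestamp_utc": "2020-01-01T00:00:00Z", "discharge_m3_per_s": 1.5}])` with this made-up record yields a header line followed by one tab-separated data line, which can be opened by any text editor, spreadsheet or scripting language without a proprietary reader.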

As discussed above, the need for rapid, open dissemination of findings is ever increasing to ensure the success of large collaborative efforts to preserve wildlife, the human population and water resources in a rapidly changing environment. Since water is a common good, we hope that authors and editors alike will join us in this quest for sustaining and supporting Open Science in water research. Together, many seemingly small steps towards Open Science in water research have the potential to create a world of difference.