Open science resources for the discovery and analysis of Tara Oceans data

The Tara Oceans expedition (2009–2013) sampled contrasting ecosystems of the world oceans, collecting environmental data and plankton, from viruses to metazoans, for later analysis using modern sequencing and state-of-the-art imaging technologies. It surveyed 210 ecosystems in 20 biogeographic provinces, collecting over 35,000 samples of seawater and plankton. The interpretation of such an extensive collection of samples in their ecological context requires means to explore, assess and access raw and validated data sets. To address this challenge, the Tara Oceans Consortium offers open science resources, including the use of open access archives for nucleotides (ENA) and for environmental, biogeochemical, taxonomic and morphological data (PANGAEA), and the development of online discovery tools and collaborative annotation tools for sequences and images. Here, we present an overview of Tara Oceans Data, and we provide detailed registries (data sets) of all campaigns (from port-to-port), stations and sampling events.


Background & Summary
Over many centuries, global expeditions have led to major scientific breakthroughs, notably with the early voyages of the H.M.S. Beagle (1831-1836) and the H.M.S. Challenger (1872-1876). Ocean exploration now provides promising first steps towards understanding the role of the ocean in global biogeochemical cycles and the impact of global climate change on ocean processes and marine biodiversity. Recently, the Sorcerer II expeditions (2003-2010) 1 and the Malaspina expedition (2010-2011) 2 carried out global surveys of prokaryotic metagenomes from the ocean's surface and bathypelagic layer (>1,000 m), respectively. The Tara Oceans Expedition (2009-2013) complemented these surveys by collecting a wide variety of planktonic organisms (from viruses to fish larvae) from the ocean's surface (0-200 m) and mesopelagic zone (200-1,000 m) at a global scale. Overall, Tara Oceans surveyed 210 ecosystems in 20 biogeographic provinces, collecting over 35,000 samples of seawater and plankton. Organising such a knowledge base is essential to safeguard, discover and share Tara Oceans data. To address this challenge, Tara Oceans offers open science resources, including the use of open access data archives and the development of online tools for the collaborative annotation of sequences and images, and the discovery of Tara Oceans data.
Tara Oceans adopts the principle of open access and early release of raw and validated data sets. In the case of molecular data, raw short sequence reads are archived at the European Bioinformatics Institute short read archive (http://www.ebi.ac.uk/ena/) and made available immediately after manual curation of metadata. More advanced data (assemblies, annotations, etc.) will be released immediately after validation and before publication, and other versions will be released when available. In the case of environmental, biogeochemical, taxonomic and morphological measurements, data are published at PANGAEA, Data Publisher for Earth and Environmental Science (http://www.pangaea.de) and made available immediately after manual curation of metadata.
By combining modern sequencing and state-of-the-art imaging technologies, Tara Oceans is at the cutting edge of marine science 3 . The amount of data generated by these technologies is unprecedented in the field of plankton ecology and requires adapted storage infrastructures and collaborative platforms to carry out manual and automated annotation of sequences and high throughput images. These open science resources are currently being developed by Tara Oceans.
A first series of publications has demonstrated the potential of Tara Oceans data to study the ecology of plankton and the structural and functional diversity of viruses, prokaryotes and eukaryotes in the global ocean 4-11 . These publications are based on a fraction of the samples analysed so far and thus represent only the tip of the iceberg. The exploration of Tara Oceans data by the scientific community will undoubtedly lead to new hypotheses and emerging concepts in domains unforeseen by the Tara Oceans Consortium. The current discovery portal of Tara Oceans offers a simple map interface that links each sampling location to available environmental and molecular data (http://www.taraoceansdataportal.org/). It will, however, evolve to offer advanced search functionalities based on geospatial, methodological, environmental, morphological, taxonomic, phylogenetic and ecological criteria.
Here, we present an overview of the sampling strategy and size-fractionation approach of the Tara Oceans Expedition (Methods Section) and we explain the rationale behind the choice of sampling devices (Technical Validation Section). Most importantly, we provide registries (data sets) describing all campaigns (from port-to-port), stations and sampling events (Data Records Section). These registries contain geospatial, temporal and methodological information that will be essential for researchers to explore and assess the quality of Tara Oceans data. Environmental data sets are already available openly, in whole or in part, and additional data sets will be progressively released to the community. We intend to submit additional publications describing specific data types (e.g., Data Citations 1-5) in more detail, further extending the value of this resource as the data becomes available.

Methods
As a research infrastructure, the Tara Oceans Expedition mobilised over 100 scientists to sample the world oceans on board a 36 m long schooner (SV Tara) refitted to operate state-of-the-art oceanographic equipment (Fig. 1). On board the schooner, the team was consistently composed of five sailors and six scientists, including one chief scientist, two oceanography engineers in charge of deck operations, instrument maintenance and data management, two biology engineers preparing and preserving samples for later morphological and genetic analyses, and one optics engineer in charge of imaging live samples on board. A winch equipped with 2,400 m of cable was installed to deploy sampling devices from the stern of the ship, and an industrial peristaltic pump was installed on starboard to sample large volumes of water from various depths down to 60 m. Peristaltic and vacuum filtration systems used to concentrate plankton on membranes of various pore sizes were set up in a laboratory container (wet lab) located outside on port side. Flow-through instruments connected to the continuous surface sampling system were installed in the fore peak and in a laboratory (dry lab) inside the schooner at the centre of the ship on port side.
The sampling strategy and methodology of the Tara Oceans Expedition are presented in six subsections. The first four describe why and how the environmental context was determined [1] at the mesoscale using remote sensing and meteorological data; [2] from sensors mounted on the continuous surface water sampling system; [3] from sensors mounted on the vertical profile sampling system; and [4] from discrete water samplers (Niskin bottles) mounted on the vertical profile sampling system. The last two subsections describe how [5] environmental features were selected and sampled; and how [6] plankton were collected for imaging and genetic analyses. These methods were also described briefly in Karsenti et al. (2011) 3 .
www.nature.com/sdata/ SCIENTIFIC DATA | 2:150023 | DOI: 10.1038/sdata.2015.23
[1] Atmospheric and oceanographic context at the mesoscale
The regular sampling programme was designed to study a variety of marine ecosystems and to target well-defined meso- to large-scale features such as gyres, eddies, currents, frontal zones, upwellings, hot spots of biodiversity, and areas of low pH or low oxygen concentration. A total of 210 stations were characterised at the mesoscale to provide richer environmental context for the morphological and genomic study of plankton (Fig. 2). In order to identify these features before sampling, but also to assess a posteriori whether sampling events carried out during a station were taken within a relatively homogeneous environment, the atmospheric and oceanographic context was determined at the mesoscale, using climatologies, remote sensing products and arrays of Argo profiling floats. Meteorological forecast services, satellite observations (Chlorophyll a, sea surface temperature (SST) and altimetry) and real-time ocean model outputs (Mercator Ocean) were also used on a daily basis to revise sampling positions with respect to the selected oceanographic features.
Mapped altimetry from AVISO (Archiving Validation and Interpretation of Satellite Data in Oceanography), mapped operational SST (OSTIA), and satellite ocean colour (ACRI-ST GlobColour service) were used to describe the spatial and temporal variability of key environmental parameters at each sampling station. In addition, Temperature-Salinity profiles available around sampling stations were compiled from the Argo autonomous network array. Finally, a [BATOS] meteorological station mounted on-board Tara continuously measured wind speed and direction, and air temperature, pressure and humidity, which helped determine the variability of atmospheric conditions and vertical mixing of surface waters.
In addition to the regular sampling programme, topical experiments were designed to study ocean processes that operate at spatial and/or temporal scales larger or smaller than the mesoscale (Fig. 3). Topics included, for example, diurnal processes, storm-induced perturbation of community structure and functions, latitudinal diversity gradients, oxygen minimum zone 6 , island effects on iron fertilization, and longitudinal transport by Agulhas rings across the South Atlantic Ocean 11 . For topical experiments, oceanographic context was sometimes enriched by using autonomous underwater vehicles (e.g., gliders 12 , ProvBio 13 ), surface-tethered Argo drifters, lowered ADCP mounted on the rosette, basin scale eddy-field simulations and climatologies, and state-of-the-art physical models of global ocean circulation with biogeochemistry and genome-informed models of microbial processes 14 . The specific sampling strategy of each topical experiment is available in the respective campaign summary reports (see Data Records Section). Tara Oceans data corresponding to methods described in this section are in part already open to the public at PANGAEA (Data Citation 1).
Figure legend (fragment): [5] dry lab; [6] oceanography engineers' data acquisition and processing area; [7] winch; [8] video imaging area; [9] storage areas at room temperature; [10] storage areas at +4°C and −20°C; [11] MilliQ water system and AC-s system; [12] diving equipment, flowcytobot and ALPHA instruments; and [13]
[2] Properties of seawater and particulate matter from physical, optical and imaging sensors mounted on the continuous surface water sampling system
Continuous measurements of surface water physical, chemical and biological properties serve the dual purpose of a) assessing the boundaries and the homogeneity/heterogeneity of an ecosystem studied during a station, and b) assessing the connectivity between stations. Underway measurements were often used to fine-tune the location of sampling stations that were initially selected based on satellite images. 15 . Systems maintenance (instrument cleaning, flushing) was done approximately once a week and in port between successive campaigns. In the Arctic Ocean and Arctic Seas (2013 campaigns), additional sensors for pH, pCO2, optical backscattering (3 wavelengths), fluorescence emission [ALFA] and surface Photosynthetically Active Radiation [PAR] were added to the in-line system. A [FlowCytobot] also recorded images of microplankton every 20 min. Using daily discrete measurements of CDOM absorption with an [UltraPath] system, we calibrated the [AC-S] to also provide hourly CDOM absorption (besides particulate absorption and attenuation). Data were processed, quality-controlled, and are consistent with
[3] Properties of seawater and particulate & dissolved matter from physical, optical and imaging sensors mounted on the vertical profile sampling system
Repeated deployments of a Rosette Vertical Sampling System [RVSS] during day and night also served the dual purpose of a) assessing the boundaries and the homogeneity/heterogeneity of mesoscale features during a station, and b) assessing the connectivity between stations. These deployments were essential to locate features that have a vertical component and have a signature below the surface, such as eddies, upwellings, fronts, deep chlorophyll maxima, and oxygen minimum zones.
The [RVSS] was specifically designed with various sensors, comprising 2 pairs of conductivity and temperature sensors (Sea-Bird), chlorophyll and CDOM fluorometers (WETLabs), a 25 cm transmissometer for particles 0.5-20 μm (WETLabs), a one-wavelength backscatter meter for particles 0.5-20 μm (WETLabs), and an Underwater Vision Profiler 16 for particles >100 μm and zooplankton >600 μm (Hydroptic). A sbe43 oxygen sensor (Sea-Bird) and an In Situ Ultraviolet Spectrophotometer (ISUS) nitrate sensor (SATLANTIC) were also mounted on the Rosette. In the Arctic Ocean and Arctic Seas (2013 campaigns), a second sbe43 oxygen sensor (Sea-Bird) and a four-frequency acoustic profiler (Aquascat) were added. Each component was powered by specific Li-Ion batteries and CTD data were self-recorded at 24 Hz. All sensors were factory-calibrated before, during and after the four-year programme. Oxygen data were validated using climatologies. Nitrate and fluorescence data were adjusted with discrete measurements from Niskin bottles mounted on the Rosette, and dark calibrations of the optical sensors were performed monthly on board. A total of 837 vertical profiles were made during the Expedition. Additional stand-alone Sea-Bird components [sbe19] and [sbe9S] were exceptionally mounted directly on the oceanographic cable during harsh sea conditions, when the deployment of the rosette was not safe. In addition, apparent optical properties of sea water were measured using a surface tethered

[5] Environmental features and sampling stations
During the Tara Oceans Expedition (2009-2013), plankton were sampled from 5-10 m thick layers in the water column, corresponding to specific environmental features that were characterised on board from sensor measurements. Environmental features are defined by controlled vocabularies in the Environment Ontology (EnvO; http://environmentontology.org/) 17 .
The surface water layer (ENVO:00002042), sometimes labelled in the literature and databases as "surface", "SRF", "SUR", "SURF" or "S", was simply defined as a layer between 3 and 7 m below the sea surface. The deep chlorophyll maximum layer (ENVO:01000326), often labelled in the literature and databases as "DCM" or "D", was determined from the chlorophyll fluorometer (WETLabs optical sensors) mounted on the Rosette Vertical Sampling System [RVSS]. The presence of a DCM may indicate a maximum in the abundance of plankton bearing chlorophyll pigments, or it may result from the higher chlorophyll content of plankton living in a darker environment 18 . This can be assessed a posteriori using water samples analysed for pigments by HPLC methods and from plankton counts. The mesopelagic zone (ENVO:00000213), also labelled in the literature and databases as "MESO" or "M", corresponds to the layer between 200 and 1000 m depths. The sampling depth within the mesopelagic zone was selected based on vertical profiles of temperature, salinity, fluorescence, nutrients, oxygen, and particulate matter. The selected depth varied from station to station, targeting for example a nutricline, a minimum concentration of oxygen, a maximum concentration of particulate matter, or a fixed depth of ca. 400 m when no particular feature could be identified. Other environmental features of special scientific interest include the oxygen minimum zone (ENVO:01000065), often labelled in the literature and data sets as "OMZ" or "O", and the epipelagic mixing layer (ENVO:01000061), also labelled in the literature and data sets as "ML", "MIX" or "X".
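Because the same environmental feature appears under several short labels in the literature and databases, a lookup table can help harmonise records before analysis. The following is a minimal sketch in Python, assuming case-insensitive matching; the names `FEATURE_LABELS`, `LABEL_TO_ENVO` and `envo_for_label` are hypothetical and are not part of any Tara Oceans tool. Only the labels and EnvO identifiers quoted above are used.

```python
from typing import Optional

# EnvO identifiers and the short labels quoted in the text for each feature.
FEATURE_LABELS = {
    "ENVO:00002042": ["surface", "SRF", "SUR", "SURF", "S"],  # surface water layer
    "ENVO:01000326": ["DCM", "D"],                            # deep chlorophyll maximum layer
    "ENVO:00000213": ["MESO", "M"],                           # mesopelagic zone
    "ENVO:01000065": ["OMZ", "O"],                            # oxygen minimum zone
    "ENVO:01000061": ["ML", "MIX", "X"],                      # epipelagic mixing layer
}

# Reverse lookup from label to identifier.
# Case-insensitive matching is an assumption made for this sketch.
LABEL_TO_ENVO = {
    label.upper(): envo_id
    for envo_id, labels in FEATURE_LABELS.items()
    for label in labels
}


def envo_for_label(label: str) -> Optional[str]:
    """Return the EnvO identifier for a feature label, or None if unknown."""
    return LABEL_TO_ENVO.get(label.strip().upper())
```

Unknown labels return `None` rather than raising an error, so that unexpected spellings can be reviewed manually instead of being silently assigned.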
A complete sampling station consisted of collecting plankton from three distinct environmental features, typically the surface water layer, deep chlorophyll maximum layer, and mesopelagic zone (Fig. 4). Such a station lasted typically 24-48 h and special care was taken to reposition SV Tara in order to remain within a radius of 10 km and sample a homogeneous ecosystem as much as possible (see previous two sub-sections). The sequence of sampling deployments varied but generally followed the order illustrated in Fig. 4. Plankton were sampled from a total of 210 stations, of which: 51 stations did not target a specific environmental feature and conducted classical vertical profiles of physical and optical sensors and depth-integrated net tows; 57 stations sampled only the surface water layer; 62 stations sampled the surface water layer and a second depth-specific feature; and 40 stations sampled the surface water layer, the deep chlorophyll maximum layer and a third depth-specific feature. Tara Oceans data corresponding to methods described in this section are already open to the public at PANGAEA (Data Citations 6-8) and are described in the Data Records Section of the present paper.
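The 10 km criterion mentioned above can be checked a posteriori from event positions. The sketch below, in Python, computes great-circle distances with the haversine formula and lists the events that fall outside the radius. It assumes that the radius is measured from a single nominal station position and that the Earth is a sphere of radius 6,371 km; the function names and the coordinates used in any test are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation (assumption)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two positions given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def events_outside_radius(station, events, radius_km=10.0):
    """Return the (lat, lon) events located farther than radius_km from the station.

    `station` is a (lat, lon) tuple taken here as the nominal station position,
    which is an assumption of this sketch.
    """
    return [
        event for event in events
        if haversine_km(station[0], station[1], event[0], event[1]) > radius_km
    ]
```

Events flagged in this way are not necessarily invalid; they simply deserve a closer look at the oceanographic context before being pooled with other events of the same station.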

[6] Marine plankton
Plankton sampled during the Tara Oceans Expedition cover six orders of magnitude in size (10⁻² to 10⁵ μm) and correspond to viruses, giant viruses (giruses), prokaryotes (bacteria and archaea), unicellular eukaryotes (protists), and multicellular eukaryotes (metazoans). These five groups form the bulk of biomass throughout the oceans and drive the global biogeochemical cycles that regulate the Earth system 19-21 . Ocean viruses play an important role in plankton ecology by inducing mortality, horizontal gene transfer, and modulating microbial metabolism 22 . They are thought to target diverse prokaryotic and eukaryotic hosts including microalgae and heterotrophic protists, and to play a role in the evolution of their hosts 23 . Small viruses (<0.2 μm) are known to be ubiquitous and the most abundant plankton in seawater, while the larger giant viruses or giruses (0.2-1 μm) were discovered more recently and are increasingly observed in marine samples 24,25 . Prokaryotes are believed to be responsible for 30% of primary production and 95% of community respiration in oceans 26 and are thus a fundamental component of marine food webs and biogeochemical processes. They are often divided and studied as two size fractions: the free-living prokaryotes range in size from 0.22-3 μm, and those that are attached to larger cells, particles, or aggregates are found in the 3-20 μm size-fraction 27 . In most cases, they are very difficult to culture. Unicellular eukaryotes, or protists, cover a broad range of cell size (0.8-2,000 μm). They are taxonomically very diverse with representatives in all of the 8 super-groups of the eukaryotic tree of life 28 , whose roles in marine and Earth systems ecology are largely unexplored. Only the most abundant groups, such as diatoms and dinoflagellates, have been studied extensively in the field and cultured successfully 29 .
Meso-zooplankton (metazoans; multicellular eukaryotes) range in size from 50 μm to tens of metres in colonial forms, and play a pivotal role in both the transfer of energy to higher trophic levels such as fish and other large predators, and in the vertical export of particulate matter produced at the surface of the ocean 30 . Their life history (e.g., metabolism, development, locomotion, reproduction, feeding) and their body-size are important properties affecting these two processes 31 .
Various sampling methods were used to capture the diversity of both the dominant and less abundant organisms described above (see Technical Validation Section). These methods effectively separated organisms into 10 size fractions: <5 μm (or <3 μm), 5-20 μm (or 3-20 μm), <20 μm, 20-180 μm and 180-2,000 μm for planktonic viruses, prokaryotes and unicellular eukaryotes, and >50 μm, >200 μm, >300 μm, >500 μm and >680 μm for large planktonic unicellular eukaryotes and metazoans. Whenever possible, replicate sampling was performed to assess plankton natural variability and to ensure long-term storage of samples in view of future re-analysis using new technologies, notably in the fields of high throughput imaging and -omics which are evolving extremely rapidly.
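Since several of the 10 size fractions overlap, an organism of a given size can nominally belong to more than one fraction. The following Python sketch lists the fractions whose nominal bounds contain a given size. It is an illustration only: the dictionary keys, the function name and the treatment of boundary values (lower bound inclusive, upper bound exclusive) are assumptions, the alternative <3 μm and 3-20 μm thresholds are not encoded, and nominal bounds say nothing about actual retention efficiency.

```python
import math

# Nominal bounds (lower, upper) in micrometres for the 10 size fractions.
# The alternative thresholds (<3 and 3-20 μm) are omitted in this sketch.
SIZE_FRACTIONS_UM = {
    "<5": (0.0, 5.0),
    "5-20": (5.0, 20.0),
    "<20": (0.0, 20.0),
    "20-180": (20.0, 180.0),
    "180-2000": (180.0, 2000.0),
    ">50": (50.0, math.inf),
    ">200": (200.0, math.inf),
    ">300": (300.0, math.inf),
    ">500": (500.0, math.inf),
    ">680": (680.0, math.inf),
}


def nominal_fractions(size_um):
    """Return the names of all fractions whose nominal bounds contain size_um.

    Boundary handling (lower inclusive, upper exclusive) is arbitrary.
    """
    return [
        name for name, (lower, upper) in SIZE_FRACTIONS_UM.items()
        if lower <= size_um < upper
    ]
```

For example, a 10 μm cell falls nominally in both the 5-20 μm and the <20 μm fractions, which is one reason why the same group of organisms can be expected in several fractions.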
Detailed protocols concerning the filtration, preservation and storage of plankton samples will be described in a separate publication. Morphological data will be openly released at PANGAEA (http://www.pangaea.de) and nucleotides data will be openly released as they become available at the European Nucleotide Archive (http://www.ebi.ac.uk/ena/).
Whole seawater collected by these devices was then pre-filtered successively on nylon conical sieves with a mesh of 200 μm and 20 μm, and additionally 5 μm for protists (Table 1 (available online only)). The filtrate was collected in four to six 100-L polyethylene containers, which were thoroughly washed with 0.1% bleach, rinsed twice with fresh water and rinsed again twice with the filtrate. Depending on protocols, the <5 μm and <20 μm filtrates were further fractionated on board using one or a combination of membranes with pore sizes 0.1 μm, 0.2 μm, 0.45 μm, 0.7 μm, 0.8 μm, 1.6 μm or 3 μm. The retention efficiency of meshes, pore-membranes and fibre-filters is a matter of constant debate in plankton ecology. Organisms display various shapes, including high length-to-width ratios; some may easily "squeeze" through pores smaller than their "normal" size, and others may form colonies or tend to aggregate into particles much larger than their individual size. We do not intend to assess the efficiency of the various meshes and filters in retaining the different groups of organisms targeted during the Tara Oceans Expedition. We simply picked commonly used size-fractions and accept the fact that organisms or parts of organisms from the different groups may be present in several size-fractions.
The choice of size thresholds used to collect small eukaryotes, and prokaryotes associated with small particles or with eukaryotes, varied during the Tara Oceans Expedition, between 3-20 μm and 5-20 μm. Plankton from that size fraction comprise organisms that are often not abundant enough in whole seawater and often too fragile to be collected with plankton nets that are themselves too delicate to be deployed in rough seas. The sampling method was therefore weather-dependent and often a combination
bleach, rinsed twice with freshwater and rinsed again twice with seawater pre-filtered on 0.1 μm. The volume of net sample was adjusted to 3 L with 0.1 μm pre-filtered seawater. After each use, nets, cod-ends, and sieves were rinsed with fresh water and checked for holes.
[6b] Sampling large planktonic unicellular eukaryotes and metazoans. Sampling devices used to concentrate and collect the larger and less abundant organisms (>50 μm size fractions) consisted of plankton nets with mesh sizes ranging from 50 to 680 μm [NET-TYPE-MESH] and metal pan-shaped sieves [SIEVE-MESH] to remove large organisms as needed (Table 1 (available online only)). All nets were equipped with a flow meter and a temperature-depth recorder, and their depth was monitored and adjusted during deployments using an acoustic SCANMAR system. Upon recovery, all nets were rinsed from the outside with running seawater. Cod-ends and metal sieves used to size-fractionate samples were rinsed with running seawater pre-filtered successively on 25 μm and 0.1 μm, using Polygard-CR Cartridge Filters (CR2501006, CRK101006). After each use, nets, cod-ends, and sieves were rinsed with fresh water and checked for holes.
Organism-selectivity and capture-efficiency of plankton nets depend on the mesh size and deployment methods, i.e., depth, tow method (oblique/vertical/horizontal), tow speed and time of day 32 . During the Tara Oceans Expedition, plankton nets were deployed during day and night in order to capture the nycthemeral vertical migrations. Both [NET-WPII-50] and [NET-WPII-200] were towed vertically or obliquely from a depth of 100 m to the surface, during night and daytime, at a speed of 0.3-0.5 m/s depending on weather conditions. Both [NET-BONGO-300] and [NET-REGENT-680] were towed obliquely from a depth of 500 m to the surface, during night and daytime, at a speed of 0.5 m/s depending on weather conditions. Net samples were preserved on-board with buffered formaldehyde, ethanol or RNA-Later for later morphological and/or molecular analyses.
Where time and weather allowed, a multiple opening-closing net equipped with 5 nets of 300 μm mesh size [NET-MULTI-300] was deployed preferentially at night or during daytime to study the vertical distribution of zooplankton. Nets opened and closed at selected depths between 1,000 m and the surface, according to water column features identified from vertical profiles of temperature, salinity, fluorescence, nutrients, oxygen, and particulate matter. Samples were preserved on-board in buffered formaldehyde. The Underwater Vision Profiler (UVP) mounted on the rosette was also used to study the vertical distribution of zooplankton >600 μm during day and night.
In 2011-2013, a neuston net [NET-MANTA-500] was towed at the surface for about 1 h at a speed of 0.7 m/s in order to collect plastic particles and associated organisms. Samples were preserved in ethanol for later morphological and molecular analyses. Finally, a Continuous Plankton Recorder (CPR) was deployed between stations in 2013. Samples were preserved in formaldehyde and sent to the Sir Alister Hardy Foundation for Ocean Science (SAHFOS) for later morphological and molecular analyses.
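The tow depths, speeds and durations given in this sub-section translate into simple orders of magnitude for tow time and distance. The Python sketch below shows the arithmetic; it assumes a constant speed along a straight path, so for oblique tows the computed duration is only a lower bound. Function names are hypothetical, and the volume filtered is deliberately not computed because net mouth areas are not given here.

```python
def tow_duration_s(path_length_m, speed_m_s):
    """Time in seconds needed to haul a net over path_length_m at constant speed."""
    if speed_m_s <= 0:
        raise ValueError("speed must be positive")
    return path_length_m / speed_m_s


def tow_distance_m(speed_m_s, duration_s):
    """Distance in metres covered at constant speed during duration_s seconds."""
    return speed_m_s * duration_s


# Vertical tow from 100 m to the surface at 0.3-0.5 m/s:
slowest = tow_duration_s(100, 0.3)  # about 333 s
fastest = tow_duration_s(100, 0.5)  # 200 s

# Surface tow of about 1 h at 0.7 m/s, as for the neuston net:
neuston_distance = tow_distance_m(0.7, 3600)  # about 2,520 m
```

Such estimates are useful mainly as plausibility checks against the flow meter and temperature-depth recorder readings logged for each deployment.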

Data Records
Tara Oceans developed best practices for the standardisation and interoperability of data generated across environmental, morphological and molecular analyses. This effort contributed to the publication of a set of standards for reporting and serving data in Marine Microbial Biodiversity, Bioinformatics and Biotechnology (M2B3) 33 . Here we describe three levels of the M2B3 reporting standard: campaigns, stations and events. For each level, we provide a registry of all campaigns/stations/events, pdf documents describing each campaign/station/event, and uniform resource locator (URL) queries to access related nucleotides and environmental data.

Registries
The campaigns registry provides details about the scientific interest of each campaign, a list of scientists on board, and URLs for the corresponding campaign summary report (pdf), environmental data sets and nucleotides data sets. The stations registry provides details about the geographic context of each station, including mean and maximum bathymetric depth (extracted from the General Bathymetric Chart of the Oceans; GEBCO), minimum distance from the coast, the corresponding marine biomes and biogeographical provinces defined by Longhurst 34 , and, when applicable, information about the corresponding exclusive economic zone and related legal aspects. Additionally, for each station, we provide information about the environmental features that were sampled and their depth, and the number of deployments carried out with the different sampling devices listed in Table 1 (available online only). URLs provide access to the corresponding oceanographic context report (pdf), environmental data sets and nucleotides data sets.
The events registry provides details about the sampling date, time, location and methodology of each event. URLs provide access to the corresponding event log sheet (pdf), environmental data sets and nucleotides data sets. Sampling events occurring outside the context of a station were assigned the station label TARA_999. Such events include for example underway measurements of the on-board meteorological station [BATOS] and of the continuous surface sampling system [CSSS], and exceptional deployments of plankton nets [NET-TYPE-MESH] or of the rosette vertical sampling system [RVSS].
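When working with the events registry, it is often convenient to separate events that belong to a station from those carried out outside the context of a station. A minimal Python sketch follows, assuming that each event has been loaded as a dictionary with a `station` key; this record layout and the function name are hypothetical and do not reflect the actual column names of the registry.

```python
OUT_OF_STATION_LABEL = "TARA_999"  # station label assigned to out-of-station events


def split_by_station_context(events):
    """Split event records into (in_station, out_of_station) lists.

    Each record is assumed to be a dict with a "station" key (hypothetical layout).
    """
    in_station, out_of_station = [], []
    for event in events:
        if event["station"] == OUT_OF_STATION_LABEL:
            out_of_station.append(event)
        else:
            in_station.append(event)
    return in_station, out_of_station
```

Keeping the two groups apart avoids, for instance, averaging underway measurements together with measurements taken within a station.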

Reports and log sheets
Campaign reports were written by the chief scientist and scientific crew in order to document the objectives and main achievements of each campaign, as well as any deviations from the regular sampling programme (e.g., TARA_20110401Z_report.pdf). The station reports were written by Flavian Kokoszka, Rémi Laxenaire and Sabrina Speich to provide background knowledge of the physical oceanography at each station (e.g., TARA_100_oceanographic_context_report.pdf). Finally, event log sheets were filled in on board each time a sampling device was deployed, recording the position, date, time, type of device, sampling depths, volume sampled, ID and filename of sensor outputs, operator's comments, and unique identifiers (barcodes) of samples collected during the event (e.g., TARA_20110415T1312Z_100_EVENT_CAST.pdf). Most of the information found in reports and log sheets was extracted manually, quality checked using controlled vocabularies, and archived in the campaigns, stations and events registries. Nevertheless, these narrative documents remain a valuable and complementary source of information. The three registries contain uniform resource locators (URLs) pointing to the reports and log sheets of each campaign/station/event. Reports and log sheets can also be browsed directly in the PANGAEA store:
Campaign reports: http://store.pangaea.de/Projects/TARA-OCEANS/Campaign_Reports/
Station reports: http://store.pangaea.de/Projects/TARA-OCEANS/Station_Reports/
Event log sheets: http://store.pangaea.de/Projects/TARA-OCEANS/Logsheets_Event/

Up-to-date lists of available nucleotides and environmental data sets
Tara Oceans data will progressively be released openly following their analysis and validation. Up-to-date lists of nucleotides and environmental data sets can be obtained from uniform resource locator (URL) queries that are made specific to any campaign/station/event by using labels from the campaigns, stations and events registries.
A list of environmental data sets published at PANGAEA can be obtained by combining the following base URL: http://www.pangaea.de/search?q= with a search term. The URL query is made specific to any Tara Oceans campaign, station or event by adding the corresponding label as the search term.
Fig. 5 legend (fragment): (Table 2) for viruses (including giant viruses), prokaryotes, protists and metazoans (coloured boxes). The sampling devices used to collect plankton <5 μm in size (i.e., high volume peristaltic pump and rosette with Niskin bottles) and >5 μm in size (i.e., plankton nets) are illustrated as well on the horizontal plane. The vertical plane shows the volume of seawater required to capture 100, 75 and 50% of species richness reported in the literature (Table 2) for viruses (including giant viruses), prokaryotes, protists and metazoans (shaded boxes). The typical volumes of seawater collected by sampling devices are shown in comparison (horizontal thick lines). Also illustrated on the vertical plane: sieves were used to remove large organisms from protists net samples.
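The PANGAEA query described in this sub-section, a base URL followed by a label used as search term, can be assembled programmatically. The Python sketch below uses only the standard library; the function name is hypothetical, and the label `TARA_100` is taken from the station report file name quoted earlier on the assumption that it is also the station label used in the registries. URL-encoding of the label is a precaution added here, not a requirement stated in the text.

```python
from urllib.parse import quote_plus

PANGAEA_SEARCH_BASE = "http://www.pangaea.de/search?q="  # base URL given in the text


def pangaea_query_url(label: str) -> str:
    """Build a PANGAEA search URL for a campaign, station or event label."""
    # quote_plus leaves letters, digits and underscores unchanged and
    # encodes any character that is not safe in a query string.
    return PANGAEA_SEARCH_BASE + quote_plus(label.strip())


print(pangaea_query_url("TARA_100"))  # http://www.pangaea.de/search?q=TARA_100
```

The same function can be applied to every label of a registry to generate one query per campaign, station or event.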

Technical Validation
Here we provide a first order validation of the Tara Oceans sampling methodology by compiling published values of plankton cell/body size, natural abundance and richness (Table 2). These are compared to the sampling volume and mesh size of the different sampling methods (Fig. 5).
Life history traits such as cell/body size and the natural range of abundance determine the general structure and dynamics of food webs and other ecological networks, across multiple scales of organisation 31,35 . Here we characterise the five groups of plankton by their size and abundance in seawater, using values from the literature (Table 2). The ranges of these characteristics are summarised for each plankton group using coloured areas on the horizontal plane of Fig. 5. As already described for a wide range of organisms 36 , the literature shows an inverse relationship between plankton size and abundance in the natural environment, so that small viruses (10⁻² to 10⁰ μm) generally form the most abundant group (10⁷ to 10¹¹ ind. L⁻¹), whereas the larger metazoans (10¹ to 10⁵ μm) are generally the least abundant group (10⁻⁴ to 10³ ind. L⁻¹).
Species richness and evenness are used to estimate species diversity, and should therefore be considered when designing sampling strategies and methodologies for biodiversity studies. Here we characterise the five groups of plankton by their species richness in seawater, using values from the literature (Table 2 and references therein). From these values we made "back of the envelope" calculations of the volume of seawater required to capture 100, 75 and 50% of species within each group of plankton (Fig. 5; coloured areas on the vertical plane). The effectiveness of our sampling strategy can be assessed by comparing these coloured areas with the sampling volume of the various sampling devices used during the Tara Oceans Expedition (horizontal full lines on the vertical plane).
Based on this assessment, it appears that our sampling strategy would capture <50% of total richness for viruses and small size protists (0.8-5 μm). Accordingly, one would need to filter thousands of litres of seawater in order to capture 75% of total richness for these groups. This is both impractical for most field campaigns and dependent on how one defines the currency of richness for these groups, i.e., the concept of species. In all other groups and size fractions, our sampling strategy appears to capture >75% of species richness, and 100% in the case of large size protists and metazoans. It is important to note that data about plankton richness are very scarce in the literature, so that this assessment is only a first approximation. Tara Oceans data will undoubtedly help to fill this knowledge gap and improve the sampling design of future ocean biodiversity surveys.
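The link between natural abundance and required sampling volume can be illustrated with a simple calculation. The Python sketch below is not the richness-based calculation summarised in Fig. 5, whose details are not reproduced here; it only shows, under the assumption that individuals are randomly (Poisson) distributed in seawater, how many individuals are expected in a sample and what volume is needed to capture at least one individual of a taxon with a given probability. Function names and the default probability are hypothetical choices.

```python
import math


def expected_count(abundance_per_l, volume_l):
    """Expected number of individuals in a sample of volume_l litres."""
    return abundance_per_l * volume_l


def volume_for_detection(abundance_per_l, probability=0.95):
    """Volume (L) needed to capture at least one individual with the given probability.

    Assumes randomly distributed individuals, so that the count in a sample is
    Poisson and P(at least one) = 1 - exp(-abundance * volume).
    """
    if abundance_per_l <= 0:
        raise ValueError("abundance must be positive")
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must lie strictly between 0 and 1")
    return -math.log(1.0 - probability) / abundance_per_l


# Lower bound of metazoan abundance quoted in the text: 1e-4 ind. per litre.
rare_metazoan_volume = volume_for_detection(1e-4)  # roughly 3e4 L
```

Under these assumptions, the rarest metazoans call for tens of cubic metres of seawater, a volume that only net tows can filter, whereas the same formula gives sub-millilitre volumes for viruses at their quoted abundances. Capturing a given fraction of total richness is a different and harder problem, since it also depends on the evenness of the community.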

Usage Notes
The Tara Oceans data policy follows the open science principle of open access and early release of raw and validated data sets. All data presented here (Data Citations 6-8) are published under the Creative Commons Attribution 3.0 Unported licence (CC BY 3.0) and must therefore be cited when used in scientific papers, posters or presentations. As with most scholarly publications, data citations have authors, a title, a year of publication and a digital object identifier (see data citations in the Reference Section). Furthermore, we kindly ask users to include the Tara Oceans Consortium in the acknowledgements. When referring to the Tara Oceans Data or to the sampling strategy and methodology of the Tara Oceans Expedition, please cite the present paper.