A database of marine phytoplankton abundance, biomass and species composition in Australian waters

There have been many individual phytoplankton datasets collected across Australia since the mid 1900s, but most are unavailable to the research community. We have searched archives, contacted researchers, and scanned the primary and grey literature to collate 3,621,847 records of marine phytoplankton species from Australian waters from 1844 to the present. Many of these are small datasets collected for local questions, but combined they provide over 170 years of data on phytoplankton communities in Australian waters. Units and taxonomy have been standardised, obviously erroneous data removed, and all metadata included. We have lodged this dataset with the Australian Ocean Data Network (http://portal.aodn.org.au/) allowing public access. The Australian Phytoplankton Database will be invaluable for global change studies, as it allows analysis of ecological indicators of climate change and eutrophication (e.g., changes in distribution; diatom:dinoflagellate ratios). In addition, the standardised conversion of abundance records to biomass provides modellers with quantifiable data to initialise and validate ecosystem models of lower marine trophic levels.

Background & Summary
Phytoplankton are microalgae at the base of the food web and directly or indirectly support all marine life. As highly efficient primary producers, they are critical to maintaining biodiversity and supporting fisheries throughout the ocean 1 . Due to their high turnover rates and sensitivity to changes in environmental conditions, phytoplankton are useful indicators of changing oceanographic conditions, climate change, and deterioration in water quality 2,3 . Some phytoplankton produce toxins which may be accumulated by filter-feeding shellfish, causing irritation, serious illness or death to animals and humans. A litre of seawater may contain up to one million algal cells, representing at least 100-150 different species 4 . These include the larger phytoplankton, dominated by the diatoms and dinoflagellates, but also include flagellates and the coccoid picoplanktonic forms. Currently, 537 infraspecific dinoflagellate and 938 diatom taxa are known to inhabit coastal and oceanic waters around Australia 5,6 . The fractions of nanoplankton flagellates and coccoid picoplankton, although smaller in size, can account for up to 90% of the total phytoplankton chlorophyll under low biomass scenarios in offshore waters 4 . The latter are difficult, or impossible, to identify with light microscopy. Including a reliable estimate of biomass along with cell abundance provides more realistic information about the phytoplankton community structure at a particular point in time.
Many researchers hold phytoplankton data on paper records or in spreadsheets, some published and some unpublished, which may eventually be lost or misplaced. Consultancies often hold data archives from several research projects, from which data are released only to the client, with no time or incentive to publish. These data, once archived thoroughly, standardised and made freely available to a wider audience, are an invaluable resource for the research community. Small, individual datasets with limited stand-alone impact can collectively provide valuable additions to large-scale spatial and temporal studies. Many of these data have previously been published in some form. Journal articles, theses and reports, and especially older publications, rarely include the relevant dataset, so the data are not available for use after publication unless the author releases them to a publicly accessible platform. Even when data are released, they are often only presence data and are not as explicit as they could be.
The Australian Phytoplankton Database has been collated from literature, active and retired researchers, consultancies, archives and databases (Fig. 1). Only data with the relevant corresponding metadata about collection location, date and methods have been included. Figure 2 shows the spatial extent of the records collected in this data set. Taxonomic identification of many phytoplankton taxa is fraught with difficulties, especially when limited to light microscopy. Whilst all names have been standardised to the current accepted classification as given by the World Register of Marine Species (WoRMS) (http://www.marinespecies.org/aphia.php?p=webservice), it is clearly not possible to verify every identification made by each analyst.
The Australian Phytoplankton Database contains data on marine phytoplankton abundance, biomass and species composition. It can be used to:
• develop distribution or biogeographic maps 4,7 ;
• determine community structures and range changes over time 8,9 or oceanographic conditions 10 ;
• understand dynamics of harmful algal species to help inform the aquaculture industry 11 ;
• develop inputs to climate, ecosystem and fisheries models to inform management about resources 12 .
The Australian Phytoplankton Database is available through the Australian Ocean Data Network (AODN: http://portal.aodn.org.au) portal. This portal is the main repository for marine data in Australia. The Phytoplankton Database will be maintained in the CSIRO data centre and can be updated with new records, which will automatically upload to the AODN. Researchers wishing to submit new data should contact the corresponding author or the AODN. A snapshot of the Australian Phytoplankton Database as it is at the time of this publication has been assigned a DOI and will be maintained in perpetuity by the AODN (Data Citation 1).
The Australian Phytoplankton Database has been built with ease of use and the minimisation of user error in mind. It therefore provides clean data at a level that requires minimal interpretation. The CSIRO database holds all the raw data from these datasets, for example original species names and ambiguous records, and researchers can request further information from the corresponding author if required.

Methods
Samples were mainly collected from Niskin bottles, net drops or tows, and the Continuous Plankton Recorder (CPR). These are all standard methods of collecting phytoplankton samples 13,14 and many are still largely reliant on a phytoplankton manual written in 1978 (ref. 15). A few samples were collected using automated samplers on moorings 16 . The sampling was done from research vessels, container ships and small boats by experienced researchers, students and volunteers. The majority of the samples were preserved using Lugol's solution, although formalin, paraformaldehyde and glutaraldehyde have also been used. Different methods of preservation can affect the condition of the sample and which taxa are well preserved 4,17 . Samples were analysed with standard methods, including light microscopy and transmission or scanning electron microscopy, which are described in Hallegraeff, et al. 4 . All methodological variations within our phytoplankton database are detailed in the metadata where available and recorded for each data entry. Where available, a citation is referenced for each project, which gives details on methodologies and limitations from that project (Table 1 (available online only)).
There were three stages in data gathering. The first stage was to conduct a literature review of Australian phytoplankton data. Any literature that contained abundance or presence data was digitised and uploaded into the CSIRO-maintained Oracle database. The second stage was to scan existing databases, such as the CSIRO data trawler, the Ocean Biogeographic Information System (OBIS) and the Atlas of Living Australia (AOLA). These repositories only store presence records. Relevant data were selected and uploaded into the database. The third stage was to ask researchers to contribute any other data sets that they had. All data were organised into a standard format and uploaded into the database. Data were then served to and hosted by the AODN.
All taxa have been verified as accepted species and given the currently accepted name as defined by the World Register of Marine Species database (WoRMS, http://www.marinespecies.org/aphia.php?p=webservice). If any taxon could not be verified, then a second check was done through AlgaeBase (http://www.algaebase.org/). If this did not verify that taxon as a valid name, then the identification was reduced to a coarser taxonomic level that could be verified, or the entry was removed. All abundance values were standardised to cells l⁻¹ or are given as presence only. Records of the original identifications and units were archived so any records can be checked.
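As an illustration of the unit standardisation described above, the following minimal sketch converts abundances reported in a few common volume units to cells per litre. The unit labels and the function name are hypothetical and are not taken from the database; presence-only records are represented here by `None`, which is also an assumption.

```python
# Multipliers that turn a count per stated volume into a count per litre.
# The unit labels are hypothetical; the strings used in the contributing
# datasets are not specified in the text.
TO_CELLS_PER_LITRE = {
    "cells/L": 1.0,
    "cells/mL": 1000.0,  # 1 L = 1000 mL
    "cells/m3": 0.001,   # 1 m^3 = 1000 L
}


def to_cells_per_litre(value, unit):
    """Return the abundance in cells per litre.

    A value of None marks a presence-only record and is passed through.
    """
    if value is None:
        return None
    if unit not in TO_CELLS_PER_LITRE:
        raise ValueError(f"unknown abundance unit: {unit!r}")
    return value * TO_CELLS_PER_LITRE[unit]
```

Raising an error on an unrecognised unit, rather than guessing, keeps ambiguous records visible so they can be checked against the archived original units.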
Identification of the smaller phytoplankton is often to a coarser taxonomic level as many cannot be distinguished to species using light microscopy. In some studies electron microscopy has been used to determine species, but in other studies functional groups have been identified. This data set does not include accessory pigment data which can help resolve these smaller taxa 18 although it can be thought of as a complementary dataset. Over 20 years of pigment data are available in Australia via the AEsOP database (http://aesop.csiro.au/).
Cell biovolume is calculated as per Hillebrand, et al. 19 following the suggested shape factors for each genus. Size parameters were estimated from measurements taken from Australian samples or Australian references where available 4,20 and from other sources where not 21-23 . In some data sets (e.g., P597 the Australian Continuous Plankton Recorder Survey and P599 the IMOS National Reference Stations), direct measurements of size classes were used in preference to literature values. For some taxa, generally the rare ones, there was insufficient information available to estimate a biovolume. Rather than estimate a size class without any information, these have been left blank.
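The geometric approach to biovolume can be sketched as follows for three simple shapes (sphere, cylinder and prolate spheroid) using their standard volume formulas. This is a minimal illustration only: which shape applies to which genus, and the dimensions used, are not reproduced here, and the shape labels and function names are hypothetical.

```python
import math


def sphere_volume(diameter):
    """Volume of a sphere: pi/6 * d^3."""
    return math.pi / 6.0 * diameter ** 3


def cylinder_volume(diameter, height):
    """Volume of a cylinder: pi/4 * d^2 * h."""
    return math.pi / 4.0 * diameter ** 2 * height


def prolate_spheroid_volume(diameter, height):
    """Volume of a prolate spheroid: pi/6 * d^2 * h."""
    return math.pi / 6.0 * diameter ** 2 * height


def biovolume_um3(shape, diameter=None, height=None):
    """Return cell biovolume in cubic micrometres for dimensions in micrometres.

    Returns None when a required dimension is missing, mirroring the
    decision to leave biovolume blank rather than guess a size.
    """
    if shape == "sphere":
        return None if diameter is None else sphere_volume(diameter)
    if shape in ("cylinder", "prolate_spheroid"):
        if diameter is None or height is None:
            return None
        if shape == "cylinder":
            return cylinder_volume(diameter, height)
        return prolate_spheroid_volume(diameter, height)
    raise ValueError(f"unsupported shape: {shape!r}")
```

For example, `biovolume_um3("cylinder", diameter=10, height=20)` gives the volume of a cell 10 µm across and 20 µm tall, while omitting either dimension yields `None`.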

Data Records
Each data record represents the abundance or presence of a phytoplankton taxon at a certain point in space and time and has been given a unique record identification number, P(project_id)_(sample_id)_(record_id). Each data record belongs to a project, with each project having a unique identification number, Pxxx. A project is defined as a set of data records which have been collected together, usually as a cruise or study with the same sampling method and the same person counting the samples. Metadata ascribed to a project relate to all the data records within that project. Details to identify each separate project are given in Table 1 (available online only). Each sample within a project has a unique sample_id. The sample_id has not been changed from the original data set, to maintain traceability. A sample_id may therefore be duplicated between projects, but P(project_id)_(sample_id) will be a unique entity in space and time. Species abundance records within the sample are given a unique record_id.
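The identifier scheme can be illustrated with a short sketch that builds and splits a key of the form P(project_id)_(sample_id)_(record_id). It assumes that project_id and record_id never contain an underscore, whereas sample_id, being kept from the original data set, might; that assumption, the function names and the example sample_id are not taken from the database.

```python
def make_record_key(project_id, sample_id, record_id):
    """Build a key of the form P(project_id)_(sample_id)_(record_id)."""
    return f"P{project_id}_{sample_id}_{record_id}"


def parse_record_key(key):
    """Split a record key into (project_id, sample_id, record_id) strings.

    Assumes project_id and record_id contain no underscore, so the first
    and last underscores delimit the sample_id, which may itself contain
    underscores because it is retained from the original data set.
    """
    if not key.startswith("P") or key.count("_") < 2:
        raise ValueError(f"not a record key: {key!r}")
    project_part, rest = key.split("_", 1)
    sample_id, record_id = rest.rsplit("_", 1)
    return project_part[1:], sample_id, record_id
```

With a made-up sample_id, `make_record_key(599, "NSI_2010", 42)` gives `"P599_NSI_2010_42"`, and parsing that key recovers the three parts as strings.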
The majority of these projects have been uploaded as part of the collation of data for this database. None have been previously published as datasets, but the IMOS National Reference Stations, P599, and the Continuous Plankton Recorder Survey, P597, which together constitute half the data in this database, are continually updated and available through the AODN (https://portal.aodn.org.au/search?uuid=dfef238f-db69-3868-e043-08114f8c8a94 and https://portal.aodn.org.au/search?uuid=c1344979-f701-0916-e044-00144f7bc0f4 respectively). Table 1 (available online only) gives summary information on the project data sets, their space, time and taxonomic resolutions, and the numbers of samples and records available. Users can select data sets from this information and download them as desired through the AODN.

Technical Validation
The Phytoplankton Database will provide an extensive resource for phytoplankton researchers, although there are some caveats due to the variety of the sampling and analysis protocols.
The variety of sample collection methods means that abundances might not always be directly comparable across projects. For example, quantitative methods such as bottle sampling (e.g., Project 599) will collect all but the rarest phytoplankton and will include the whole size spectrum, whereas semi-quantitative methods such as net sampling are selective and dependent on tow method, mesh size and the mix of species present in the water, as some species may clog the net and trap smaller species that would otherwise pass through the mesh (e.g., Project 509) 21 . By including all data collected using different methods and including this information as metadata, researchers are able to analyse the relative abundance of each taxon within a project and compare across compatible projects. Metadata include as much detail as is available about sampling methods and limitations and provide guidance to users about the potential of each data set. Users should consider collection methods, preservation techniques and microscopic limitations when comparing datasets. All datasets have been standardised to taxa/m³ of water except P1070, where the units are taxa per gram of substrate measured. This project collected the phytoplankton by collecting substrates and analysing parts of the substrate. It is included here as the only data set on Gambierdiscus and associated benthic dinoflagellates from Australia, which are important to studies of ciguatera.
All datasets submitted can be interpreted as confirming the presence of those species recorded. In some datasets, e.g., time series, it is also possible to infer absences, assuming that all species are looked for on each sampling occasion. Absences have been included in the data sets where the project information available allowed us the confidence to interpret such absences correctly. The interpretation of absences from other projects is at the discretion of the user. We suggest that a project-by-project approach should be taken. If a taxon is not observed at all in a project, then the absence could be due to the taxon not being present, the taxon not being of interest to the analyst, or the inability to identify that taxon. Thus, a real absence should not be inferred. If a taxon is observed in some samples of a project, it can most likely be assumed that the microscopist could identify the taxon, and that a real absence may be inferred in samples within that project where the taxon was not marked as present.
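The project-by-project rule for absences can be sketched as below. The input is assumed to be a list of (project_id, sample_id, taxon) presence tuples, a layout chosen for illustration rather than the database's actual schema, and the taxon names in the usage note are placeholders.

```python
from collections import defaultdict


def infer_absences(presence_records):
    """Infer absences within each project from presence records.

    presence_records: iterable of (project_id, sample_id, taxon) tuples.
    A taxon is treated as absent from a sample only if it was recorded in
    at least one other sample of the same project. Taxa never recorded in
    a project are not inferred as absent there.
    """
    samples_by_project = defaultdict(set)
    taxa_by_project = defaultdict(set)
    present = set()
    for project_id, sample_id, taxon in presence_records:
        samples_by_project[project_id].add(sample_id)
        taxa_by_project[project_id].add(taxon)
        present.add((project_id, sample_id, taxon))

    absences = set()
    for project_id, samples in samples_by_project.items():
        for sample_id in samples:
            for taxon in taxa_by_project[project_id]:
                if (project_id, sample_id, taxon) not in present:
                    absences.add((project_id, sample_id, taxon))
    return absences
```

If taxon "A" is recorded in samples s1 and s2 of a project and taxon "B" only in s2, the sketch infers that "B" is absent from s1; a taxon seen only in a different project is never inferred as absent.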
Some data records were removed when there was ambiguity as to the identification of the taxon, i.e., when the taxonomic traceability, usually from older sources, was confused or when spelling mistakes made it unclear which one of two species was meant. Species known to be freshwater species were removed, as the methods used to collect data were not aimed at freshwater environments and the inclusion of occasional records of these species would not be comprehensive or meaningful. Estuarine species were captured and the records kept. Data records with positions on land, with an unrealistic abundance value or with impossible dates were also removed or converted to presence records.
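A few of these checks can be expressed as a short sketch. It flags coordinates outside the valid latitude and longitude ranges, sampling dates in the future, and negative or non-finite abundances. These specific rules and the function name are assumptions made for illustration; detecting positions on land would additionally need a coastline or land mask, which is not attempted here.

```python
import math
from datetime import date


def check_record(lat, lon, sample_date, abundance, today=None):
    """Return a list of problems found in one record (empty if none).

    abundance may be None for a presence-only record.
    """
    today = today or date.today()
    problems = []
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        problems.append("position")
    if sample_date > today:
        problems.append("date")
    if abundance is not None and (not math.isfinite(abundance) or abundance < 0):
        problems.append("abundance")
    return problems
```

A record flagged only for "abundance" could be kept as a presence record, whereas one flagged for "position" or "date" would need checking against the original source.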

Usage Notes
The database contains information on the functional group of the species, which can aid analysis. Functional groups include diatoms, dinoflagellates, flagellates, ciliates (including tintinnids), silicoflagellates, and cyanobacteria. Once downloaded and binned as required, data are suitable for use in the creation of ecological indicators. For example:
• Total diatoms, dinoflagellates
• Diatom:Dinoflagellate ratio
• Total phytoplankton abundance or biomass per degree square
Abundance (cells l⁻¹) for the phytoplankton counts is given where it is available, providing more information about the productivity of an area than presence data alone. A low cell abundance may indicate a low level of production, but this may not be the case if these are large cells. The biomass data help to show the productivity of an area. The biovolume has been calculated for each cell count and, when converted to biomass, is available for use by modellers and to assist in interpretation of an area's productivity. An accepted method of converting biovolume to biomass is to assume that the cell has the density of water (1 mm³ l⁻¹ = 1 mg l⁻¹) 13 . Another useful conversion is to carbon biomass; full methods are readily found in the literature 24-26 . Table 2 gives details of conversions of phytoplankton size data to carbon biomass.
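Two of these calculations can be sketched briefly: the diatom:dinoflagellate ratio from summed abundances, and wet biomass from abundance and per-cell biovolume under the density-of-water assumption (1 mm³ l⁻¹ = 1 mg l⁻¹, with 1 mm³ = 10⁹ µm³). The functional-group labels and function names are hypothetical, and per-cell biovolume is assumed to be in cubic micrometres.

```python
UM3_PER_MM3 = 1e9  # 1 mm = 1000 um, so 1 mm^3 = 1e9 um^3


def wet_biomass_mg_per_l(cells_per_l, biovolume_um3):
    """Wet biomass in mg per litre, assuming cells have the density of water.

    cells per litre * um^3 per cell gives um^3 per litre; dividing by 1e9
    gives mm^3 per litre, taken as equal to mg per litre.
    """
    return cells_per_l * biovolume_um3 / UM3_PER_MM3


def diatom_dinoflagellate_ratio(records):
    """Ratio of total diatom to total dinoflagellate abundance.

    records: iterable of (functional_group, cells_per_l) pairs; the group
    labels "diatom" and "dinoflagellate" are assumed for illustration.
    Returns None when no dinoflagellates were counted.
    """
    totals = {"diatom": 0.0, "dinoflagellate": 0.0}
    for group, cells_per_l in records:
        if group in totals:
            totals[group] += cells_per_l
    if totals["dinoflagellate"] == 0:
        return None
    return totals["diatom"] / totals["dinoflagellate"]
```

Returning `None` rather than infinity when no dinoflagellates are present leaves the choice of how to treat such bins to the analyst.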
Some of the data records are missing dates or coordinates. It was considered useful to keep these records, as the presence of a taxon in a location may still be of value. The user may estimate coordinates from the location given, bearing in mind that these would not be the exact coordinates of the sample.
Data can be analysed in many different ways and in many software applications (e.g., R, Matlab). We include here some figures created in The R Project for Statistical Computing (https://www.r-project.org) to demonstrate some potential uses of the data (Fig. 3).
In some cases, notably project 599, the IMOS National Reference Stations, the phytoplankton component of the survey is only a part of the data available. Additional biogeochemical data are available for this data set via the AODN. Some of the projects, viz. P479, P599 and P597, have corresponding zooplankton data freely available in The Australian Zooplankton Database 27 . These data sets can be matched by the project_id and the sample_id, which are consistent across databases. The list of citations referenced in Table 1 (available online only) will also give users information as to how data from the discrete projects have previously been used.
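Matching on project_id and sample_id can be sketched as a simple keyed join. Rows are assumed to be dictionaries with "project_id" and "sample_id" fields; any other field names, and the function name, are hypothetical and not drawn from either database.

```python
from collections import defaultdict


def match_samples(phyto_rows, zoo_rows):
    """Pair phytoplankton and zooplankton rows that share a sample.

    Each row is a dict with at least "project_id" and "sample_id".
    Returns a list of (phyto_row, zoo_row) pairs; rows without a
    counterpart in the other list are omitted.
    """
    zoo_by_sample = defaultdict(list)
    for row in zoo_rows:
        zoo_by_sample[(row["project_id"], row["sample_id"])].append(row)

    pairs = []
    for row in phyto_rows:
        key = (row["project_id"], row["sample_id"])
        for zoo_row in zoo_by_sample.get(key, []):
            pairs.append((row, zoo_row))
    return pairs
```

Keying on the pair rather than on sample_id alone matters because a sample_id may be repeated between projects.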