Machine learning identifies a strong association between warming and reduced primary productivity in an oligotrophic ocean gyre

Phytoplankton play key roles in the oceans by regulating global biogeochemical cycles and production in marine food webs. Global warming is thought to affect phytoplankton production both directly, by impacting their photosynthetic metabolism, and indirectly, by modifying the physical environment in which they grow. In this respect, the Bermuda Atlantic Time-series Study (BATS) in the Sargasso Sea (North Atlantic gyre) provides a unique opportunity to explore the effects of warming on phytoplankton production across the vast oligotrophic ocean regions, because it is one of the few multidecadal records of measured net primary productivity (NPP). We analysed the time series of phytoplankton primary productivity at the BATS site using machine learning (ML) techniques to show that increased water temperature over a 27-year period (1990–2016), and the consequent weakening of vertical mixing in the upper ocean, induced a negative feedback on phytoplankton productivity by reducing the availability of two essential resources, nitrogen and light. The unbalanced availability of these resources with warming, coupled with ecological changes at the community level, is expected to intensify the oligotrophic state of open-ocean regions that are far from land-based nutrient sources.

Long-term trends of physical, chemical, and biological properties in time series of monthly mean values from measurements made between 1990 and 2016. Trends are estimates of the Sen slope (a measure of the magnitude of a trend) and its significance (p) from the Seasonal Kendall test of monthly time series, implemented with the seaKen function in R package wql (a maintained version of the now-archived package wq, available at: http://cran.r-project.org/package=wq).

Table S2. Windowed trends of temperature and net primary production (NPP) over sequential decades beginning in 1990. Trends are estimates of the Sen slope and its significance (p) from time series of monthly mean values, using the seaRoll function in R package wql with a window width of ten years. Data gaps were filled by interpolation using function interpTs, filling missing values with means for the corresponding month. Statistically significant analyses are grey-marked.

x1 is the Julian day, x2 is the day of the year, x3 is the mixed layer depth, x4 is the density gradient between 20 and 120 m depths, x5 is the average temperature between 0 and 120 m depths, x6-12 are the integral values of nitrate, phosphate, silicate, fucoxanthin, chlorophyll b, chlorophyll a, and lutein + zeaxanthin, respectively, between 0 and 120 m depths.
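The gap-filling rule described in the Table S2 caption (missing values replaced by the mean for the corresponding month) can be sketched in Python as follows. This is an illustrative sketch only, not the interpTs code: the function name `fill_with_monthly_means` and the `None` gap marker are assumptions made for the example.

```python
def fill_with_monthly_means(values, period=12):
    """Replace None entries with the mean of the observed values
    for the same month (same position within each period)."""
    filled = list(values)
    for season in range(period):
        observed = [v for i, v in enumerate(values)
                    if i % period == season and v is not None]
        if not observed:
            continue  # no data at all for this month: leave its gaps untouched
        month_mean = sum(observed) / len(observed)
        for i in range(season, len(values), period):
            if filled[i] is None:
                filled[i] = month_mean
    return filled


# Hypothetical two-"month" series with one gap in each month
example = fill_with_monthly_means([1.0, 10.0, None, 20.0, 3.0, None], period=2)
```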

Supplementary methods
Gaussian Processes (Linear Kernel) - In Gaussian Processes 1, the prediction is probabilistic (Gaussian), so one can compute empirical confidence intervals and use them to decide whether the prediction should be refitted (online or adaptive fitting) in some region of interest. Their greatest practical advantage is that they give a reliable estimate of their own uncertainty. Since Gaussian processes describe probability distributions over functions, Bayes' rule can be used to update the distribution of functions after observing training data.
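To make the idea of a prediction with its own uncertainty concrete, here is a minimal pure-Python sketch of Gaussian process regression with a linear kernel for scalar inputs. Several choices are assumptions made for illustration and are not taken from the study: the kernel form `1 + a*b` (a linear kernel with a constant bias term), the noise variance of 0.1, the helper names, and the toy data.

```python
import math


def linear_kernel(a, b):
    """Linear (dot-product) kernel for scalar inputs, with a bias term."""
    return 1.0 + a * b


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def gp_predict(x_train, y_train, x_new, noise_var=0.1):
    """Posterior mean and standard deviation of the latent function at x_new."""
    # Kernel matrix of the training inputs, with observation noise on the diagonal
    K = [[linear_kernel(a, b) + (noise_var if i == j else 0.0)
          for j, b in enumerate(x_train)] for i, a in enumerate(x_train)]
    k_star = [linear_kernel(a, x_new) for a in x_train]
    alpha = solve(K, y_train)   # (K + noise * I)^-1 y
    v = solve(K, k_star)        # (K + noise * I)^-1 k_star
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = linear_kernel(x_new, x_new) - sum(k * w for k, w in zip(k_star, v))
    return mean, math.sqrt(max(var, 0.0))


# Toy data on a straight line; predict outside the observed range
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
mean, std = gp_predict(xs, ys, 5.0)
print(f"prediction: {mean:.2f}, 95% interval half-width: {1.96 * std:.2f}")
```

The returned standard deviation grows as the query point moves away from the training data, which is the quantity one would inspect when deciding whether a region needs refitting.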
Linear Regression models - This ML technique 1 predicts a target value from independent variables by finding a linear relationship between input and output.
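For a single independent variable, the linear relationship can be found by ordinary least squares, as in the short Python sketch below. The function name `fit_line` and the example data are hypothetical, and the single-predictor case is a simplification chosen for brevity.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on y = 3x - 2
slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [-2.0, 1.0, 4.0, 7.0])
```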
Linear Random Forest - The Random Forest 2 is an evolution of the Decision Tree method. A Decision Tree shows all possible outcomes of a decision as a branching tree: the internal nodes are tests on attributes, the branches are the outcomes of those tests, and the leaf nodes are the decisions reached after evaluating the attributes along the path. The Random Forest algorithm addresses a key limitation of a single Decision Tree, namely that the accuracy of the outcome decreases as the number of decisions in the tree increases. It therefore builds multiple decision trees, each a CART (Classification and Regression Trees) model grown on a random sample of the data. The final prediction is obtained by polling the results of all the trees (a majority vote for classification, an average for regression).
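The ensemble idea can be sketched in pure Python as a small regression forest: each tree is grown on a bootstrap sample, considers a random subset of features at each split, and the forest averages the tree predictions. This is a simplified sketch, not the configuration used in the study; the function names, tree depth, number of trees, and toy data are all assumptions.

```python
import random


def sse(targets):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not targets:
        return 0.0
    m = sum(targets) / len(targets)
    return sum((t - m) ** 2 for t in targets)


def build_tree(rows, depth, rng, n_try):
    """Grow a regression tree on rows = [(features, target), ...]."""
    mean = sum(t for _, t in rows) / len(rows)
    if depth == 0 or len(rows) < 2:
        return mean  # leaf: predict the mean target
    best = None
    for f in rng.sample(range(len(rows[0][0])), n_try):  # random feature subset
        for thr in sorted({x[f] for x, _ in rows})[1:]:  # candidate thresholds
            left = [(x, t) for x, t in rows if x[f] < thr]
            right = [(x, t) for x, t in rows if x[f] >= thr]
            score = sse([t for _, t in left]) + sse([t for _, t in right])
            if best is None or score < best[0]:
                best = (score, f, thr, left, right)
    if best is None:
        return mean  # no valid split on the sampled features
    _, f, thr, left, right = best
    return (f, thr,
            build_tree(left, depth - 1, rng, n_try),
            build_tree(right, depth - 1, rng, n_try))


def predict_tree(node, x):
    """Follow the tests from the root until a leaf value is reached."""
    while isinstance(node, tuple):
        f, thr, left, right = node
        node = left if x[f] < thr else right
    return node


def fit_forest(rows, n_trees=30, depth=3, n_try=1, seed=0):
    """Each tree is grown on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    return [build_tree(rng.choices(rows, k=len(rows)), depth, rng, n_try)
            for _ in range(n_trees)]


def predict_forest(forest, x):
    """Average the predictions of all trees (regression)."""
    return sum(predict_tree(tree, x) for tree in forest) / len(forest)


# Toy data: the target depends on feature 0 only; feature 1 is noise
data_rng = random.Random(1)
rows = [([i / 40, data_rng.random()], 10.0 if i / 40 > 0.5 else 0.0)
        for i in range(40)]
forest = fit_forest(rows)
high = predict_forest(forest, [0.95, 0.5])
low = predict_forest(forest, [0.05, 0.5])
```

Individual trees that happen to split only on the noise feature give poor predictions, but averaging across the forest dampens their influence.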

Support Vector Machine - This is a machine learning tool for classification and regression 3. Given labelled training data, the algorithm outputs an optimal hyperplane that categorizes new examples. In the case of regression, a margin of tolerance (epsilon) is set, and the algorithm identifies the hyperplane that maximizes the margin.
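The role of the epsilon margin in regression can be illustrated with a linear support vector regression fitted by subgradient descent: residuals smaller than epsilon cost nothing, and only points outside this tube pull on the fit. This is a sketch under stated assumptions, not the solver used in the study. Subgradient descent is substituted here for the quadratic-programming solvers that SVM libraries typically use, and the function name, the hyperparameters (`epsilon`, `c`, `lr`, `epochs`), and the toy data are hypothetical.

```python
def fit_linear_svr(xs, ys, epsilon=0.1, c=10.0, lr=0.001, epochs=5000):
    """Minimise 0.5 * w**2 + (c / n) * sum(max(0, |w*x + b - y| - epsilon))
    by subgradient descent, for a single input variable."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w, grad_b = w, 0.0  # gradient of the 0.5 * w**2 regulariser
        for x, y in zip(xs, ys):
            residual = w * x + b - y
            if residual > epsilon:       # prediction too high, outside the tube
                grad_w += c * x / n
                grad_b += c / n
            elif residual < -epsilon:    # prediction too low, outside the tube
                grad_w -= c * x / n
                grad_b -= c / n
            # residuals inside the tube contribute no loss and no gradient
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data lying exactly on y = 2x + 1
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = fit_linear_svr(xs, ys)
prediction = w * 1.0 + b
```

Because errors inside the tube are free, the regulariser flattens the fitted line slightly until the outermost points reach the edge of the tolerance band.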