Dataset: A spatially and vertically resolved global grid of dissolved barium concentrations in seawater determined using Gaussian Process Regression machine learning

Final no updates expected

DOI: 10.26008/1912/bco-dmo.885506.2

Version 2 (2023-07-11)

Dataset Type: model results

Principal Investigator: Tristan J. Horner (Woods Hole Oceanographic Institution)

Student: Oyku Z. Mete (Woods Hole Oceanographic Institution)

BCO-DMO Data Manager: Shannon Rauch (Woods Hole Oceanographic Institution)


Project: The Speed, Signature, and Significance of Barium Transformations in Seawater (The Three S's)


Abstract

We present a spatially and vertically resolved global grid of dissolved barium concentrations ([Ba]) in seawater determined using Gaussian Process Regression machine learning. This model was trained using 4,345 quality-controlled GEOTRACES data from the Arctic, Atlantic, Pacific, and Southern Oceans. Model output was validated by assessing the accuracy of [Ba] simulations in the Indian Ocean, noting that none of the Indian Ocean data were seen by the model during training. We identify a model th...

The data are output from a machine learning model that was trained using GEOTRACES dissolved barium ([Ba]) data. Full protocols for sample collection and analysis are provided in the GEOTRACES Cookbook and 2021 Intermediate Data Product (see References), respectively.

Full methods are provided in a companion study published in Earth System Science Data (Mete et al., 2023). A summary of methods is provided below.

The features used to predict [Ba] and their associated data sources are summarized in Table 1 of Mete et al. (2023). The first three features (latitude, longitude, depth) record geospatial information that defines the location of an observation in three-dimensional space. Features 4–9 encode physical (temperature, salinity) and chemical (oxygen, nutrients) information that is routinely measured alongside [Ba]. These data were generally available for the same bottle as the [Ba] measurements. When that was not the case, nutrient data were taken from the corresponding location during a separate cast, or, in the case of oxygen, from linearly interpolated sensor data. Features 10–12 (mixed-layer depth, sea-surface chlorophyll a, and bathymetry) are independent of depth, meaning that all samples within a given vertical profile share the same value for each of them.

Table 2 of Mete et al. (2023) identifies all dataset sources of [Ba] ingested into the master record. The data ingestion process resulted in a master record containing 5,502 observations of [Ba], each with a corresponding value for all 12 of the features of interest described above. The record was then split into a Pareto partition: the first partition was used for ML model training (4,345 observations; 79 % of data) and the second for model testing (1,157 observations; 21 %).
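The ingestion and partitioning steps above can be illustrated with a minimal Python sketch. It is not the authors' MATLAB workflow. It assumes that the testing partition corresponds to the withheld Indian Ocean data mentioned elsewhere in this description, and the record layout, the field names (`basin`, `ba`), and all toy values are hypothetical.

```python
# Illustrative sketch only: field names and toy values are hypothetical,
# and the basin-based hold-out is an assumption drawn from the abstract.

def is_complete(record, feature_names):
    """True when the record has a value for every required feature."""
    return all(record.get(name) is not None for name in feature_names)

def split_by_basin(records, held_out_basin):
    """Separate records into training and testing sets by ocean basin."""
    train = [r for r in records if r["basin"] != held_out_basin]
    test = [r for r in records if r["basin"] == held_out_basin]
    return train, test

# Toy records (made-up numbers, two features only for brevity).
records = [
    {"basin": "Atlantic", "depth": 10.0, "temperature": 18.2, "ba": 41.0},
    {"basin": "Pacific", "depth": 500.0, "temperature": 6.1, "ba": 95.0},
    {"basin": "Indian", "depth": 100.0, "temperature": 15.0, "ba": 50.0},
    {"basin": "Southern", "depth": 50.0, "temperature": None, "ba": 80.0},
]
features = ["depth", "temperature"]

# Keep only observations that carry a value for every feature.
complete = [r for r in records if is_complete(r, features)]
train, test = split_by_basin(complete, "Indian")

# Shares reported in the text: 4,345 + 1,157 = 5,502 observations.
train_share = round(100 * 4345 / 5502)
test_share = round(100 * 1157 / 5502)
print(len(train), len(test), train_share, test_share)
```

Holding out an entire basin, rather than a random subset, tests whether a model generalizes to a region it has never seen.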

We opted for supervised ML using a Gaussian Process Regression learner, implemented in MATLAB. The training partition of the master record was used to train 4,095 different machine learning models with the goal of finding a model that could accurately simulate the global distribution of [Ba]. Each model uses a unique combination of the 12 features and our testing followed a factorial design whereby each feature was either enabled or disabled. In the second stage of cross validation, trained models were used to predict [Ba] for the withheld data from the Indian Ocean. The accuracy of the models was assessed by comparing ML model predictions against observed [Ba]. We then winnowed the list of models from 4,095 to a single, highly accurate model (#3080), which we used to simulate Ba* and the saturation state of seawater with respect to barite on a global basis.
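The count of 4,095 models is consistent with a full factorial over 12 on/off features that excludes the case with every feature disabled (2^12 − 1 = 4,095). The Python sketch below enumerates those combinations. The feature ordering, the three nutrient names (taken from the predictor list given later in this description), and the use of root-mean-square error as the accuracy measure are assumptions made for illustration; the actual model numbering and metrics are described in Mete et al. (2023).

```python
from itertools import product
from math import sqrt

# Assumed feature names and ordering, for illustration only.
FEATURES = [
    "latitude", "longitude", "depth",
    "temperature", "salinity", "oxygen",
    "phosphate", "nitrate", "silicate",
    "mixed_layer_depth", "chlorophyll_a", "bathymetry",
]

def feature_combinations(features):
    """Yield every non-empty subset of features (each one enabled or disabled)."""
    for mask in product([False, True], repeat=len(features)):
        if any(mask):  # skip the combination with all features disabled
            yield tuple(f for f, on in zip(features, mask) if on)

def rmse(predicted, observed):
    """Root-mean-square error between predictions and observations."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return sqrt(sum(squared) / len(squared))

combos = list(feature_combinations(FEATURES))
print(len(combos))  # 4095
```

In a workflow of this kind, each combination would be used to train one model, and the held-out predictions would be scored with a function such as `rmse` to rank the candidates.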

Refer to Mete et al. (2023) for complete methodology, results, and discussion.

The data provided here include the resulting global grid of dissolved [Ba], Ba*, and barite saturation state as well as Supplemental Files used in testing and training of the model.

The code used to run the model is also provided here in the Supplemental File "Model_3080_code.zip". "predictBa.m" is a script that allows users to predict [Ba] in seawater based on input data for seven predictors: depth, temperature, salinity, dioxygen, phosphate, nitrate, and silicate. Predictions of [Ba] are made using "trainedModel_Exp3080.mat", which is a Gaussian Process Regression machine learning model that was trained to simulate [Ba] based on these seven inputs. Instructions on how to use the model are provided in the comments of predictBa.m, and example input data are provided in "exampleData.xlsx". The code was written in MATLAB and should work on all versions beyond 2018a. All settings, configurations, and the training process are described in a companion study by Mete et al. (2023).


Related Datasets

No Related Datasets

Related Publications

Results

Mete, Ö. Z., Subhas, A. V., Kim, H. H., Dunlea, A. G., Whitmore, L. M., Shiller, A. M., Gilbert, M., Leavitt, W. D., & Horner, T. J. (2023). Barium in seawater: dissolved distribution, relationship to silicon, and barite saturation state determined using machine learning. Earth System Science Data, 15(9), 4023–4045. https://doi.org/10.5194/essd-15-4023-2023

Methods

Cutter, Gregory, Casciotti, Karen, Croot, Peter, Geibert, Walter, Heimbürger, Lars-Eric, Lohan, Maeve, Planquette, Hélène, van de Flierdt, Tina (2017) Sampling and Sample-handling Protocols for GEOTRACES Cruises. Version 3, August 2017. Toulouse, France, GEOTRACES International Project Office, 139pp. & Appendices. DOI: http://dx.doi.org/10.25607/OBP-2

References

GEOTRACES Intermediate Data Product Group. (2021). The GEOTRACES Intermediate Data Product 2021 (IDP2021). (Version 1) [Data set]. NERC EDS British Oceanographic Data Centre NOC. https://doi.org/10.5285/CF2D9BA9-D51D-3B7C-E053-8486ABC0F5FD