The data are the output of a machine learning (ML) model trained on GEOTRACES dissolved barium ([Ba]) data. Full protocols for sample collection and analysis are provided in the GEOTRACES Cookbook and 2021 Intermediate Data Product (see References), respectively.
Full methods are provided in a companion study, which is in revision for Earth System Science Data (Mete et al., 2023). A summary of methods is provided below.
The features used to predict [Ba] and their associated data sources are summarized in Table 1 of Mete et al. (2023). The first three features (latitude, longitude, depth) record geospatial information that defines the location of an observation in three-dimensional space. Features 4–9 encode physical (temperature, salinity) and chemical (oxygen, nutrients) information that is routinely measured alongside [Ba]. These data were generally available for the same bottle as the [Ba] measurements; where they were not, nutrient data were taken from a separate cast at the corresponding location and oxygen data were taken from linearly interpolated sensor data. Features 10–12 (mixed-layer depth, sea-surface chlorophyll a, and bathymetry) are independent of depth, meaning that all samples within a given vertical profile share the same value for each.
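The oxygen gap-filling step mentioned above can be illustrated with a minimal Python sketch of linear interpolation from a sensor profile to a bottle depth. The function name, the profile values, and the units are hypothetical and chosen only for illustration; the companion study describes the procedure actually used.

```python
from bisect import bisect_left


def interp_linear(depths, values, target):
    """Linearly interpolate `values` (sampled at ascending `depths`) to `target`."""
    if not depths[0] <= target <= depths[-1]:
        raise ValueError("target depth lies outside the sensor profile")
    i = bisect_left(depths, target)
    if depths[i] == target:
        # The sensor sampled exactly at the bottle depth.
        return values[i]
    z0, z1 = depths[i - 1], depths[i]
    v0, v1 = values[i - 1], values[i]
    return v0 + (v1 - v0) * (target - z0) / (z1 - z0)


# Hypothetical sensor profile: depth (m) and oxygen (umol/kg).
sensor_depth = [0.0, 50.0, 100.0, 200.0]
sensor_oxygen = [220.0, 210.0, 180.0, 160.0]

# Oxygen estimated at a hypothetical bottle depth of 75 m.
bottle_oxygen = interp_linear(sensor_depth, sensor_oxygen, 75.0)
```

The sketch refuses to extrapolate beyond the sampled depth range; whether the original workflow did the same is not stated in the source.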
Table 2 of Mete et al. (2023) identifies all sources of dissolved [Ba] data ingested into the master record. The data ingestion process resulted in a master record containing 5,502 observations of [Ba], each with a corresponding value for all 12 of the features described above. The record was then split into two partitions at an approximately 80:20 (Pareto) ratio: the first partition was used for ML model training (4,345 observations; 79 % of data) and the second for model testing (1,157 observations; 21 %).
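The partition sizes can be checked, and a split of the same shape reproduced, with a short Python sketch. The observation counts come from the text above; the random shuffle and the seed are assumptions made for illustration, since this summary does not state how observations were assigned to each partition.

```python
import random

N_TOTAL = 5502             # observations in the master record
N_TRAIN = 4345             # training partition
N_TEST = N_TOTAL - N_TRAIN # testing partition

train_pct = round(100 * N_TRAIN / N_TOTAL)
test_pct = round(100 * N_TEST / N_TOTAL)


def split_indices(n_total, n_train, seed=0):
    """Return (train, test) row indices from a seeded random shuffle.

    Random assignment is an assumption; the source does not specify it.
    """
    rng = random.Random(seed)
    idx = list(range(n_total))
    rng.shuffle(idx)
    return sorted(idx[:n_train]), sorted(idx[n_train:])


train_idx, test_idx = split_indices(N_TOTAL, N_TRAIN)
```

Here `train_pct` and `test_pct` recover the rounded shares quoted in the text, and the two index lists are disjoint and together cover every row.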
We opted for supervised ML using a Gaussian Process Regression learner, implemented in MATLAB. The training partition of the master record was used to train 4,095 different ML models with the goal of finding a model that could accurately simulate the global distribution of [Ba]. Each model uses a unique combination of the 12 features, and our testing followed a factorial design whereby each feature was either enabled or disabled (2^12 − 1 = 4,095 combinations, excluding the case with no features enabled). In the second stage of cross-validation, trained models were used to predict [Ba] for the withheld data from the Indian Ocean. The accuracy of the models was assessed by comparing ML model predictions against observed [Ba]. We then winnowed the list of models from 4,095 to a single, highly accurate model (#3080), which we used to simulate Ba* and the saturation state of seawater with respect to barite on a global basis.
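The factorial design can be sketched in Python by enumerating every non-empty subset of the 12 features. The feature labels below are illustrative spellings of the features named in this description, and the enumeration order is arbitrary: it is not claimed to match the authors' model numbering (for example, #3080).

```python
from itertools import combinations

# Illustrative labels for the 12 features described in the text.
FEATURES = [
    "latitude", "longitude", "depth",
    "temperature", "salinity", "oxygen",
    "phosphate", "nitrate", "silicate",
    "mixed_layer_depth", "chlorophyll_a", "bathymetry",
]


def feature_subsets(features):
    """Yield every non-empty combination of features (each enabled or disabled)."""
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            yield subset


subsets = list(feature_subsets(FEATURES))
n_models = len(subsets)  # 2**12 - 1 candidate models

# The seven predictors used by the selected model form one such subset.
seven = ("depth", "temperature", "salinity", "oxygen",
         "phosphate", "nitrate", "silicate")
has_seven = seven in set(subsets)
```

Each tuple in `subsets` corresponds to one candidate model's feature set; training and scoring each candidate is described in the companion study and is not reproduced here.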
Refer to Mete et al. (2023) for complete methodology, results, and discussion.
The data provided here include the resulting global grid of dissolved [Ba], Ba*, and barite saturation state as well as Supplemental Files used in testing and training of the model.
The code used to run the model is also provided here in the Supplemental File "Model_3080_code.zip". "predictBa.m" is a MATLAB script that allows users to predict [Ba] in seawater from input data for seven predictors: depth, temperature, salinity, dioxygen, phosphate, nitrate, and silicate. Predictions of [Ba] are made using "trainedModel_Exp3080.mat", a Gaussian Process Regression model that was trained to simulate [Ba] from these seven inputs. Instructions for using the model are provided in the comments in "predictBa.m", and example input data are provided in "exampleData.xlsx". The code was written in MATLAB and should work in all releases after R2018a. All settings, configurations, and the training process are described in the companion study by Mete et al. (2023).