komanawa.BASE#

The BASE module is a Python package for conducting a Bayesian analysis that constrains a prior on the source concentration of a given groundwater receptor (e.g. a well or a spring-fed stream) using the measured concentration at the receptor and an inferred age distribution (e.g. via tritium sampling).

This package provides the BASESolver, MultiSiteBASESolver, and MultiSiteDilutionBASESolver classes to run the analysis, along with a set of classes to support generating the prior distribution of the source concentration.

Classes#

BASESolver

Class for predicting the source concentration from the BEPFM (binary exponential piston flow) model

BaseSampledParam

Schema for a sampled parameter which captures the distribution and arguments for the distribution to check that the distribution is valid and has not changed.

MultiSiteBASESolver

Class for predicting the source concentration from the BEPFM model across multiple sites. This class is a subclass of BASESolver.

MultiSiteDilutionBASESolver

A MultiSiteBASESolver that includes dilution parameters for each site derived from a truncated half normal distribution.

Functions#

load_npz_source_receptor(npzfile[, ...])

Load the source and receptor data from an npz file (as serialized by BASESolver.get_source_receptor_data)

Package Contents#

class BASESolver(save_dir, n_inflections, ts_data, mrt, mrt_p1, mrt_p2, frac_p1, f_p1, f_p2, precision=2, min_conc=0.01, inflection_times=None, cdf_inflection_start=None, sigma=None, interpolator='linear', interpolator_kwargs=None, serialize_receptor_data=True, predict_ncores=None, nchains=5, niterations=50000, start_age=np.nan, verbose=True)#

Class for predicting the source concentration from the BEPFM (binary exponential piston flow) model.

BASE: Bayesian approach to source estimation. This class uses the MT-DREAM(ZS) algorithm to conduct Bayesian inference on the source concentration given measured time series data and a BEPFM model.

The MT-DREAM(ZS) algorithm was introduced in: Laloy, E. & Vrugt, J. A. High-dimensional posterior exploration of hydrologic models using multiple-try DREAM(ZS) and high-performance computing. Water Resources Research 48, W01526 (2012).

Instance Variables:

Variables:
  • age_bounds – Age bounds for the model (years since start date)

  • age_fractions – Age fractions for age in the BEPFM model (see komanawa.gw_age_tools.exponential_piston_flow.make_age_dist)

  • age_step – Age step for the BEPFM model (see komanawa.gw_age_tools.exponential_piston_flow.make_age_dist)

  • ages – Ages for the BEPFM model (see komanawa.gw_age_tools.exponential_piston_flow.make_age_dist)

  • end_date – End date of the observed time series data (pd.Timestamp)

  • f_p1 – Fraction of exponential flow in the first piston (age model parameter)

  • f_p2 – Fraction of exponential flow in the second piston (age model parameter)

  • frac_p1 – Fraction of the total mass in the first peak (age model parameter)

  • inflection_times – Inflection times for the model (years since start date). The source concentration is parameterized at these times. Note that len(inflection_times) = n_inflections + 2, where n_inflections is passed to the init; the method automatically includes the start and end of the time series

  • measured_age_bounds – Age bounds for the measured time series data (years since start date)

  • mrt – Mean residence time for the age distribution (yrs) (age model parameter)

  • mrt_p1 – Mean residence time for the first peak (yrs) (age model parameter)

  • mrt_p2 – Mean residence time for the second peak (yrs) (age model parameter)

  • nchains – Number of pydream chains for the model. Note that each chain is a separate process, so if nchains exceeds the number of available processors the run will be MUCH slower

  • niterations – The number of algorithm iterations to run.

  • precision – precision of the age distribution (decimal places of years). The default is 2, which is approximately monthly

  • predict_ncores – Number of cores for prediction (see the constructor parameter of the same name)

  • save_dir – working directory for the model

  • serialize_receptor_data – Boolean flag to serialize receptor data

  • start_date – Start date of the time series data (pd.Timestamp)

  • t – time index for the time series data (decimal years since start date)

  • ts_data – Time series data (pandas Series) Index=self.t, values=concentration

Class attributes:

Variables:
  • available_interpolators – tuple of available interpolators

  • all_t – not used in this class, but used in the child classes

  • _force_old_predict – a flag to force the use of the old predict method for testing purposes

Basic Usage:

from komanawa.BASE import BASESolver, generators, BaseSampledParam
from pathlib import Path
import matplotlib.pyplot as plt

# Set up the problem
first_model_name = 'first_model'
n_inflections = 5  # this will increase to 7 with the addition of a start and end point
ndata = get_ndata()  # get your data, this should be a pandas series with a datetime index
output_dir = Path.home()
cdf_threshold = 0.15
mrt = 12
f_p1 = 0.7

# create the solver
solver = BASESolver(
            save_dir=output_dir,
            ts_data=ndata,
            mrt=mrt,
            mrt_p1=mrt,
            mrt_p2=mrt,
            frac_p1=1,
            f_p1=f_p1,
            f_p2=f_p1,
            n_inflections=n_inflections,
            cdf_inflection_start=cdf_threshold,
        )

# set priors
lows = [0.001, 0.001, 0.1, 0.1, 0.2, 0.5, 1.0]
ups = [0.5, 2, 6, 6, 6, 6, 6]
params = BaseSampledParam(distribution=generators.UniformPathGenerator, low=lows, up=ups, n=7)

# run the model
run = solver.run_base(model_name=first_model_name,
        params=params, verbose=False)

# plot the results
nplot = 0.2 # plot 20% of the results
fig, ax = solver.plot_best_source_receptor(first_model_name, nplot=nplot, receptor_spacing=1)
plt.show()

Parameters:
  • save_dir – directory to run dreamzs in, files are named via model name kwarg

  • n_inflections – number of inflection times to use (int) or frequency of inflection times (see https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-offset-aliases)

  • ts_data – time series data to fit to

  • mrt – mean residence time of the source (yrs)

  • mrt_p1 – mean residence time of the first piston (yrs)

  • mrt_p2 – mean residence time of the second piston (yrs)

  • frac_p1 – fraction of the source that is in the first piston

  • f_p1 – fraction of exponential flow in the first piston

  • f_p2 – fraction of exponential flow in the second piston (use a dummy value if frac_p1 = 1)

  • precision – precision of the age distribution (decimal places years) default is 2, approximately monthly

  • min_conc – minimum concentration for the interpolated source concentration, default is 0.01. This prevents weird interpolation behaviours in the early, poorly constrained part of the time series. This is not a hard limit, but rather a minimum value for the interpolated source concentration: the actual value will be set to the maximum of the interpolated value and min_conc.

  • inflection_times – None or list of datetimes of the inflection times. If None then the inflection times will be evenly spaced between the start and end of the ages + time series. cdf_inflection_start and inflection_times cannot both be provided (at least one must be None). inflection_times must be None if n_inflections is a frequency code.

  • cdf_inflection_start – None or float, the threshold of the CDF of ages at which to start the evenly spaced inflection points. cdf_inflection_start and inflection_times cannot both be provided (at least one must be None).

  • sigma – None, a float, or an array of floats: the uncertainty in the time series data. If None then no uncertainty is used and self.log_likelihood_no_sigma is used; if a float then the same uncertainty is used for all points; if an array then the uncertainty is applied per point. Whenever sigma is not None, self.log_likelihood_sigma is used.

  • interpolator

str, the interpolator to use to interpolate the source concentration from the inflection points to the frequency needed for the 1D age function (BEPFM); options are:

    • ‘linear’: piecewise linear interpolation (default)

    • ‘cubicspline’: cubic spline interpolation, see scipy.interpolate.CubicSpline

    • ‘pchip’: PCHIP 1-D monotonic cubic interpolation, see scipy.interpolate.PchipInterpolator

    • ‘akima’: Akima interpolation, see scipy.interpolate.Akima1DInterpolator

  • interpolator_kwargs – None or dict, additional kwargs passed to the interpolator (see interpolator)

  • serialize_receptor_data – bool, whether to serialize the receptor data for later use (increases disk use but decreases runtime)

  • predict_ncores – None or int, number of cores to use for prediction, if None then use all available (virtual) cores

  • nchains – number of chains to run in dreamzs. Note that each chain is a separate process, so if nchains exceeds the number of available processors the run will be MUCH slower

  • niterations – The number of algorithm iterations to run. The default is 50000. Higher values will take longer to run but may produce better results.

  • start_age – start age for the age distribution (yrs) default is np.nan which will use the maximum of the mrt_p1 and mrt_p2

  • verbose – bool, whether to print progress to the console
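As an illustration of how the interpolator options behave (the arrays below are hypothetical, not part of the API), the source concentration at the inflection points can be interpolated to a finer grid with scipy, applying a min_conc floor as described above:

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

# hypothetical inflection times (years since start date) and sampled source concentrations
times = np.array([0.0, 5.0, 10.0, 20.0, 30.0])
concs = np.array([0.5, 2.0, 6.0, 4.0, 3.0])

# the 'pchip' option: monotonic cubic interpolation, which avoids overshoot between points
interp = PchipInterpolator(times, concs)

# interpolate to ~monthly resolution (precision=2 decimal years is approximately monthly)
fine_t = np.round(np.arange(0.0, 30.0 + 0.01, 0.01), 2)
fine_conc = np.maximum(interp(fine_t), 0.01)  # min_conc floor, as the solver applies
```

A monotonic interpolator such as ‘pchip’ can be preferable to ‘cubicspline’ when overshoot between inflection points would produce unphysical (e.g. negative) concentrations.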

convert_age_to_decimal_years(age)#

convert an age to a decimal year

Parameters:

age – years since start date

Returns:

decimal year (e.g. 1950.54)

convert_age_to_dt(age)#

convert an age to a datetime

Parameters:

age – age(s) to convert (years since start date)

Returns:

datetime(s)

convert_decimal_years_to_age(decimal_years)#

convert a decimal year to an age

Parameters:

decimal_years – decimal years to convert e.g. 1950.54

Returns:

age (float) years since start date

convert_decimal_years_to_dt(decimal_years)#

convert a decimal year to a datetime

Parameters:

decimal_years – decimal years to convert e.g. 1950.54

Returns:

datetime

convert_dt_to_age(dt)#

convert a datetime to an age (years since start date)

Parameters:

dt – datetime(s) to convert

Returns:

age(s) years from start date

convert_dt_to_decimal_years(dt)#

convert a datetime to a decimal year

Parameters:

dt – datetime to convert

Returns:

decimal year (e.g. 1950.54)
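The conversion helpers above implement simple arithmetic between datetimes, ages (years since start date), and decimal years. A minimal sketch of that arithmetic, assuming a hypothetical start_date (the real methods use self.start_date and may handle calendar details differently):

```python
import pandas as pd

start_date = pd.Timestamp('1950-01-01')  # hypothetical start of the time series

def age_to_decimal_year(age, start=start_date):
    # decimal year = start date expressed as a decimal year, plus the age
    start_decimal = start.year + (start.dayofyear - 1) / 365.25
    return start_decimal + age

def decimal_year_to_age(decimal_year, start=start_date):
    # inverse of the above
    start_decimal = start.year + (start.dayofyear - 1) / 365.25
    return decimal_year - start_decimal
```

For this start date, age_to_decimal_year(0.54) yields 1950.54, matching the example in the docstrings.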

export_source_receptor(model_name, outdir, nexport, receptor_spacing=1, percentiles=(1, 5, 10, 25), tvals=None, project_forward=None, forward_ts_deltas=None, forward_ts_deltas_type='absolute', percentile_format='hdf', include_raw=False)#

export the source and receptor data to csv files or hdf file store

The HDF file will contain the following keys; the csv files will have the same suffixes as the keys below:

  • source_percentiles: source data percentiles

  • receptor_percentiles: receptor data percentiles

  • measured: measured data

  • source_raw: optional if include_raw=True, source data

  • receptor_raw: optional if include_raw=True, receptor data

Parameters:
  • outdir – output directory

  • nexport – number of results to export (int) or fraction of results to export (float, nexport<=1)

  • receptor_spacing – space between receptor prediction points

  • percentiles – tuple of percentiles to export fill between. e.g. a percentile of 5 will plot the 5th and 95th

  • tvals – None or list of times to sample at (years since the data starts, self.start_date). Note that tvals serializes the data by hash, so we suggest rounding to a reasonable precision before passing them to this method.

  • project_forward – None or int, if an int then project the best n params forward by this many years at the end of the time series

  • forward_ts_deltas – None or pd.Series (index = decimal years since the end date (self.age_bounds[1]), values = deltas from time zero). If the index of forward_ts_deltas covers less than project_forward, the remainder of the assumed time series is assumed to have the same delta as the last point. Note that forward_ts_deltas serializes the data by hash, so we suggest rounding to a reasonable precision before passing it to this method.

  • forward_ts_deltas_type – str, ‘absolute’ (default) or ‘fraction’. Where ss_conc is the steady state concentration (conc. at self.age_bounds[1]), the forward source is calculated as:

  • if ‘absolute’:
    • forward_source = ss_conc + forward_ts_deltas

  • if ‘fraction’:
    • forward_source = ss_conc * forward_ts_deltas

Parameters:

percentile_format – str, ‘hdf’ or ‘csv’ format to save the percentiles

Returns:

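The ‘absolute’ vs ‘fraction’ behaviour of forward_ts_deltas_type can be illustrated with hypothetical values (ss_conc and the delta series below are made up for the example):

```python
import pandas as pd

ss_conc = 6.0  # steady state concentration at the end of the time series (illustrative)

# deltas indexed by decimal years since the end of the time series
abs_deltas = pd.Series([0.0, -0.5, -1.0], index=[0.0, 5.0, 10.0])   # for 'absolute'
frac_deltas = pd.Series([1.0, 0.9, 0.8], index=[0.0, 5.0, 10.0])    # for 'fraction'

forward_source_absolute = ss_conc + abs_deltas   # deltas are added to ss_conc
forward_source_fraction = ss_conc * frac_deltas  # deltas are multipliers of ss_conc
```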
get_parameter_log(model_name)#

get the parameter log for the model run from the model name

Parameters:

model_name – name for the model (str)

Returns:

dict of parameter names and str values

get_rmse(model_name, ntest, raw=False)#

get the rmse for the model run from the model name

Parameters:
  • model_name – name for the model (str)

  • ntest – number of test simulations to use for the rmse

  • raw – if True return the raw rmse values, if False return the 50th percentile and 5th and 95th percentiles

Returns:

rmse (float)
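A sketch of the raw vs summarized behaviour described above (the helper name and input are illustrative; the real method computes the rmse values from the test simulations internally):

```python
import numpy as np

def summarise_rmse(raw_rmse, raw=False):
    # raw_rmse: one rmse value per test simulation (hypothetical input)
    raw_rmse = np.asarray(raw_rmse, dtype=float)
    if raw:
        return raw_rmse  # return every value unchanged
    # otherwise summarise as the 50th percentile with the 5th and 95th
    return np.percentile(raw_rmse, [50, 5, 95])
```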

get_ss_conc(model_name, nplot, site=None)#

get the steady state concentration (e.g. the last concentration in the time series) for the model run

Parameters:
  • model_name – model name

  • nplot – number of results to plot (int) or fraction of results to plot (float, nplot<=1)

  • site – None (not used, for compatibility with MultiSiteDilutionBASESolver)

Returns:

pd.Series of steady state concentrations

load_logps_params(model_name, _concat=True)#

load the unsorted logps and sampled parameters from the model run

Parameters:

model_name – name for the model (str)

Returns:

log_p (np.array) shape=(nsimulations)

Returns:

params (np.array) shape=(nsimulations, nparams)

log_likelihood_no_sigma(params)#

calculate the log likelihood of the model (where the sigma of the observations IS NOT provided). The log likelihood function without observation sigma can be changed by subclassing this class and overriding this method.

Parameters:

params – sampled parameters for the model instance (see BASESolver.run_base)

Returns:

likelihood (float)

log_likelihood_sigma(params)#

calculate the log likelihood of the model (where the sigma of the observations IS provided). The log likelihood function with observation sigma can be changed by subclassing this class and overriding this method.

Parameters:

params – sampled parameters for the model instance (see BASESolver.run_base)

Returns:

likelihood (float)
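The specific likelihood used by BASE is not reproduced here, but a typical Gaussian log likelihood with a known observation sigma (a sketch of the standard form, not necessarily the exact implementation) looks like:

```python
import numpy as np

def gaussian_log_likelihood(observed, predicted, sigma):
    # standard Gaussian log likelihood with known observation uncertainty;
    # sigma may be a scalar (applied to all points) or an array (per point)
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    sigma = np.broadcast_to(np.asarray(sigma, dtype=float), observed.shape)
    resid = observed - predicted
    return float(np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                        - resid ** 2 / (2 * sigma ** 2)))
```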

make_forward_ts_deltas(time_series=None, age_series=None, age_precision=2, values_precision=2, forward_ts_deltas_type='absolute')#

make a given time series / age series into deltas (differences from the value at the start date). This is a convenience function for creating the forward_ts_deltas parameter for the model.

Parameters:
  • time_series – pd.Series of time series data (datetime index)

  • age_series – pd.Series of ages (years since start date)

  • age_precision – int, precision to round the age series index

  • values_precision – int, precision to round the values in the age series values

  • forward_ts_deltas_type – str, ‘absolute’ (default) or ‘fraction’. Where ss_conc is the steady state concentration (conc. at self.age_bounds[1]), the forward source is calculated as:

  • if ‘absolute’:
    • forward_source = ss_conc + forward_ts_deltas

  • if ‘fraction’:
    • forward_source = ss_conc * forward_ts_deltas

Returns:

pd.Series of forward_ts_deltas (index = years since start date (self.age_bounds[1]))

Returns:

forward_ts_deltas_type: str, ‘absolute’ or ‘fraction’

Returns:

project_forward: int, the number of years to project forward

Returns:

start_val: the value at self.age_bounds[1]

make_receptor_points(receptor_spacing=None, project_forward=None, tvals=None, site=None)#

make the receptor points for the model, the following options are available:

  • make_receptor_points(receptor_spacing=None, project_forward=None, tvals=None) -> self.t (sampling points)

  • make_receptor_points(receptor_spacing=1, project_forward=None, tvals=None) -> evenly spaced receptor points (about yearly) from 0 to self.t.max()

  • make_receptor_points(receptor_spacing=1, project_forward=10, tvals=None) -> evenly spaced receptor points (about yearly) from 0 to self.t.max() + 10

  • make_receptor_points(receptor_spacing=None, project_forward=None, tvals=*args) -> sampling at tvals (years since data starts)

  • all other combinations are invalid and will raise an exception

Parameters:
  • receptor_spacing – the spacing to use (decimal years)

  • project_forward – None or int, if an int then project the final concentration N years forward.

  • tvals – None or list of times to sample at (years since data starts, self.start_date)

  • site – None; should always be None, used to ensure the same signature as the corresponding methods in child classes

Returns:

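The valid option combinations listed above can be sketched as a simplified stand-in (not the actual implementation; argument handling details may differ):

```python
import numpy as np

def make_receptor_points_sketch(t, receptor_spacing=None, project_forward=None, tvals=None):
    # t: observation times in decimal years since the start date (hypothetical input)
    if tvals is not None:
        # sampling at user supplied times; spacing/projection must not also be set
        assert receptor_spacing is None and project_forward is None
        return np.asarray(tvals, dtype=float)
    if receptor_spacing is None:
        assert project_forward is None  # projecting forward requires a spacing
        return np.asarray(t, dtype=float)  # default: the sampling points themselves
    end = max(t) + (project_forward or 0)
    # evenly spaced points from 0 to the (possibly projected) end of the record
    return np.arange(0.0, end + receptor_spacing, receptor_spacing)
```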
make_source_receptor_data(model_name, nplot, receptor_spacing=1, tvals=None, project_forward=None, source_only=False, forward_ts_deltas=None, forward_ts_deltas_type='absolute', recalc=False, _return_ncomputed=False, return_interpolated_source=True)#

make the source and receptor data for the run model. If self.serialize_receptor_data is True the receptor data will be saved to an npz file for later use, and any receptors already computed will be loaded from that file. If the file is not found the receptor data will be recomputed. The process of predicting the receptor data is parallelized using multiprocessing, but can still take some time, especially for large numbers of simulations.

The save file is unique to the model_name, receptor_spacing, project_forward, and tvals. Note that changes to the class instance attributes are not checked and will not cause the data to be recomputed.

Note that passing tvals or forward_ts_deltas will serialize the receptor data with the hash of the tvals/forward_ts_deltas used in the filename. Tvals/forward_ts_deltas will also be saved in the npz file and checked on reload. Note that minor changes in the tvals/forward_ts_deltas (e.g. float imprecision) will cause the hash to change and the data to be recomputed; we therefore suggest rounding tvals/forward_ts_deltas to a reasonable precision before passing them to this method.

Parameters:
  • nplot – number of results to plot (int) or fraction of results to plot (float, nplot<=1)

  • receptor_spacing – the number of years between receptor prediction points

  • tvals – None or list of times to sample at (years since the data starts, self.start_date). Note that tvals serializes the data by hash, so we suggest rounding to a reasonable precision before passing them to this method.

  • project_forward – None or int, if an int then project the best n params forward by this many years at the end of the time series

  • source_only – bool, if True only return the source data (don’t compute the receptor data)

  • forward_ts_deltas – None or pd.Series (index = decimal years since the end date (self.age_bounds[1]), values = deltas from time zero). If the index of forward_ts_deltas covers less than project_forward, the remainder of the assumed time series is assumed to have the same delta as the last point. Note that forward_ts_deltas serializes the data by hash, so we suggest rounding to a reasonable precision before passing it to this method.

  • forward_ts_deltas_type – str, ‘absolute’ (default) or ‘fraction’. Where ss_conc is the steady state concentration (conc. at self.age_bounds[1]), the forward source is calculated as:

  • if ‘absolute’:
    • forward_source = ss_conc + forward_ts_deltas

  • if ‘fraction’:
    • forward_source = ss_conc * forward_ts_deltas

Parameters:
  • recalc – bool, if True then recalculate the receptor data even if it has been serialized

  • _return_ncomputed – bool, if True return the number of computed receptor data points, for testing only

  • return_interpolated_source – bool, if True return the interpolated source data, if False return the raw source data

Returns:

source_data: source concentration predictions (ndarray, shape=(nsims, nsource_points); interpolated to self.precision if return_interpolated_source=True)

Returns:

source_points: source concentration timings, years since start date (ndarray, shape=(nsource_points,); nsource_points is self.n_inflections if project_forward is None, else self.n_inflections + 1, unless interpolated to self.precision when return_interpolated_source=True)

Returns:

outreceptor_data: receptor concentration predictions (ndarray, shape=(nsims, nreceptor_points))

Returns:

receptor_points: receptor timings, years since start date (ndarray, shape=(nreceptor_points,))

parameters_match_log(model_name, params, starts, start_random, hardboundaries, multitry, **kwargs)#

check the parameters against the log file, see run_base for more info

Parameters:
  • params – as would have been passed to self.run_base

  • starts – as would have been passed to self.run_base

  • start_random – as would have been passed to self.run_base

  • hardboundaries – as would have been passed to self.run_base

  • multitry – as would have been passed to self.run_base

  • kwargs – as would have been passed to self.run_base

Returns:

args_match(bool), (got_args_dict, expect_arg_dict)

plot_best_receptor(model_name, nplot, receptor_spacing=1, percentiles=(1, 5, 10, 25), pdf_alpha=0.1, plot_inflection_points=True, plot_annual_vlines=True, tvals=None, project_forward=None, forward_ts_deltas=None, forward_ts_deltas_type='absolute', figsize=(10, 10), plot_measured=True, add_rmse=False, ax_recept=None, allsite_on_one=False, rt_leg_hand_lab=False)#

plot the best n predictions of the receptor concentrations

Parameters:
  • model_name – name for the model (str)

  • nplot – number of results to plot (int) or fraction of results to plot (float, nplot<=1)

  • receptor_spacing – space between receptor points (decimal years)

  • percentiles – tuple of percentiles to plot fill between. e.g. a percentile of 5 will plot the 5th and 95th

  • pdf_alpha – alpha for the pdfs. Note that alphas are additive (e.g. if there are 4 percentiles then the alpha for each percentile will be [pdf_alpha, pdf_alpha * 2, pdf_alpha * 3, pdf_alpha * 4])

  • plot_inflection_points – bool, whether to plot the inflection points

  • plot_annual_vlines – bool, whether to plot annual vertical lines

  • tvals – None or list of times to sample at (years since the data starts, self.start_date). Note that tvals serializes the data by hash, so we suggest rounding to a reasonable precision before passing them to this method.

  • project_forward – None or int, if an int then project the best n params forward by this many years at the end of the time series

  • forward_ts_deltas – None or pd.Series (index = decimal years since the end date (self.age_bounds[1]), values = deltas from time zero). If the index of forward_ts_deltas covers less than project_forward, the remainder of the assumed time series is assumed to have the same delta as the last point. Note that forward_ts_deltas serializes the data by hash, so we suggest rounding to a reasonable precision before passing it to this method.

  • forward_ts_deltas_type – str, ‘absolute’ (default) or ‘fraction’. Where ss_conc is the steady state concentration (conc. at self.age_bounds[1]), the forward source is calculated as:

  • if ‘absolute’:
    • forward_source = ss_conc + forward_ts_deltas

  • if ‘fraction’:
    • forward_source = ss_conc * forward_ts_deltas

Parameters:
  • figsize – size for the figure

  • plot_measured – bool, whether to plot the measured data on the plot

  • add_rmse – bool, whether to add the rmse to the plot. Note this will require generating the receptor concentration at all self.t times, which may add significant time to the plot generation, particularly if the predictions are not serialized

  • ax_recept – None or axis to plot on

  • allsite_on_one – not used for BASESolver

  • rt_leg_hand_lab – if True also return the handles and labels for the legend

Returns:

fig, ax_recept if rt_leg_hand_lab is False (default)

Returns:

fig, ax_recept, (handles, labels), rmse if rt_leg_hand_lab is True

plot_best_source(model_name, nplot, percentiles=(1, 5, 10, 25), plot_inflection_points=True, plot_annual_vlines=True, project_forward=None, forward_ts_deltas=None, forward_ts_deltas_type='absolute', pdf_alpha=0.1, figsize=(10, 10), plot_measured=True, ax_source=None, rt_leg_hand_lab=False)#

plot fill between plots of the best n params and the predicted receptor concentrations

Parameters:
  • model_name – name for the model (str)

  • nplot – number of results to plot (int) or fraction of results to plot (float, nplot<=1)

  • percentiles – tuple of percentiles to plot fill between. e.g. a percentile of 5 will plot the 5th and 95th

  • plot_inflection_points – bool, whether to plot the inflection points

  • plot_annual_vlines – bool, whether to plot annual vertical lines

  • project_forward – None or int, if an int then project the best n params forward by this many years at constant concentration

  • forward_ts_deltas – None or pd.Series (index = decimal years since the end date (self.age_bounds[1]), values = deltas from time zero). If the index of forward_ts_deltas covers less than project_forward, the remainder of the assumed time series is assumed to have the same delta as the last point. Note that forward_ts_deltas serializes the data by hash, so we suggest rounding to a reasonable precision before passing it to this method.

  • forward_ts_deltas_type – str, ‘absolute’ (default) or ‘fraction’. Where ss_conc is the steady state concentration (conc. at self.age_bounds[1]), the forward source is calculated as:

  • if ‘absolute’:
    • forward_source = ss_conc + forward_ts_deltas

  • if ‘fraction’:
    • forward_source = ss_conc * forward_ts_deltas

Parameters:
  • pdf_alpha – alpha for the pdfs. Note that alphas are additive (e.g. if there are 4 percentiles then the alpha for each percentile will be [pdf_alpha, pdf_alpha * 2, pdf_alpha * 3, pdf_alpha * 4])

  • figsize – size for the figure

  • plot_measured – bool, whether to plot the measured data on the plot

  • ax_source – None or axis to plot on

  • rt_leg_hand_lab – if True also return the handles and labels for the legend

Returns:

fig, ax_source - if rt_leg_hand_lab is False (default)

Returns:

fig, ax_source (handles, labels), rmse - if rt_leg_hand_lab is True

plot_best_source_receptor(model_name, nplot, receptor_spacing=1, sharey=True, percentiles=(1, 5, 10, 25), title_additon='', pdf_alpha=0.1, plot_inflection_points=True, plot_annual_vlines=True, tvals=None, project_forward=None, forward_ts_deltas=None, forward_ts_deltas_type='absolute', figsize=(10, 10), plot_measured=True, add_rmse=False, ax_source=None, ax_recept=None, allsite_on_one=False, rt_leg_hand_lab=False)#

plot fill between plots of the best n params and the predicted receptor concentrations

Parameters:
  • model_name – name for the model (str)

  • nplot – number of results to plot (int) or fraction of results to plot (float, nplot<=1)

  • receptor_spacing – space between receptor points (decimal years)

  • sharey – bool, whether to share the y axis between the source and receptor plots

  • percentiles – tuple of percentiles to plot fill between. e.g. a percentile of 5 will plot the 5th and 95th

  • title_additon – str, appended to the title; note no character is inserted between the model name and the addition string

  • pdf_alpha – alpha for the pdfs. Note that alphas are additive (e.g. if there are 4 percentiles then the alpha for each percentile will be [pdf_alpha, pdf_alpha * 2, pdf_alpha * 3, pdf_alpha * 4])

  • plot_inflection_points – bool, whether to plot the inflection points

  • plot_annual_vlines – bool, whether to plot annual vertical lines

  • tvals – None or list of times to sample at (years since the data starts, self.start_date). Note that tvals serializes the data by hash, so we suggest rounding to a reasonable precision before passing them to this method.

  • project_forward – None or int, if an int then project the best n params forward by this many years at the end of the time series

  • forward_ts_deltas – None or pd.Series (index = decimal years since the end date (self.age_bounds[1]), values = deltas from time zero). If the index of forward_ts_deltas covers less than project_forward, the remainder of the assumed time series is assumed to have the same delta as the last point. Note that forward_ts_deltas serializes the data by hash, so we suggest rounding to a reasonable precision before passing it to this method.

  • forward_ts_deltas_type – str, ‘absolute’ (default) or ‘fraction’. Where ss_conc is the steady state concentration (conc. at self.age_bounds[1]), the forward source is calculated as:

  • if ‘absolute’:
    • forward_source = ss_conc + forward_ts_deltas

  • if ‘fraction’:
    • forward_source = ss_conc * forward_ts_deltas

Parameters:
  • figsize – size for the figure

  • plot_measured – bool, whether to plot the measured data on the prediction plot

  • add_rmse – bool, whether to add the rmse to the plot. Note this will require generating the receptor concentration at all self.t times, which may add significant time to the plot generation, particularly if the predictions are not serialized

  • ax_source – matplotlib axis to plot the source data on or None to create a new figure

  • ax_recept – matplotlib axis to plot the receptor data on or None to create a new figure

  • allsite_on_one – not used in BASESolver

  • rt_leg_hand_lab – if True also return the handles and labels for the legend

Returns:

fig, (ax_source, ax_recept) - if rt_leg_hand_lab is False (default)

Returns:

fig, (ax_source, ax_recept), (src_handles, src_labels), (rec_handles, rec_labels), rmse - if rt_leg_hand_lab is True

plot_breakpoints(ax=None)#

plot the inflection points on the time series data

Parameters:

ax – matplotlib axis to plot on or None to create a new figure

Returns:

plot_log_ps(model_name, ax=None)#

plot the log likelihoods for each chain

Parameters:
  • model_name – name for the model (str)

  • ax – matplotlib axis to plot on or None to create a new figure

Returns:

plot_measured(ax, handles, labels, c=None, site=None, label='Measured data', alpha=1, **kwargs)#

plot the measured data on the plot

Parameters:
  • ax – matplotlib axis to plot on

  • handles – handles for the legend (don’t use ax.legend() as it may cause weird behavior)

  • labels – labels for the legend (don’t use ax.legend() as it may cause weird behavior)

  • c – color if None use ‘r’

  • site – site to plot (not used in BASESolver)

  • label – label for the measured data

  • alpha – alpha for the measured data

  • kwargs – other kwargs for the scatter plot

Returns:

plot_single_pdf(model_name, baselab, ax, percentiles, data, xdata, project_forward, pdf_alpha, plot_inflection_points, plot_annual_vlines, plot_xlab, plot_ylab, plot_measured, add_rmse, site=None, plot_color='b', leg_color=None, data_color=None)#

Actually plot the pdfs for the source or receptor data. This is used internally both for the source and receptor data for the plot_best_source_receptor, plot_best_source, and plot_best_receptor methods. It can also be used externally to plot pdfs in the same fashion to support custom plotting. Note that it expects specific data formats, so read the docs!

Parameters:
  • model_name – name for the model (str)

  • baselab – the fundamental label for the dataset (e.g. Source, Receptor… etc.)

  • ax – matplotlib axis to plot on

  • percentiles – tuple of percentiles to plot fill between. e.g. a percentile of 5 will plot the 5th and 95th

  • data – the data to plot (np.array) shape=(nsims, time). typically the data will be either source or receptor data from self.make_source_receptor_data

  • xdata – the xdata to plot against (np.array) shape=(time,). typically the xdata will be either source or receptor points from self.make_source_receptor_data

  • project_forward – None or int, if an int then project the best n params forward by this many years at the end of the time series

  • pdf_alpha – alpha for the pdfs. Note that alphas are additive (e.g. if there are 4 percentiles then the alpha for each percentile will be [pdf_alpha, pdf_alpha * 2, pdf_alpha * 3, pdf_alpha * 4])

  • plot_inflection_points – bool, whether to plot the inflection points

  • plot_annual_vlines – bool, whether to plot annual vertical lines

  • plot_xlab – bool, whether to set the xlabel

  • plot_ylab – bool, whether to set the ylabel

  • plot_measured – bool, whether to plot the measured data

  • add_rmse – bool, whether to calculate the rmse of the model, which will be returned

  • site – None or the site to plot. site must == None in BASESolver. If the site is None in the multi solvers and plot_measured then the measured data will be plotted for all sites

  • plot_color – color for the pdfs

  • leg_color – color for the pdfs on the legend (if None then plot_color is used)

  • data_color – color for the measured data if None then ‘r’ will be used (BASESolver) or the site color will be used (MultiSolver+)

Returns:

out_ups: the upper percentiles (useful for setting ylims)

Returns:

out_downs: the lower percentiles (useful for setting ylims)

Returns:

handles: handles for the legend (don’t use ax.legend() as it may cause weird behavior)

Returns:

labels: labels for the legend (don’t use ax.legend() as it may cause weird behavior)

Returns:

rmse: float or None, the rmse of the model if add_rmse is True
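The percentiles and pdf_alpha conventions above can be sketched directly (a minimal illustration with synthetic data; the fill_between call is what plot_single_pdf does internally, shown here as a comment):

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic stand-in for data from self.make_source_receptor_data
data = rng.normal(10.0, 2.0, size=(500, 50))   # shape = (nsims, time)

percentiles = (1, 5, 10, 25)
pdf_alpha = 0.1

# a percentile of 5 means the band between the 5th and 95th percentiles
bands = {p: (np.percentile(data, p, axis=0),
             np.percentile(data, 100 - p, axis=0)) for p in percentiles}

# alphas are additive: pdf_alpha, pdf_alpha * 2, pdf_alpha * 3, ...
alphas = [pdf_alpha * (i + 1) for i in range(len(percentiles))]
# for (low, high), a in zip(bands.values(), alphas):
#     ax.fill_between(xdata, low, high, alpha=a)  # as plot_single_pdf does
```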

predict_bespoke(ts_concs, receptor_tvals, site=None, use_dilution=0)#

Predict receptor concentrations from a bespoke source concentration time series at a given set of receptor time values. This is useful for predicting the concentration at a receptor point that is not in the original time series. Note that this is an advanced feature and may be more difficult to use.

Parameters:
  • ts_concs – Pd.Series of source concentrations (index = age in years since start date, self.start_date) values = concentration, Note that the full period of the source data must be passed in, not just the time series of interest.

  • receptor_tvals – list of times to sample at (years since data starts, self.start_date)

  • site – dummy to support same call in MultiSiteDilutionBASESolver etc.

  • use_dilution – dummy to support same call in MultiSiteDilutionBASESolver etc.

Returns:

pd.Series of receptor concentrations (index = receptor_tvals, values = concentration)
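A minimal sketch of the expected ts_concs format, using made-up ages and concentrations (the call itself needs a configured solver instance, so it is commented out):

```python
import numpy as np
import pandas as pd

# index = age in decimal years since self.start_date; the FULL source period
# must be covered, not just the window of interest (values here are made up)
ages = np.round(np.arange(0.0, 60.0, 0.01), 2)   # 60 yrs at self.precision = 2
ts_concs = pd.Series(np.linspace(0.5, 4.0, ages.size), index=ages)

receptor_tvals = [50.0, 52.5, 55.0]   # decimal years since self.start_date

# receptor = solver.predict_bespoke(ts_concs, receptor_tvals)  # hypothetical solver
# receptor would be a pd.Series with index == receptor_tvals
```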

run_base(model_name, params, starts=None, verbose=None, nverbose=100, start_random=True, hardboundaries=True, multitry=False, rerun=False, test_run=False, _clear_test_data=True, _raise_exception_on_test=True, raise_if_run=False, **kwargs)#

run the BASE model. This method will run the model using pydream and save the results to the save_dir.

  • If the model has already been run (and rerun=False) then it will check the log files to ensure that the args/kwargs/instance.attrs match the expected args/kwargs/instance.attrs.

  • If the model has not been run then it will run the model and save the results to the save_dir.

  • If test_run is True then it will run the model for the first 10 iterations and then raise an InterruptedError after removing the test data (assuming _clear_test_data=True). This is useful for testing the setup of the model. You may also want to set _raise_exception_on_test=False to avoid the exception being raised.

Parameters:
  • model_name – name for the model (str)

  • params – A list of pydream.parameters.SampledParam objects. Note SampledParam uses scipy, so you can pass either a list of instances or an instance that will produce the correct number of parameters, e.g. [SampledParam(uniform, [1, 2], [2, 3])] should be equivalent to [SampledParam(uniform, 1, 2), SampledParam(uniform, 2, 3)]. You can also pass a single SampledParam instance, which will be used for all inflection points. Be careful with scipy’s uniform distribution, which uses loc and scale rather than min and max. You can also pass a komanawa.BASE.generators.uniform_path_generator.UniformPathGenerator or komanawa.BASE.generators.normal_path_generator.NormalPathGenerator instance as parameters.

  • starts – Either a list of start locations to initialize chains in, or a single start location to initialize all chains in. Default: None, note that if start_random is True then this will be ignored/overwritten. Starts can be provided as a 1d iterable (nparams,) or a 2d iterable (nchains, nparams)

  • verbose – None (use self.verbose) or bool Whether to print verbose output (including acceptance or rejection of moves and the current acceptance rate)

  • nverbose – The number of iterations between verbose output. Default = 100

  • start_random – Whether to initialize chains from a random point in parameter space drawn from the prior (default = True). Will override any starting position set when sample was called.

  • hardboundaries – Whether to reflect points back into the bounds of a hard prior (i.e., if using a uniform prior, reflect points outside of the boundaries back in, so you don’t waste time looking at points with logpdf = -inf).

  • multitry – {bool, int} Whether to utilize multi-try sampling. Default is False. If set to True, 5 multiple tries will be used. Can also directly specify an integer if desired.

  • rerun – bool, whether to rerun the model if it already exists

  • test_run – bool, if True then setup the model and run for the first 10 iterations to test the setup, note that this will clear any saved data after running then raise InterruptedError

  • _clear_test_data – bool, if True then clear the data after running, only valid if test_run is True, used for testing only

  • _raise_exception_on_test – bool, if True then raise an InterruptedError after running the test, only valid if test_run is True, used for testing only

  • raise_if_run – bool, if True then raise an InterruptedError if the model would actually run base. This is used for checking that runs exist, parameters are set effectively, but avoiding large re-runs

  • kwargs – other kwargs passed to pydream.core.run_dream

Returns:

run: bool, True if the model was run
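The loc/scale caveat for scipy’s uniform distribution is worth checking explicitly; a prior on [low, high] needs scale = high - low:

```python
from scipy.stats import uniform

low, high = 1.0, 3.0
prior = uniform(loc=low, scale=high - low)   # uniform on [1, 3]
# NOT uniform(1, 3), which would be uniform on [1, 4]

# the vectorized form uniform([1, 2], [2, 3]) freezes two priors, on [1, 3]
# and [2, 5], mirroring the SampledParam(uniform, [1, 2], [2, 3]) note above
```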

class BaseSampledParam(distribution, *args, **kwargs)#

Schema for a sampled parameter which captures the distribution and arguments for the distribution to check that the distribution is valid and has not changed.

Parameters:
  • distribution – The distribution to use (e.g. scipy.stats.uniform, scipy.stats.norm)

  • args – The arguments to pass to the distribution

  • kwargs – The keyword arguments to pass to the distribution

serialize()#

Serialize the object to a unique string to capture metadata

Returns:

class MultiSiteBASESolver(save_dir, n_inflections, ts_data, mrt, mrt_p1, mrt_p2, frac_p1, f_p1, f_p2, precision=2, min_conc=0.01, inflection_times=None, cdf_inflection_start=None, sigma=None, interpolator='linear', interpolator_kwargs=None, serialize_receptor_data=True, predict_ncores=None, nchains=5, niterations=50000, start_age=np.nan, cmap='tab20', verbose=True)#

Bases: komanawa.BASE.BASE_solver.BASESolver

Inheritance diagram of komanawa.BASE.MultiSiteBASESolver

class for predicting the source concentration from the BPEFM model across multiple sites. This class is a subclass of BASESolver

BASE: Bayesian approach to source estimation. This class uses MT-DREAM(ZS) algorithm to conduct Bayesian inference on the source concentration for a measured time series data and a BPEFM model.

The MT-DREAM(ZS) algorithm was introduced in: Laloy, E. & Vrugt, J. A. High-dimensional posterior exploration of hydrologic models using multiple-try DREAM(ZS) and high-performance computing. Water Resources Research 48, W01526 (2012).

Instance Variables:

Variables:
  • age_bounds – Age bounds for the model (years since start date)

  • age_fractions – Dictionary of Age fractions for age in the BEPFM model (see komanawa.gw_age_tools.exponential_piston_flow.make_age_dist), keys = site

  • age_step – Age step for the BEPFM model (see komanawa.gw_age_tools.exponential_piston_flow.make_age_dist)

  • all_ages – Dictionary of Ages for the BEPFM model (see komanawa.gw_age_tools.exponential_piston_flow.make_age_dist), keys = site

  • end_date – End date of the observed time series data (pd.Timestamp)

  • f_p1 – array of the fraction of exponential flow in the first peak (age model parameter), in order of self.sites

  • f_p2 – array of the fraction of exponential flow in the second peak (age model parameter), in order of self.sites

  • frac_p1 – array of the fraction of the total mass in the first peak (age model parameter), in order of self.sites

  • inflection_times – Inflection times for the model (years since start date). Source concentration will be parameterized at these times. Note that len(inflection_times) = n_inflections + 2, where n_inflections is passed in the init as the method automatically includes the start and end of the time series

  • measured_age_bounds – Age bounds for the measured time series data (years since start date)

  • mrt – array of Mean residence time for the age distribution (yrs) (age model parameter), in order of self.sites

  • mrt_p1 – array of Mean residence time for the first peak (yrs) (age model parameter), in order of self.sites

  • mrt_p2 – array of Mean residence time for the second peak (yrs) (age model parameter), in order of self.sites

  • nchains – Number of pydream chains for the model. note each chain is a separate process so if nchains > the processors available then the process will be MUCH slower

  • niterations – The number of algorithm iterations to run.

  • precision – precision of the age distribution (decimal places years) default is 2, approximately monthly

  • predict_ncores – Number of cores for prediction (see param)

  • save_dir – working directory for the model

  • serialize_receptor_data – Boolean flag to serialize receptor data

  • start_date – Start date of the time series data (pd.Timestamp)

  • sites – set of keys for the time series data, essentially a set of sites.

  • all_t – dictionary of time index for the time series data (decimal years since start date), keys = site

  • all_ts_data – dictionary of Time series data (pandas Series) Index=self.t, values=concentration, keys = site

Class Attributes:

Variables:
  • available_interpolators – tuple of available interpolators

  • _force_old_predict – a flag to force the use of the old predict method for testing purposes, must be False

  • ts_data – key instance attrs for BASESolver that must be None for MultiSiteBASESolver

  • sigma – key instance attrs for BASESolver that must be None for MultiSiteBASESolver

  • t – key instance attrs for BASESolver that must be None for MultiSiteBASESolver

  • ages – key instance attrs for BASESolver that must be None for MultiSiteBASESolver

  • age_fractions – key instance attrs for BASESolver that must be None for MultiSiteBASESolver

  • _int_ages – key instance attrs for BASESolver that must be None for MultiSiteBASESolver

Basic Usage:

from komanawa.BASE import MultiSiteBASESolver, generators, BaseSampledParam
from pathlib import Path
import matplotlib.pyplot as plt

# Set up the problem
first_model_name = 'first_model'
n_inflections = 5  # this will increase to 7 with the addition of a start and end point
ndata_site1 = get_ndata1()  # get your data, this should be a pandas series with a datetime index
ndata_site2 = get_ndata2()  # get your data, this should be a pandas series with a datetime index
output_dir = Path.home()
cdf_threshold = 0.15
mrt_site1 = 12
f_p1_site1 = 0.7
mrt_site2 = 20
f_p1_site2 = 0.5

# create the solver
solver = MultiSiteBASESolver(
            save_dir=output_dir,
            ts_data=dict(site1=ndata_site1, site2=ndata_site2),
            mrt=dict(site1=mrt_site1, site2=mrt_site2),
            mrt_p1=dict(site1=mrt_site1, site2=mrt_site2),
            mrt_p2=dict(site1=None, site2=None),
            frac_p1=dict(site1=1, site2=1),
            f_p1=dict(site1=f_p1_site1, site2=f_p1_site2),
            f_p2=dict(site1=f_p1_site1, site2=f_p1_site2),
            n_inflections=n_inflections,
            cdf_inflection_start=cdf_threshold,
        )

# set priors
lows = [0.001, 0.001, 0.1, 0.1, 0.2, 0.5, 1.0]
ups = [0.5, 2, 6, 6, 6, 6, 6]
params = BaseSampledParam(distribution=generators.UniformPathGenerator, low=lows, up=ups, n=7)

# run the model
run = solver.run_base(model_name=first_model_name,
        params=params, verbose=False)

# plot the results
nplot = 0.2 # plot 20% of the results
fig, ax = solver.plot_best_source_receptor(first_model_name, nplot=nplot, receptor_spacing=1)
plt.show()
Parameters:
  • save_dir – directory to run dreamzs in, files are named via model name kwarg

  • n_inflections – number of inflection times to use (int) or frequency of inflection times (see https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-offset-aliases)

  • ts_data – dictionary of time series data to fit to, sets the keys for the model, keys must be strings, values must be pandas Series with datetime index, monotonically increasing index, and no NaN values

  • mrt – dictionary of mean residence time of the source (yrs), keys must match ts_data

  • mrt_p1 – dictionary of mean residence time of the first piston (yrs), keys must match ts_data

  • mrt_p2 – dictionary of mean residence time of the second piston (yrs), keys must match ts_data

  • frac_p1 – dictionary of fraction of the source that is in the first piston, keys must match ts_data

  • f_p1 – dictionary of the fraction of exponential flow in the first piston, keys must match ts_data

  • f_p2 – dictionary of the fraction of exponential flow in the second piston (use a dummy value if frac_p1 = 1), keys must match ts_data

  • precision – precision of the age distribution (decimal places years) default is 2, approximately monthly

  • min_conc – minimum concentration for the interpolated source concentration, default is 0.01. This prevents weird interpolation behaviours in the early, poorly constrained part of the time series. This is not a hard limit, but rather a floor for the interpolated source concentration: the actual value is set to the maximum of the interpolated value and min_conc.

  • inflection_times – None or list of datetimes of the inflection times. If None then the inflection times will be evenly spaced between the start and end of the ages + time series. Only one of cdf_inflection_start or inflection_times can be passed (the other must be None). inflection_times must be None if n_inflections is a frequency code.

  • cdf_inflection_start – None or float, the threshold for the cdf of ages at which to start the evenly spaced inflection points. Only one of cdf_inflection_start or inflection_times can be passed (the other must be None).

  • sigma – None or a dictionary of floats or arrays of floats, the uncertainty in the time series data. If None then no uncertainty is used and self.log_likelihood_no_sigma is used; if a float then the same uncertainty is used for all points; if an array then the uncertainty is used for each point. Whenever sigma is not None, self.log_likelihood_sigma is used. If a dictionary then the keys must match the keys of ts_data.

  • interpolator

    str, the interpolator to use to interpolate the source concentration from the n_inflection points to the needed frequency for the 1d age function (BEPFM), options are:

    • ’linear’: piecewise linear interpolation (default)

    • ’cubicspline’: cubic spline interpolation, see scipy.interpolate.CubicSpline

    • ’pchip’: PCHIP 1-D monotonic cubic interpolation, see scipy.interpolate.PchipInterpolator

    • ’akima’: Akima interpolation, see scipy.interpolate.Akima1DInterpolator

  • interpolator_kwargs – None or dict, additional kwargs passed to the interpolator (see interpolator)

  • serialize_receptor_data – bool, whether to serialize the receptor data for later use (increases disk use but decreases runtime)

  • predict_ncores – None or int, number of cores to use for prediction, if None then use all available (virtual) cores

  • nchains – number of chains to run in dreamzs note each chain is a separate process so if nchains > the processors available then the process will be MUCH slower

  • niterations – The number of algorithm iterations to run. The default is 50000. Higher values will take longer to run but may produce better results.

  • start_age – start age for the age distribution (yrs) default is np.nan which will use the maximum of the mrt_p1 and mrt_p2

  • cmap – str, the colormap to use for the plot, see matplotlib.pyplot.get_cmap

  • verbose – bool, whether to print the progress of the model
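The interpolator options and the min_conc floor can be sketched with scipy directly (synthetic inflection-point values; this only illustrates the interpolation step, not the full solver):

```python
import numpy as np
from scipy.interpolate import CubicSpline, PchipInterpolator

# made-up source concentrations at the inflection times
inflection_t = np.array([0.0, 5.0, 10.0, 15.0, 20.0])
inflection_c = np.array([0.01, 0.02, 1.5, 3.0, 2.8])

fine_t = np.round(np.arange(0.0, 20.0 + 0.01, 0.01), 2)  # precision=2, ~monthly

spline = CubicSpline(inflection_t, inflection_c)(fine_t)       # can over/undershoot
pchip = PchipInterpolator(inflection_t, inflection_c)(fine_t)  # monotone between knots

# min_conc is a floor, not a hard limit: each interpolated value is
# replaced by max(value, min_conc)
min_conc = 0.01
source = np.maximum(pchip, min_conc)
```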

export_source_receptor(model_name, outdir, nexport, receptor_spacing=1, percentiles=(1, 5, 10, 25), tvals=None, project_forward=None, forward_ts_deltas=None, forward_ts_deltas_type='absolute', percentile_format='hdf', include_raw=False)#

export the source and receptor data to csv files or hdf file store

The HDF file will contain the following keys; the csv files will have the same suffixes as the keys below:

  • source_percentiles: source data percentiles

  • {site}_receptor_percentiles: receptor data percentiles

  • {site}_measured: measured data

  • source_raw: optional if include_raw=True, source data

  • {site}_receptor_raw: optional if include_raw=True, receptor data

Parameters:
  • model_name – name for the model (str)

  • outdir – output directory

  • nexport – number of results to export (int) or fraction of results to export (float, nexport<=1)

  • receptor_spacing – space between receptor prediction points

  • percentiles – tuple of percentiles to export fill between. e.g. a percentile of 5 will plot the 5th and 95th

  • tvals – None or list of times to sample at (years since data starts, self.start_date), Note tvals serialize the data by hash so we suggest rounding to a reasonable precision before passing them to this method.

  • project_forward – None or int, if an int then project the best n params forward by this many years at the end of the time series

  • forward_ts_deltas – None or pd.Series (index = decimal years since the end date (self.age_bounds[1]), values = deltas from time zero). If the index of forward_ts_deltas ends before project_forward, the remainder of the assumed time series is assumed to have the same delta as the last point. Note forward_ts_deltas serializes the data by hash, so we suggest rounding to a reasonable precision before passing it to this method.

  • forward_ts_deltas_type – str, ‘absolute’ (default) or ‘fraction’. Where ss_conc is the steady state concentration (conc. at self.age_bounds[1]), the forward source is calculated as:

  • if ‘absolute’:
    • forward_source = ss_conc + forward_ts_deltas

  • if ‘fraction’:
    • forward_source = ss_conc * forward_ts_deltas
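The two forward_ts_deltas_type options amount to the following arithmetic (made-up values; ss_conc stands in for the concentration at self.age_bounds[1]):

```python
import numpy as np
import pandas as pd

ss_conc = 4.0   # hypothetical steady state concentration at self.age_bounds[1]

# index = decimal years past the end date, values = deltas from time zero
abs_deltas = pd.Series([0.0, -0.5, -1.0], index=[0.0, 5.0, 10.0])
frac_deltas = pd.Series([1.0, 0.9, 0.8], index=[0.0, 5.0, 10.0])

forward_abs = ss_conc + abs_deltas    # 'absolute': add the deltas
forward_frac = ss_conc * frac_deltas  # 'fraction': scale by the deltas

# if project_forward extends past the last index (10 yrs here), the last
# delta is held constant for the remainder of the projection
```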

Parameters:
  • percentile_format – str, ‘hdf’ or ‘csv’ format to save the percentiles

  • include_raw – bool, if True include the raw data in the output

Returns:

get_colors(rt_dict=False, repeat_threshold=None)#

get colors for a list of values

Parameters:
  • rt_dict – bool if true return a dictionary of colors else return a list of colors

  • repeat_threshold

Returns:

list of colors or dictionary

get_rmse(model_name, ntest, raw=False)#

get the rmse for the model run from the model name

Parameters:
  • model_name – name for the model (str)

  • ntest – number of test simulations to use for the rmse

  • raw – if True return the raw rmse values, if False return the 50th percentile and 5th and 95th percentiles

Returns:

rmse (float)

log_likelihood_no_sigma(params)#

calculate the log likelihood of the model (where the sigma of the observations IS NOT provided). The log likelihood without observation sigma function can be changed by subclassing this class and overwriting this method.

Parameters:

params – sampled parameters for the model instance (see BASESolver.run_base)

Returns:

likelihood (float)

log_likelihood_sigma(params)#

calculate the log likelihood of the model (where the sigma of the observations IS provided). The log likelihood function with observation sigma can be changed by subclassing this class and overwriting this method.

Parameters:

params – sampled parameters for the model instance (see BASESolver.run_base)

Returns:

likelihood (float)

make_source_receptor_data(model_name, nplot, receptor_spacing=1, tvals=None, project_forward=None, forward_ts_deltas=None, forward_ts_deltas_type='absolute', source_only=False, recalc=False, _return_ncomputed=False, return_interpolated_source=True)#

make the source and receptor data for the run model. if self.serialize_receptor_data is True then the receptor data will be saved to a npz file for later use, and any receptors already computed will be loaded from the file. If the file is not found then the receptor data will be recomputed. The process of predicting the receptor data is parallelized using multiprocessing, but still can take some time, especially for large numbers of simulations.

The save file is unique to the model_name, receptor_spacing, project_forward, and tvals. Note that changes to the class instance attributes are not checked and will not cause the data to be recomputed.

Note that passing tvals or forward_ts_deltas, will serialize the receptor data with the hash of the tvals/forward_ts_deltas used in the filename. Tvals/forward_ts_deltas will also be saved in the npz file and checked on reload. Note that minor changes in the tvals/forward_ts_deltas (e.g. float imprecision) will cause the hash to change and the data to be recomputed. Therefore we suggest rounding tvals/forward_ts_deltas to a reasonable precision before passing them to this method.

Parameters:
  • model_name – name for the model (str)

  • nplot – number of results to plot (int) or fraction of results to plot (float, nplot<=1)

  • receptor_spacing – the number of years between receptor prediction points

  • tvals – None or list of times to sample at (years since data starts, self.start_date), Note tvals serialize the data by hash so we suggest rounding to a reasonable precision before passing them to this method.

  • project_forward – None or int, if an int then project the best n params forward by this many years at the end of the time series

  • forward_ts_deltas – None or pd.Series (index = decimal years since the end date (self.age_bounds[1]), values = deltas from time zero). If the index of forward_ts_deltas ends before project_forward, the remainder of the assumed time series is assumed to have the same delta as the last point. Note forward_ts_deltas serializes the data by hash, so we suggest rounding to a reasonable precision before passing it to this method.

  • forward_ts_deltas_type – str, ‘absolute’ (default) or ‘fraction’. Where ss_conc is the steady state concentration (conc. at self.age_bounds[1]), the forward source is calculated as:

  • if ‘absolute’:
    • forward_source = ss_conc + forward_ts_deltas

  • if ‘fraction’:
    • forward_source = ss_conc * forward_ts_deltas

Parameters:
  • source_only – bool, if True only return the source data (don’t compute the receptor data)

  • recalc – bool, if True then recalculate the receptor data even if it has been serialized

  • return_interpolated_source – bool, if True return the source data interpolated to self.precision

  • _return_ncomputed – bool, if True return the number of computed receptor data points, for testing only

Returns:

source_data: source concentration predictions (ndarray, shape=(nsims, nsource_points); interpolated to self.precision if return_interpolated_source=True)

Returns:

source_points: source concentration timings in years since start date (ndarray, shape=(nsource_points,)); nsource_points is self.n_inflections if project_forward is None, otherwise self.n_inflections + 1 (interpolated to self.precision if return_interpolated_source=True)

Returns:

outreceptor_data: dict {site: data}: receptor concentration predictions (ndarray, shape=(nsims, nreceptor_points))

Returns:

receptor_points: dict {site: data}: receptor timings in years since start date (ndarray, shape=(nreceptor_points,))

plot_best_receptor(model_name, nplot, receptor_spacing=1, percentiles=(1, 5, 10, 25), pdf_alpha=0.1, plot_inflection_points=True, plot_annual_vlines=True, tvals=None, project_forward=None, forward_ts_deltas=None, forward_ts_deltas_type='absolute', figsize=(10, 10), plot_measured=True, add_rmse=False, ax_recept=None, allsite_on_one=False, rt_leg_hand_lab=False)#

plot the best n predictions of the receptor concentrations for each site (one per plot)

Parameters:
  • model_name – name for the model (str)

  • nplot – number of results to plot (int) or fraction of results to plot (float, nplot<=1)

  • receptor_spacing – space between receptor points (decimal years)

  • percentiles – tuple of percentiles to plot fill between. e.g. a percentile of 5 will plot the 5th and 95th

  • pdf_alpha – alpha for the pdfs note that alphas are additive (e.g. if there are 4 percentiles then the alpha for each percentile will be [pdf_alpha, pdf_alpha * 2, pdf_alpha * 3, pdf_alpha * 4])

  • plot_inflection_points – bool, whether to plot the inflection points

  • plot_annual_vlines – bool, whether to plot annual vertical lines

  • tvals – None or list of times to sample at (years since data starts, self.start_date), Note tvals serialize the data by hash so we suggest rounding to a reasonable precision before passing them to this method.

  • project_forward – None or int, if an int then project the best n params forward by this many years at the end of the time series

  • forward_ts_deltas – None or pd.Series (index = decimal years since the end date (self.age_bounds[1]), values = deltas from time zero). If the index of forward_ts_deltas ends before project_forward, the remainder of the assumed time series is assumed to have the same delta as the last point. Note forward_ts_deltas serializes the data by hash, so we suggest rounding to a reasonable precision before passing it to this method.

  • forward_ts_deltas_type – str, ‘absolute’ (default) or ‘fraction’. Where ss_conc is the steady state concentration (conc. at self.age_bounds[1]), the forward source is calculated as:

  • if ‘absolute’:
    • forward_source = ss_conc + forward_ts_deltas

  • if ‘fraction’:
    • forward_source = ss_conc * forward_ts_deltas

Parameters:
  • figsize – size for the figure

  • plot_measured – bool, whether to plot the measured data on the plot

  • add_rmse – bool, whether to add the rmse to the plot note this will require generating the receptor concentration at all self.t times, which may add significant time to the plot generation, particularly if the predictions are not serialized

  • ax_recept – must be None, just to match BASESolver call.

  • allsite_on_one – bool, if True plot all sites on one plot, false make a plot for each site

  • rt_leg_hand_lab – bool, if True also return the handles and labels for the legend

Returns:

dictionary {site: (fig, ax_recept)} - if rt_leg_hand_lab is False (default)

Returns:

dictionary {site: (fig, ax_recept)}, {site: (rec_hand, rec_lab)}, rmse - if rt_leg_hand_lab is True

plot_best_source_receptor(model_name, nplot, receptor_spacing=1, sharey=True, percentiles=(1, 5, 10, 25), title_additon='', pdf_alpha=0.1, plot_inflection_points=True, plot_annual_vlines=True, tvals=None, project_forward=None, forward_ts_deltas=None, forward_ts_deltas_type='absolute', figsize=(10, 10), plot_measured=True, add_rmse=False, ax_source=None, ax_recept=None, allsite_on_one=False, rt_leg_hand_lab=False)#

plot fill between plots of the best n params and the predicted receptor concentrations for each site, one per plot

Parameters:
  • model_name – name for the model (str)

  • nplot – number of results to plot (int) or fraction of results to plot (float, nplot<=1)

  • receptor_spacing – space between receptor points (decimal years)

  • sharey – bool, whether to share the y axis between the source and receptor plots

  • percentiles – tuple of percentiles to plot fill between. e.g. a percentile of 5 will plot the 5th and 95th

  • title_additon – str, text appended to the title; note that no character is inserted between the model name and the addition string

  • pdf_alpha – alpha for the pdfs note that alphas are additive (e.g. if there are 4 percentiles then the alpha for each percentile will be [pdf_alpha, pdf_alpha * 2, pdf_alpha * 3, pdf_alpha * 4])

  • plot_inflection_points – bool, whether to plot the inflection points

  • plot_annual_vlines – bool, whether to plot annual vertical lines

  • tvals – None or list of times to sample at (years since data starts, self.start_date), Note tvals serialize the data by hash so we suggest rounding to a reasonable precision before passing them to this method.

  • project_forward – None or int, if an int then project the best n params forward by this many years at the end of the time series

  • forward_ts_deltas – None or pd.Series (index = decimal years since the end date (self.age_bounds[1]), values = deltas from time zero). If the index of forward_ts_deltas ends before project_forward, the remainder of the assumed time series is assumed to have the same delta as the last point. Note forward_ts_deltas serializes the data by hash, so we suggest rounding to a reasonable precision before passing it to this method.

  • forward_ts_deltas_type – str, ‘absolute’ (default) or ‘fraction’. Where ss_conc is the steady state concentration (conc. at self.age_bounds[1]), the forward source is calculated as:

  • if ‘absolute’:
    • forward_source = ss_conc + forward_ts_deltas

  • if ‘fraction’:
    • forward_source = ss_conc * forward_ts_deltas

Parameters:
  • figsize – size for the figure

  • plot_measured – bool, whether to plot the measured data on the prediction plot

  • add_rmse – bool, whether to add the rmse to the plot note this will require generating the receptor concentration at all self.t times, which may add significant time to the plot generation, particularly if the predictions are not serialized

  • ax_source – must be None

  • ax_recept – must be None

  • allsite_on_one – bool, if True plot all sites on one plot, false make a plot for each site

  • rt_leg_hand_lab – bool, if True also return the handles and labels for the legend

Returns:

dictionary {site: (fig, (ax_source, ax_recept))} - if rt_leg_hand_lab is False (default)

Returns:

dictionary {site: (fig, (ax_source, ax_recept))}, {site: ((src_hand, src_lab), (rec_hand, rec_lab))}, rmse - if rt_leg_hand_lab is True

plot_measured(ax, handles, labels, c=None, site=None, label='Measured data', alpha=1, **kwargs)#

plot the measured data

Parameters:
  • ax – matplotlib axis to plot on

  • handles – handles for the legend (don’t use ax.legend() as it may cause weird behavior)

  • labels – labels for the legend (don’t use ax.legend() as it may cause weird behavior)

  • c – color; if None, use the site colors

  • site – site to plot, if None then plot all sites

  • label – label for the measured data

  • alpha – alpha for the measured data

  • kwargs – other kwargs for the scatter plot

Returns:

predict_bespoke(ts_concs, receptor_tvals, site=None, use_dilution=0)#

Predict receptor concentrations from a bespoke source concentration time series at a given set of receptor time values. This is useful for predicting the concentration at a receptor point that is not in the original time series. Note that this is an advanced feature and may be more difficult to use.

Parameters:
  • ts_concs – Pd.Series of source concentrations (index = age in years since start date, self.start_date) values = concentration, Note that the full period of the source data must be passed in, not just the time series of interest.

  • receptor_tvals – list of times to sample at (years since data starts, self.start_date)

  • site – Site in MultiSiteBASESolver etc.

  • use_dilution – Dilution fraction (to support MultiSiteDilutionBASESolver)

Returns:

pd.Series of receptor concentrations (index = receptor_tvals, values = concentration)

set_log_likelihood_scaler(scales)#

set the scalers for the log likelihood of the model. These scalers are multiplied by the log likelihood to yield the final log likelihood: the log likelihood used for each site is calculated as log_likelihood * scaler[site]. If the scalers are not set by the user then they default to 1 for each site. The scalers must be greater than or equal to 0.

Parameters:

scales – dictionary {site: scaler} (scaler = number >= 0) or None to reset to 1 for all sites

Returns:

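A sketch of how the scalers weight the per-site terms, with made-up log likelihood values (this illustrates the scales[site] weighting, not the solver's internal code):

```python
# illustrative log likelihoods from evaluating the model at each site
site_log_like = {'site1': -120.5, 'site2': -80.2}

# down-weight site2 relative to site1; scalers must be >= 0
# (solver.set_log_likelihood_scaler(scales) would register these)
scales = {'site1': 1.0, 'site2': 0.5}

# the final log likelihood sums the scaled per-site terms
total = sum(site_log_like[s] * scales[s] for s in site_log_like)
```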
class MultiSiteDilutionBASESolver#

Bases: komanawa.BASE.multisite_base_solver.MultiSiteBASESolver

Inheritance diagram of komanawa.BASE.MultiSiteDilutionBASESolver

A MultiSiteBASESolver that includes dilution parameters for each site derived from a truncated half normal distribution.

The dilution parameter definitions are prescribed by the self.set_dilution_parameters method. Otherwise the model is the same as the MultiSiteBASESolver.

The dilution parameters are used to scale the source concentration at each site. Dilution is calculated as:

out_conc = receptor_concentration * (1 - dilution_parameter)

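A quick worked check of the formula with illustrative numbers:

```python
# a dilution_parameter of 0.25 means 25% of the concentration is lost
receptor_concentration = 4.0
dilution_parameter = 0.25
out_conc = receptor_concentration * (1 - dilution_parameter)
# out_conc == 3.0
```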
Basic Usage:

from komanawa.BASE import MultiSiteDilutionBASESolver, generators, BaseSampledParam
from pathlib import Path
import matplotlib.pyplot as plt

# Set up the problem
first_model_name = 'first_model'
n_inflections = 5  # this will increase to 7 with the addition of a start and end point
ndata_site1 = get_ndata1()  # get your data, this should be a pandas series with a datetime index
ndata_site2 = get_ndata2()  # get your data, this should be a pandas series with a datetime index
output_dir = Path.home()
cdf_threshold = 0.15
mrt_site1 = 12
f_p1_site1 = 0.7
mrt_site2 = 20
f_p1_site2 = 0.5

# create the solver
solver = MultiSiteDilutionBASESolver(
            save_dir=output_dir,
            ts_data=dict(site1=ndata_site1, site2=ndata_site2),
            mrt=dict(site1=mrt_site1, site2=mrt_site2),
            mrt_p1=dict(site1=mrt_site1, site2=mrt_site2),
            mrt_p2=dict(site1=None, site2=None),
            frac_p1=dict(site1=1, site2=1),
            f_p1=dict(site1=f_p1_site1, site2=f_p1_site2),
            f_p2=dict(site1=f_p1_site1, site2=f_p1_site2),
            n_inflections=n_inflections,
            cdf_inflection_start=cdf_threshold,
        )

# set the dilution priors
dilution_high = 0.75  # the maximum dilution
dilution_scale = 0.1  # the scale of the truncated half normal distribution
solver.set_dilution_parameters(dilution_high=dilution_high, dilution_scale=dilution_scale)

# set concentration priors
lows = [0.001, 0.001, 0.1, 0.1, 0.2, 0.5, 1.0]
ups = [0.5, 2, 6, 6, 6, 6, 6]
params = BaseSampledParam(distribution=generators.UniformPathGenerator, low=lows, up=ups, n=7)

# run the model
run = solver.run_base(model_name=first_model_name,
        params=params, verbose=False)

# plot the results
nplot = 0.2 # plot 20% of the results
fig, ax = solver.plot_best_source_receptor(first_model_name, nplot=nplot, receptor_spacing=1)
plt.show()

get_dilution_parameters(model_name, nplot, site=None)#

get the dilution parameters for each site

Parameters:
  • model_name – model name

  • nplot – number of results to plot (int) or fraction of results to plot (float, nplot<=1)

  • site – None or site to return

Returns:

pd.DataFrame or pd.Series of dilution parameters

get_ss_conc(model_name, nplot, site=None)#

get the steady state concentration for each site (e.g. last source concentration * (1 - dilution_parameter))

Parameters:
  • model_name – model name

  • nplot – number of results to plot (int) or fraction of results to plot (float, nplot<=1)

  • site – None or site to return

Returns:

pd.DataFrame or pd.Series of steady state concentrations

plot_dilution_parameters(model_name, nplot, ax=None)#

plot the dilution parameters

Parameters:
  • model_name – str, the model name

  • nplot – number of results to plot (int) or fraction of results to plot (float, nplot<=1)

  • ax – None or matplotlib axis, if None create a new figure

Returns:

plot_test_dilution_parameters(model_name, nplot, ktest_kwargs=None, plot_kwargs=None, ax=None)#

plot the results of the KS test on the dilution parameters

Parameters:
  • model_name – str, the model name

  • nplot – number of results to plot (int) or fraction of results to plot (float, nplot<=1)

  • ktest_kwargs – None or dict, kwargs to pass to the KS test (scipy.stats.kstest)

  • plot_kwargs – None or dict, kwargs to pass to the plot method (komanawa.BASE.utils.plot_ktest)

  • ax – None or matplotlib axis, if None create a new figure

Returns:

fig, ax, (handles, labels)

set_dilution_parameters(dilution_high, dilution_scale)#

Set the dilution parameters for each site. The dilution parameters are assumed to be derived from a truncated half normal distribution. The parameters are used to scale the source concentration at each site. Dilution is calculated as:

out_conc = receptor_concentration * (1 - dilution_parameter)

A dilution_parameter of 0 means no dilution; 0.75 means 75% of the source concentration is lost to dilution.

Parameters:
  • dilution_high – float or dict of floats (one per site), the upper bound of the truncated half normal distribution (the maximum dilution)

  • dilution_scale – float or dict of floats (one per site) the scale of the truncated half normal distribution

Returns:

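A sketch of what such a prior looks like, assuming the truncated half normal is zero-mean with upper bound dilution_high and scale dilution_scale (the internal parameterisation may differ):

```python
import numpy as np
from scipy import stats

dilution_high = 0.75   # upper truncation bound (the maximum dilution)
dilution_scale = 0.1   # scale of the half normal

# half normal truncated at dilution_high: scipy's truncnorm takes the
# truncation bounds in units of the scale
rng = np.random.default_rng(42)
a, b = 0.0, dilution_high / dilution_scale
dist = stats.truncnorm(a, b, loc=0.0, scale=dilution_scale)
samples = dist.rvs(size=10_000, random_state=rng)
```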
test_dilution_parameters(model_name, nplot, test=kstest, **kwargs)#

perform a test on the dilution parameters for the best n results

The test can be any test that takes two samples, e.g. scipy.stats.kstest, the Mann-Whitney U test, wasserstein_distance_nd, etc.

Parameters:
  • model_name – str, the model name

  • nplot – number of results to use (int) or fraction of results to use (float, nplot<=1)

  • test – function, the test to run on the dilution parameters

  • kwargs – passed to the test function

Returns:

{site:KstestResult, …} for each site

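A sketch of the underlying two-sample comparison, using scipy.stats.kstest on made-up prior and posterior dilution samples (the solver extracts these samples for you):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# made-up dilution samples: draws from the prior vs the retained posterior
prior_samples = np.abs(rng.normal(0.0, 0.1, size=2000))
posterior_samples = np.abs(rng.normal(0.0, 0.1, size=2000))

# two-sample KS test, as in test_dilution_parameters with test=kstest
result = stats.kstest(posterior_samples, prior_samples)
# a small statistic / large p-value means the posterior still resembles the
# prior, i.e. the data have not constrained the dilution parameter much
```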
load_npz_source_receptor(npzfile, interpolate_source=True, convert_to_datetime=False)#

Load the source and receptor data from a npz file (self serialized by BASESolver.get_source_receptor_data)

Parameters:
  • npzfile – path to file

  • interpolate_source – bool to interpolate the source data

  • convert_to_datetime – bool to convert the source and receptor points to datetime objects instead of years since start date

Returns:

receptor_idvs: receptor identification variables (ndarray, shape=(nreceptor_points,); values 1:nsims, with 1 being the most probable simulation)

Returns:

source_data: source concentration predictions (ndarray, shape=(nsims, nsource_points); interpolated to self.precision if interpolate_source=True)

Returns:

source_points: source concentration timings in years since the start date (ndarray, shape=(nsource_points,), where nsource_points depends on self.n_inflections, project_forward, and interpolation; interpolated to self.precision if interpolate_source=True)

Returns:

outreceptor_data: receptor concentration predictions (ndarray, shape=(nsims, nreceptor_points))

Returns:

receptor_points: receptor timings in years since the start date (ndarray, shape=(nreceptor_points,))
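The layout described above can be illustrated with plain numpy; the key names and shapes here are assumptions based on the return description, not the exact serialized keys:

```python
import tempfile
from pathlib import Path

import numpy as np

# mimic a serialized source/receptor file with nsims ranked simulations
nsims, nsource, nrecept = 50, 120, 24
with tempfile.TemporaryDirectory() as td:
    npzfile = Path(td) / 'source_receptor.npz'
    np.savez(
        npzfile,
        source_data=np.random.rand(nsims, nsource),        # source predictions
        source_points=np.linspace(0.0, 60.0, nsource),     # years since start date
        outreceptor_data=np.random.rand(nsims, nrecept),   # receptor predictions
        receptor_points=np.linspace(50.0, 60.0, nrecept),  # years since start date
    )
    # load_npz_source_receptor wraps a read like this, plus the optional
    # source interpolation and datetime conversion
    with np.load(npzfile) as data:
        keys = sorted(data.keys())
        source_shape = data['source_data'].shape
```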