Caching data in nlmod
O.N. Ebbens, Artesia, 2021
Groundwater flow models are often data-intensive. Execution times can be shortened
significantly by caching data. This notebooks explains how this caching is implemented
in nlmod. The first three sections explain how to use the caching in nlmod. The last
section contains more technical details on the implementation and limitations of
caching in nlmod.
import os
import xarray as xr
import nlmod
nlmod.util.get_color_logger("INFO")
nlmod.show_versions()
Python version : 3.11.14
NumPy version : 2.4.4
Xarray version : 2026.4.0
Matplotlib version : 3.10.9
Flopy version : 3.10.0
nlmod version : 0.11.3dev
Cache directory
When you create a model you usually start by assigning a model workspace. This is a directory where model data is stored. The nlmod.util.get_model_dirs() function can be used to create a file structure in two steps:
The model workspace directory is created if it does not exist yet.
Two subdirectories are created: ‘figure’ and ‘cache’.
Calling the function below we create the figdir and cachedir variables with the paths of the subdirectories. In this notebook we will use this cachedir to write and read cached data. It is possible to define your own cache directory.
model_ws = "05_caching"
# Model directories
figdir, cachedir = nlmod.util.get_model_dirs(model_ws)
print(model_ws)
print(figdir)
print(cachedir)
05_caching
05_caching/figure
05_caching/cache
Caching
In nlmod you can use the get_combined_layer_models function to obtain a layer model based on regis.
layer_model = nlmod.read.regis.get_combined_layer_models(
extent=[95000.0, 105000.0, 494000.0, 500000.0], use_geotop=False
)
WARNING:nlmod.dims.layers.remove_layer_dim_from_top:Botm of layer is not equal to top of deeper layer in 8 cells
As you may notice, this function takes some time to complete because the data is downloaded and projected on the desired model grid. Everytime you run this function you have to wait for the process to finish which results in an unhealthy number of coffee breaks. This is why we use caching. To store our cache we use netCDF files. The layer_model variable is an xarray.Dataset. You can read/write an xarray.Dataset to/from a NetCDF file using the code below.
# write netcdf with layer model data
layer_model.to_netcdf(os.path.join(cachedir, "layer_test.nc"))
# read netcdf with layer model data
layer_model_from_cache = xr.open_dataset(
os.path.join(cachedir, "layer_test.nc"), mask_and_scale=False, decode_coords="all"
)
# compare cache with original
assert layer_model_from_cache.equals(layer_model)
Reading and writing netcdf files is the main principle behind caching in nlmod. The output is written to a NetCDF file when the get_combined_layer_models function is called. The next function call the cached NetCDF file is read instead of running the function again. This reduces exuction time signficantly. You can simply use this caching abilities by specifying a cachedir and a cachename in the function call.
layer_model2 = nlmod.read.regis.get_combined_layer_models(
extent=[95000.0, 105000.0, 494000.0, 500000.0],
use_geotop=False,
cachedir=cachedir,
cachename="combined_layer_ds.nc",
)
WARNING:nlmod.dims.layers.remove_layer_dim_from_top:Botm of layer is not equal to top of deeper layer in 8 cells
INFO:nlmod.cache.wrapper:caching data -> combined_layer_ds.nc
Caching steps
The netCDF caching is applied to a number of functions in nlmod that have an xarray dataset as output. When you call these functions using the cachedir and cachename arguments the following steps are taken:
See if there is a netCDF file with the specified cachename in the specified cache directory. If the file exists go to step 2, otherwise go to step 3.
Read the netCDF file and return as an xarray dataset if:
The cached dataset was created using the same function arguments as the current function call.
The module where the function is defined has not been changed after the cache was created.
Run the function to obtain an xarray dataset. Save this dataset as a netCDF file, using the specified cachename and cache directory, for next time. Also return the dataset.
This is the flowchart of an ordinary function call:

This is the flowchart of a function call using the caching from nlmod:

Caching functions
The following functions use the caching as described above:
nlmod.read.regis.get_combined_layer_modelsnlmod.read.regis.download_regisnlmod.read.rws.download_surface_waternlmod.read.rws.download_northseanlmod.read.knmi.get_rechargenlmod.read.jarkus.download_bathymetrynlmod.read.geotop.download_geotopnlmod.read.ahn.download_ahn
Checking the cache
One of the steps in the caching process is to check if the cache was created using the same function arguments as the current function call. This check has some limitations:
Only function arguments with certain types are checked. These types include: int, float, bool, str, bytes, list, tuple, dict, numpy.ndarray, pahtlib.PurePath, xarray.DataArray, pandas.DataFrame, pandas.Series, xarray.Dataset, flopy.mf6.ModflowGwf and flopy.utils.gridintersect.GridIntersect. If a function argument has a different type the cache is never used. In future development more types may be added to the checks.
If one of the function arguments is an xarray Dataset the check is somewhat different. Not all the data in a dataset it checked as it would take to much time. Therefore we use the
nlmod.cache.ds_containsfunction to specify which coordinates, data variables and attributes should be checked. The arguments for this are specified in the cache decorator. Users of nlmod do not have to worry about these.It is not possible to cache the results of a function with more than one xarray Dataset as an argument. This is due to the difference in checking datasets. If more than one xarray dataset is given the cache decoraters raises a TypeError.
If one of the function arguments is a filepath of type str we only check if the cached filepath is the same as the current filepath. We do not check if any changes were made to the file after the cache was created.
You can test how the caching works in different situations by running the function below a few times with different function arguments. The logs provide some information about using the cache or not.
# layer model
layer_model = nlmod.read.regis.get_combined_layer_models(
extent=[95000.0, 105000.0, 494000.0, 500000.0],
use_geotop=False,
cachename="combined_layer_ds.nc",
cachedir=cachedir,
)
layer_model
INFO:nlmod.cache.wrapper:using cached data -> combined_layer_ds.nc
<xarray.Dataset> Size: 3MB
Dimensions: (y: 60, x: 100, layer: 40)
Coordinates:
* y (y) float64 480B 5e+05 4.998e+05 ... 4.942e+05 4.94e+05
* x (x) float64 800B 9.505e+04 9.515e+04 ... 1.048e+05 1.05e+05
* layer (layer) <U6 960B 'HLc' 'BXz4' 'KRz3' ... 'OOz2' 'OOc' 'BRk1'
spatial_ref int64 8B 0
Data variables:
top (y, x) float32 24kB nan nan nan nan nan ... 2.29 1.49 0.65 0.37
botm (layer, y, x) float32 960kB nan nan nan ... -780.4 -780.7
kh (layer, y, x) float32 960kB nan nan nan nan ... nan nan nan nan
kv (layer, y, x) float32 960kB nan nan nan ... 0.002 0.002 0.002
Attributes: (12/41)
references: https://www.dinoloket.nl/regis-ii-het-hydr...
Conventions: CF-1.7
creator_url: https://www.dinoloket.nl
keywords_vocabulary: NASA/GCMD Earth Science Keywords. Version 6.0
acknowledgment: https://www.dinoloket.nl
project: REGIS v02r2s3
... ...
geospatial_vertical_min: -1235.92
geospatial_vertical_max: 322.75
geospatial_vertical_units: m-NAP
geospatial_vertical_positive: up
gridtype: structured
extent: [ 95000. 105000. 494000. 500000.]Clearing the cache
Sometimes you want to get rid of all the cached files to free disk space or to support your minimalistic lifestyle. You can use the clear_cache function to clear all cached files in a specific cache directory. Note: when calling this function an extra confirmation [Y/N] is required to actually delete all the cache files.
# nlmod.cache.clear_cache(cachedir)
Apply caching to a function
This section shows how to implement caching for a new function. Therefore we have a simple function that takes some input and returns an xarray Dataset. We’ve put this function in the module cache_example.py so we can import it in this notebook. In the module we used the nlmod.cache.cache_netcdf decorator on the function.
from cache_example import func_to_create_a_dataset
Now we can call the function to create a simple Dataset.
func_to_create_a_dataset(10)
<xarray.Dataset> Size: 2kB
Dimensions: (x: 100)
Coordinates:
* x (x) float64 800B 1.0 2.0 3.0 4.0 5.0 ... 96.0 97.0 98.0 99.0 100.0
Data variables:
test (x) float64 800B 10.0 10.0 10.0 10.0 10.0 ... 10.0 10.0 10.0 10.0We can call func_to_create_a_dataset using the cachedir and cachename arguments, even though we haven’t defined these arguments in our function definition. The arguments were added by the decorator.
ds = func_to_create_a_dataset(10, cachedir=cachedir, cachename="example")
ds
INFO:nlmod.cache.wrapper:caching data -> example.nc
<xarray.Dataset> Size: 2kB
Dimensions: (x: 100)
Coordinates:
* x (x) float64 800B 1.0 2.0 3.0 4.0 5.0 ... 96.0 97.0 98.0 99.0 100.0
Data variables:
test (x) float64 800B 10.0 10.0 10.0 10.0 10.0 ... 10.0 10.0 10.0 10.0The caching logic described in this notebook is applied to the function call. Next time we call this function it returns the cached dataset.
ds = func_to_create_a_dataset(10, cachedir=cachedir, cachename="example")
ds.close()
ds
INFO:nlmod.cache.wrapper:using cached data -> example.nc
<xarray.Dataset> Size: 2kB
Dimensions: (x: 100)
Coordinates:
* x (x) float64 800B 1.0 2.0 3.0 4.0 5.0 ... 96.0 97.0 98.0 99.0 100.0
Data variables:
test (x) float64 800B 10.0 10.0 10.0 10.0 10.0 ... 10.0 10.0 10.0 10.0Or it creates a new cached dataset if our function arguments have been changed:
func_to_create_a_dataset(11, cachedir=cachedir, cachename="example")
ds
INFO:nlmod.cache._same_function_arguments:cache was created using different function argument: arg0, do not use cached data
INFO:nlmod.cache.wrapper:caching data -> example.nc
<xarray.Dataset> Size: 2kB
Dimensions: (x: 100)
Coordinates:
* x (x) float64 800B 1.0 2.0 3.0 4.0 5.0 ... 96.0 97.0 98.0 99.0 100.0
Data variables:
test (x) float64 800B 10.0 10.0 10.0 10.0 10.0 ... 10.0 10.0 10.0 10.0When we decorate a function the cachedir and cachename arguments are also added to the docstring as you can see below:
# show that the arguments cachedir and cachename are added to the docstring
?func_to_create_a_dataset
Technicalities and discussion
In nlmod we use a specific caching method called memoization. The memoization is implemented in the nlmod caching module. The cache_netcdf decorator function handles most of the magic for caching netcdf files. When the cache is created all function arguments are stored in a dictionary and saved (pickled) as a .pklz file. The check on function arguments (step 2A) is done by reading the pickle and comparing the output with the arguments of the current function call.
Notes
All function arguments are pickled and saved together with the netcdf file. If the function arguments use a lot of memory this process can be become slow. This should be taken into account when you decide whether to use caching.
Function arguments that cannot be pickled using the
picklemodule raise an error in the caching process.If one of the function arguments is an xarray Dataset the argument check is somewhat different. Not all the data in a dataset it checked as it would take to much time. Therefore we use the
nlmod.cache.ds_containsfunction to specify which coordinates, data variables and attributes should be checked. The arguments for this are specified in the cache decorator.There is a check to see if the module where the function is defined has been changed since the cache was created. This helps not to use the cache when changes are made to the function which output is cached. Unfortunately when the function uses other functions from different modules these other modules are not checked for recent changes.
The
cache_netcdfdecorator usesfunctools.wrapsand some homemade magic to add properties, such as the name and the docstring, of the original function to the decorated function. This assumes that the original function has a docstring with a “Returns” heading. If this is not the case the docstring is not modified.There are some additional options for caching in the NLMOD_CACHE_OPTIONS. The options are:
nc_hash: Save a hash of the stored netcdf file with the function arguments to ensure that the stored netcdf file and the pickled function arguments belong together. Default is Truedataset_coords_hash: Save a hash of the dataset coordinates with the function arguments and check against the dataset coordinates that are used to call the function. This is to ensure that the cached file was created using the same dataset coordinates as the current function call, default is True.dataset_data_vars_hash: Save a hash of the data variables with the function arguments and check against the dataset variables that are used to call the function. This is to ensure that the cached file was created using the same dataset variables as the current function call, default is True.explicit_dataset_coordinate_comparison: If thedataset_coords_hashanddataset_data_vars_hashdo not work (for example see this issue: https://github.com/gwmod/nlmod/issues/389) it is adviced to use an explicit check to see if the cached file was created using the same dataset coordinates as the current function call. This can be done with this settings, the default is False.
# show cache options
nlmod.config.NLMOD_CACHE_OPTIONS
{'nc_hash': True,
'dataset_coords_hash': True,
'dataset_data_vars_hash': True,
'explicit_dataset_coordinate_comparison': False}
# set cache option
nlmod.config.set_options(nc_hash=False)
Storing cache on disk
Many memoization methods use a hash of the function arguments as the filename. Thus creating multiple files for different function calls. The memoization in nlmod uses a user-defined filename (cachename) to store the cache. If the function is called with different arguments the previous cached file is overwritten. By not creating a new file for every unique set of function arguments we reduce the number of files and therefore the memory size on the disk. By saving the function output as netCDF file it is also possible to read the file seperately from the caching process. While this is not something you would often do it can help when debugging.