Caching data in nlmod

O.N. Ebbens, Artesia, 2021

Groundwater flow models are often data-intensive. Caching data can shorten execution times significantly. This notebook explains how caching is implemented in nlmod. The first three sections explain how to use the caching in nlmod. The last section contains more technical details on the implementation and limitations of caching in nlmod.

import os

import xarray as xr

import nlmod

nlmod.util.get_color_logger("INFO")
nlmod.show_versions()

Python version     : 3.11.14
NumPy version      : 2.4.4
Xarray version     : 2026.4.0
Matplotlib version : 3.10.9
Flopy version      : 3.10.0

nlmod version      : 0.11.3dev

Cache directory

When you create a model you usually start by assigning a model workspace. This is a directory where model data is stored. The nlmod.util.get_model_dirs() function can be used to create a file structure in two steps:

The model workspace directory is created if it does not exist yet.
Two subdirectories are created: ‘figure’ and ‘cache’.

Calling the function below creates the figdir and cachedir variables, which hold the paths of the subdirectories. In this notebook we use cachedir to write and read cached data. You can also define your own cache directory.

model_ws = "05_caching"

# Model directories
figdir, cachedir = nlmod.util.get_model_dirs(model_ws)

print(model_ws)
print(figdir)
print(cachedir)

05_caching
05_caching/figure
05_caching/cache
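To make the two steps concrete, the sketch below builds the same directory layout with the standard library only. It is not nlmod's implementation: the helper name make_model_dirs is hypothetical and only mirrors the behaviour described above.

```python
import os
import tempfile


def make_model_dirs(model_ws):
    """Create a workspace with 'figure' and 'cache' subdirectories (hypothetical helper)."""
    figdir = os.path.join(model_ws, "figure")
    cachedir = os.path.join(model_ws, "cache")
    # exist_ok=True makes the call safe to repeat; the workspace itself is
    # created as well if it does not exist yet
    os.makedirs(figdir, exist_ok=True)
    os.makedirs(cachedir, exist_ok=True)
    return figdir, cachedir


# use a temporary parent directory so the example leaves nothing behind
with tempfile.TemporaryDirectory() as tmp:
    ws = os.path.join(tmp, "05_caching")
    figdir_demo, cachedir_demo = make_model_dirs(ws)
    dirs_exist = os.path.isdir(figdir_demo) and os.path.isdir(cachedir_demo)
    print(dirs_exist)
```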

Caching

In nlmod you can use the get_combined_layer_models function to obtain a layer model based on REGIS.

layer_model = nlmod.read.regis.get_combined_layer_models(
    extent=[95000.0, 105000.0, 494000.0, 500000.0], use_geotop=False
)

WARNING:nlmod.dims.layers.remove_layer_dim_from_top:Botm of layer is not equal to top of deeper layer in 8 cells

As you may notice, this function takes some time to complete because the data is downloaded and projected onto the desired model grid. Every time you run this function you have to wait for the process to finish, which results in an unhealthy number of coffee breaks. This is why we use caching. We store the cache in netCDF files. The layer_model variable is an xarray.Dataset. You can write an xarray.Dataset to a netCDF file, and read it back, using the code below.

# write netcdf with layer model data
layer_model.to_netcdf(os.path.join(cachedir, "layer_test.nc"))

# read netcdf with layer model data
layer_model_from_cache = xr.open_dataset(
    os.path.join(cachedir, "layer_test.nc"), mask_and_scale=False, decode_coords="all"
)

# compare cache with original
assert layer_model_from_cache.equals(layer_model)

Reading and writing netCDF files is the main principle behind caching in nlmod. The output is written to a netCDF file when the get_combined_layer_models function is called with caching enabled. On the next function call the cached netCDF file is read instead of running the function again. This reduces execution time significantly. You can use this caching by specifying a cachedir and a cachename in the function call.

layer_model2 = nlmod.read.regis.get_combined_layer_models(
    extent=[95000.0, 105000.0, 494000.0, 500000.0],
    use_geotop=False,
    cachedir=cachedir,
    cachename="combined_layer_ds.nc",
)

WARNING:nlmod.dims.layers.remove_layer_dim_from_top:Botm of layer is not equal to top of deeper layer in 8 cells
INFO:nlmod.cache.wrapper:caching data -> combined_layer_ds.nc

Caching steps

The netCDF caching is applied to a number of functions in nlmod that have an xarray dataset as output. When you call these functions using the cachedir and cachename arguments the following steps are taken:

1. See if there is a netCDF file with the specified cachename in the specified cache directory. If the file exists go to step 2, otherwise go to step 3.
2. Read the netCDF file and return it as an xarray dataset if:
   A. The cached dataset was created using the same function arguments as the current function call.
   B. The module where the function is defined has not been changed after the cache was created.
   If either condition is not met, go to step 3.
3. Run the function to obtain an xarray dataset. Save this dataset as a netCDF file, using the specified cachename and cache directory, for next time. Also return the dataset.
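The steps above can be sketched as a small decorator. This is a simplified stand-in and not the nlmod code: the decorator name cache_pickle is hypothetical, the result is stored with pickle instead of netCDF so that the example runs without xarray, and the check on module changes is left out.

```python
import functools
import os
import pickle
import tempfile


def cache_pickle(func):
    """Hypothetical, simplified disk cache; pickle stands in for netCDF."""

    @functools.wraps(func)
    def wrapper(*args, cachedir=None, cachename=None, **kwargs):
        if cachedir is None or cachename is None:
            return func(*args, **kwargs)  # caching not requested

        fname_cache = os.path.join(cachedir, cachename)
        fname_args = fname_cache + ".args"
        call_args = {"args": args, "kwargs": kwargs}

        # step 1: look for a cached result and its stored arguments
        if os.path.exists(fname_cache) and os.path.exists(fname_args):
            with open(fname_args, "rb") as f:
                stored_args = pickle.load(f)
            # step 2 (argument check only): reuse the cache if arguments match
            if stored_args == call_args:
                with open(fname_cache, "rb") as f:
                    return pickle.load(f)

        # step 3: run the function, store result and arguments for next time
        result = func(*args, **kwargs)
        with open(fname_cache, "wb") as f:
            pickle.dump(result, f)
        with open(fname_args, "wb") as f:
            pickle.dump(call_args, f)
        return result

    return wrapper


calls = []  # records every real execution of the function


@cache_pickle
def slow_square(x):
    calls.append(x)
    return x * x


with tempfile.TemporaryDirectory() as tmp:
    first = slow_square(4, cachedir=tmp, cachename="square.pkl")  # runs the function
    second = slow_square(4, cachedir=tmp, cachename="square.pkl")  # read from cache
    third = slow_square(5, cachedir=tmp, cachename="square.pkl")  # new arguments: rerun
```

After these three calls the list calls contains only the two real executions; the second call was served from disk.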

This is the flowchart of an ordinary function call:
[Figure: flowchart of an ordinary function call]

This is the flowchart of a function call using the caching from nlmod:
[Figure: flowchart of a function call using the nlmod cache]

Caching functions

The following functions use the caching as described above:

nlmod.read.regis.get_combined_layer_models
nlmod.read.regis.download_regis
nlmod.read.rws.download_surface_water
nlmod.read.rws.download_northsea
nlmod.read.knmi.get_recharge
nlmod.read.jarkus.download_bathymetry
nlmod.read.geotop.download_geotop
nlmod.read.ahn.download_ahn

Clearing the cache

Sometimes you want to get rid of all the cached files to free disk space or to support your minimalistic lifestyle. You can use the clear_cache function to clear all cached files in a specific cache directory. Note: when calling this function an extra confirmation [Y/N] is required to actually delete all the cache files.

# nlmod.cache.clear_cache(cachedir)

Technicalities and discussion

In nlmod we use a specific caching method called memoization. The memoization is implemented in the nlmod caching module. The cache_netcdf decorator function handles most of the magic for caching netCDF files. When the cache is created, all function arguments are stored in a dictionary and saved (pickled) as a .pklz file. The check on function arguments (step 2A) is done by reading the pickle and comparing its contents with the arguments of the current function call.

Notes

All function arguments are pickled and saved together with the netCDF file. If the function arguments use a lot of memory this process can become slow. This should be taken into account when you decide whether to use caching.
Function arguments that cannot be pickled using the pickle module raise an error in the caching process.
If one of the function arguments is an xarray Dataset the argument check is somewhat different. Not all the data in a dataset is checked, as that would take too much time. Therefore we use the nlmod.cache.ds_contains function to specify which coordinates, data variables and attributes should be checked. The arguments for this are specified in the cache decorator.
There is a check to see if the module where the function is defined has been changed since the cache was created. This helps to avoid using the cache after changes are made to the function whose output is cached. Unfortunately, when the function uses other functions from different modules, these other modules are not checked for recent changes.
The cache_netcdf decorator uses functools.wraps and some homemade magic to add properties, such as the name and the docstring, of the original function to the decorated function. This assumes that the original function has a docstring with a “Returns” heading. If this is not the case the docstring is not modified.
There are some additional options for caching in NLMOD_CACHE_OPTIONS. The options are:
- nc_hash: Save a hash of the stored netCDF file with the function arguments to ensure that the stored netCDF file and the pickled function arguments belong together. The default is True.
- dataset_coords_hash: Save a hash of the dataset coordinates with the function arguments and check it against the dataset coordinates that are used to call the function. This ensures that the cached file was created using the same dataset coordinates as the current function call. The default is True.
- dataset_data_vars_hash: Save a hash of the data variables with the function arguments and check it against the dataset variables that are used to call the function. This ensures that the cached file was created using the same dataset variables as the current function call. The default is True.
- explicit_dataset_coordinate_comparison: If dataset_coords_hash and dataset_data_vars_hash do not work (for an example see this issue: https://github.com/gwmod/nlmod/issues/389), it is advised to use an explicit check to see if the cached file was created using the same dataset coordinates as the current function call. This can be done with this setting. The default is False.
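The idea behind nc_hash can be illustrated with a short sketch. The helper file_hash is hypothetical, and the choice of SHA-256 is an assumption made for the example; which hash nlmod uses is not stated here. The file written below is a placeholder and not a real netCDF file.

```python
import hashlib
import os
import tempfile


def file_hash(fname, chunk_size=65536):
    """Return a SHA-256 hex digest of a file (the hash choice is an assumption)."""
    h = hashlib.sha256()
    with open(fname, "rb") as f:
        # read in chunks so large files do not have to fit in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    fname = os.path.join(tmp, "cache.nc")  # placeholder file, not a real netCDF
    with open(fname, "wb") as f:
        f.write(b"cached data")
    stored_hash = file_hash(fname)  # would be saved next to the function arguments

    # unchanged file: hashes agree, so file and arguments belong together
    hash_matches = file_hash(fname) == stored_hash

    # file replaced by something else: the mismatch is detected
    with open(fname, "wb") as f:
        f.write(b"other data")
    hash_matches_after_change = file_hash(fname) == stored_hash
```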

# show cache options
nlmod.config.NLMOD_CACHE_OPTIONS

{'nc_hash': True,
 'dataset_coords_hash': True,
 'dataset_data_vars_hash': True,
 'explicit_dataset_coordinate_comparison': False}

# set cache option
nlmod.config.set_options(nc_hash=False)
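Returning to the module check mentioned in the notes: one way to picture it is as a comparison of file modification times. The sketch below assumes that approach; the function cache_is_newer and the file names are hypothetical, and the timestamps are set by hand so that the result does not depend on timing.

```python
import os
import tempfile


def cache_is_newer(module_file, cache_file):
    """True if the cache was written after the module was last modified."""
    return os.path.getmtime(cache_file) > os.path.getmtime(module_file)


with tempfile.TemporaryDirectory() as tmp:
    module_file = os.path.join(tmp, "my_module.py")  # hypothetical module
    cache_file = os.path.join(tmp, "result.nc")  # hypothetical cache file
    for fname in (module_file, cache_file):
        with open(fname, "w") as f:
            f.write("")

    # explicit (atime, mtime) timestamps: the module is older than the cache
    os.utime(module_file, (1_600_000_000, 1_600_000_000))
    os.utime(cache_file, (1_600_001_000, 1_600_001_000))
    valid_before_edit = cache_is_newer(module_file, cache_file)

    # the module is edited after the cache was created
    os.utime(module_file, (1_600_002_000, 1_600_002_000))
    valid_after_edit = cache_is_newer(module_file, cache_file)
```

A check like this only looks at one file, which matches the limitation noted above: changes in other modules that the cached function depends on go unnoticed.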

Storing cache on disk

Many memoization methods use a hash of the function arguments as the filename, thus creating multiple files for different function calls. The memoization in nlmod uses a user-defined filename (cachename) to store the cache. If the function is called with different arguments the previous cached file is overwritten. By not creating a new file for every unique set of function arguments we reduce the number of files and therefore the disk space used. Because the function output is saved as a netCDF file, it is also possible to read the file separately from the caching process. While this is not something you would often do, it can help when debugging.
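The difference between the two naming strategies can be shown in a few lines. The helper hashed_cachename is hypothetical and represents the hash-based approach that nlmod does not use; the fixed name is the cachename used earlier in this notebook.

```python
import hashlib
import pickle


def hashed_cachename(func_name, kwargs):
    """Hypothetical: derive a file name from the function arguments."""
    # sort the items so the name does not depend on keyword order
    payload = pickle.dumps(sorted(kwargs.items()))
    digest = hashlib.sha256(payload).hexdigest()[:12]
    return f"{func_name}_{digest}.nc"


call_a = {"extent": (95000.0, 105000.0, 494000.0, 500000.0), "use_geotop": False}
call_b = {"extent": (95000.0, 105000.0, 494000.0, 500000.0), "use_geotop": True}

# hash-based naming: every distinct set of arguments gets its own file
name_a = hashed_cachename("get_combined_layer_models", call_a)
name_b = hashed_cachename("get_combined_layer_models", call_b)

# fixed naming, as with a user-defined cachename: both calls share one file,
# so the second call overwrites the result of the first
fixed_a = fixed_b = "combined_layer_ds.nc"
```

With hash-based names the cache directory grows with every new combination of arguments, while a fixed cachename keeps exactly one file per cached function call site.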