{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "# Caching data in nlmod\n", "\n", "*O.N. Ebbens, Artesia, 2021*\n", "\n", "Groundwater flow models are often data-intensive. Execution times can be shortened\n", "significantly by caching data. This notebooks explains how this caching is implemented\n", "in `nlmod`. The first three sections explain how to use the caching in nlmod. The last\n", "section contains more technical details on the implementation and limitations of\n", "caching in nlmod." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "import xarray as xr\n", "\n", "import nlmod" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nlmod.util.get_color_logger(\"INFO\")\n", "nlmod.show_versions()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cache directory\n", "\n", "When you create a model you usually start by assigning a model workspace. This is a directory where model data is stored. The `nlmod.util.get_model_dirs()` function can be used to create a file structure in two steps:\n", "1. The model workspace directory is created if it does not exist yet. \n", "2. Two subdirectories are created: 'figure' and 'cache'. \n", "\n", "Calling the function below we create the `figdir` and `cachedir` variables with the paths of the subdirectories. In this notebook we will use this `cachedir` to write and read cached data. It is possible to define your own cache directory." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_ws = \"05_caching\"\n", "\n", "# Model directories\n", "figdir, cachedir = nlmod.util.get_model_dirs(model_ws)\n", "\n", "print(model_ws)\n", "print(figdir)\n", "print(cachedir)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Caching\n", "\n", "In `nlmod` you can use the `get_combined_layer_models` function to obtain a layer model based on `regis`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "layer_model = nlmod.read.regis.get_combined_layer_models(\n", " extent=[95000.0, 105000.0, 494000.0, 500000.0], use_geotop=False\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you may notice, this function takes some time to complete because the data is downloaded and projected on the desired model grid. Everytime you run this function you have to wait for the process to finish which results in an unhealthy number of coffee breaks. This is why we use caching. To store our cache we use netCDF files. The `layer_model` variable is an `xarray.Dataset`. You can read/write an `xarray.Dataset` to/from a NetCDF file using the code below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# write netcdf with layer model data\n", "layer_model.to_netcdf(os.path.join(cachedir, \"layer_test.nc\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# read netcdf with layer model data\n", "layer_model_from_cache = xr.open_dataset(\n", " os.path.join(cachedir, \"layer_test.nc\"), mask_and_scale=False, decode_coords=\"all\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# compare cache with original\n", "assert layer_model_from_cache.equals(layer_model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reading and writing netcdf files is the main principle behind caching in `nlmod`. The output is written to a NetCDF file when the `get_combined_layer_models` function is called. The next function call the cached NetCDF file is read instead of running the function again. This reduces exuction time signficantly. You can simply use this caching abilities by specifying a `cachedir` and a `cachename` in the function call." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "layer_model2 = nlmod.read.regis.get_combined_layer_models(\n", " extent=[95000.0, 105000.0, 494000.0, 500000.0],\n", " use_geotop=False,\n", " cachedir=cachedir,\n", " cachename=\"combined_layer_ds.nc\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Caching steps\n", "\n", "The netCDF caching is applied to a number of functions in nlmod that have an xarray dataset as output. When you call these functions using the `cachedir` and `cachename` arguments the following steps are taken:\n", "\n", "1. See if there is a netCDF file with the specified cachename in the specified cache directory. If the file exists go to step 2, otherwise go to step 3.\n", "2. Read the netCDF file and return as an xarray dataset if:\n", " 1. The cached dataset was created using the same function arguments as the current function call. \n", " 2. The module where the function is defined has not been changed after the cache was created.\n", "3. Run the function to obtain an xarray dataset. Save this dataset as a netCDF file, using the specified cachename and cache directory, for next time. Also return the dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the flowchart of an ordinary function call:
\n", "![image function call](img/ordinary_function_call.png)\n", "\n", "This is the flowchart of a function call using the caching from nlmod:
\n", "![image cache function call](img/cache_function_call.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Caching functions\n", "\n", "The following functions use the caching as described above:\n", "\n", "- `nlmod.read.regis.get_combined_layer_models`\n", "- `nlmod.read.regis.download_regis`\n", "- `nlmod.read.rws.download_surface_water`\n", "- `nlmod.read.rws.download_northsea`\n", "- `nlmod.read.knmi.get_recharge`\n", "- `nlmod.read.jarkus.download_bathymetry`\n", "- `nlmod.read.geotop.download_geotop`\n", "- `nlmod.read.ahn.download_ahn`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Checking the cache\n", "One of the steps in the caching process is to check if the cache was created using the same function arguments as the current function call. This check has some limitations:\n", "\n", "- Only function arguments with certain types are checked. These types include: int, float, bool, str, bytes, list, tuple, dict, numpy.ndarray, pahtlib.PurePath, xarray.DataArray, pandas.DataFrame, pandas.Series, xarray.Dataset, flopy.mf6.ModflowGwf and flopy.utils.gridintersect.GridIntersect. If a function argument has a different type the cache is never used. In future development more types may be added to the checks.\n", "- If one of the function arguments is an xarray Dataset the check is somewhat different. Not all the data in a dataset it checked as it would take to much time. Therefore we use the `nlmod.cache.ds_contains` function to specify which coordinates, data variables and attributes should be checked. The arguments for this are specified in the cache decorator. Users of nlmod do not have to worry about these.\n", "- It is not possible to cache the results of a function with more than one xarray Dataset as an argument. This is due to the difference in checking datasets. If more than one xarray dataset is given the cache decoraters raises a TypeError.\n", "- If one of the function arguments is a filepath of type str we only check if the cached filepath is the same as the current filepath. We do not check if any changes were made to the file after the cache was created.\n", "\n", "You can test how the caching works in different situations by running the function below a few times with different function arguments. The logs provide some information about using the cache or not." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# layer model\n", "layer_model = nlmod.read.regis.get_combined_layer_models(\n", " extent=[95000.0, 105000.0, 494000.0, 500000.0],\n", " use_geotop=False,\n", " cachename=\"combined_layer_ds.nc\",\n", " cachedir=cachedir,\n", ")\n", "layer_model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clearing the cache\n", "\n", "Sometimes you want to get rid of all the cached files to free disk space or to support your minimalistic lifestyle. You can use the `clear_cache` function to clear all cached files in a specific cache directory. Note: when calling this function an extra confirmation [Y/N] is required to actually delete all the cache files." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# nlmod.cache.clear_cache(cachedir)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Apply caching to a function\n", "This section shows how to implement caching for a new function. Therefore we have a simple function that takes some input and returns an xarray Dataset. We've put this function in the module `cache_example.py` so we can import it in this notebook. In the module we used the `nlmod.cache.cache_netcdf` decorator on the function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from cache_example import func_to_create_a_dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can call the function to create a simple Dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "func_to_create_a_dataset(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can call `func_to_create_a_dataset` using the `cachedir` and `cachename` arguments, even though we haven't defined these arguments in our function definition. The arguments were added by the decorator." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ds = func_to_create_a_dataset(10, cachedir=cachedir, cachename=\"example\")\n", "ds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The caching logic described in this notebook is applied to the function call. Next time we call this function it returns the cached dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ds = func_to_create_a_dataset(10, cachedir=cachedir, cachename=\"example\")\n", "ds.close()\n", "ds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or it creates a new cached dataset if our function arguments have been changed:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "func_to_create_a_dataset(11, cachedir=cachedir, cachename=\"example\")\n", "ds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we decorate a function the cachedir and cachename arguments are also added to the docstring as you can see below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# show that the arguments cachedir and cachename are added to the docstring\n", "?func_to_create_a_dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Technicalities and discussion\n", "\n", "In nlmod we use a specific caching method called [memoization](https://en.wikipedia.org/wiki/Memoization). The memoization is implemented in the `nlmod` caching module. The `cache_netcdf` decorator function handles most of the magic for caching netcdf files. When the cache is created all function arguments are stored in a dictionary and saved (pickled) as a .pklz file. The check on function arguments (step 2A) is done by reading the pickle and comparing the output with the arguments of the current function call.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Notes\n", "\n", "1. All function arguments are pickled and saved together with the netcdf file. If the function arguments use a lot of memory this process can be become slow. This should be taken into account when you decide whether to use caching.\n", "2. Function arguments that cannot be pickled using the `pickle` module raise an error in the caching process.\n", "3. If one of the function arguments is an xarray Dataset the argument check is somewhat different. Not all the data in a dataset it checked as it would take to much time. Therefore we use the `nlmod.cache.ds_contains` function to specify which coordinates, data variables and attributes should be checked. The arguments for this are specified in the cache decorator.\n", "4. There is a check to see if the module where the function is defined has been changed since the cache was created. This helps not to use the cache when changes are made to the function which output is cached. Unfortunately when the function uses other functions from different modules these other modules are not checked for recent changes.\n", "5. The `cache_netcdf` decorator uses `functools.wraps` and some homemade magic to add properties, such as the name and the docstring, of the original function to the decorated function. This assumes that the original function has a docstring with a \"Returns\" heading. If this is not the case the docstring is not modified.\n", "6. There are some additional options for caching in the NLMOD_CACHE_OPTIONS. The options are:\n", " - `nc_hash`: Save a hash of the stored netcdf file with the function arguments to ensure that the stored netcdf file and the pickled function arguments belong together. Default is True\n", " - `dataset_coords_hash`: Save a hash of the dataset coordinates with the function arguments and check against the dataset coordinates that are used to call the function. This is to ensure that the cached file was created using the same dataset coordinates as the current function call, default is True.\n", " - `dataset_data_vars_hash`: Save a hash of the data variables with the function arguments and check against the dataset variables that are used to call the function. This is to ensure that the cached file was created using the same dataset variables as the current function call, default is True.\n", " - `explicit_dataset_coordinate_comparison`: If the `dataset_coords_hash` and `dataset_data_vars_hash` do not work (for example see this issue: https://github.com/gwmod/nlmod/issues/389) it is adviced to use an explicit check to see if the cached file was created using the same dataset coordinates as the current function call. This can be done with this settings, the default is False." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# show cache options\n", "nlmod.config.NLMOD_CACHE_OPTIONS" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# set cache option\n", "nlmod.config.set_options(nc_hash=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Storing cache on disk\n", "\n", "Many memoization methods use a hash of the function arguments as the filename. Thus creating multiple files for different function calls. The memoization in `nlmod` uses a user-defined filename (`cachename`) to store the cache. If the function is called with different arguments the previous cached file is overwritten. By not creating a new file for every unique set of function arguments we reduce the number of files and therefore the memory size on the disk. By saving the function output as netCDF file it is also possible to read the file seperately from the caching process. While this is not something you would often do it can help when debugging. " ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 4 }