use-case issues #8786
---
Hi @mullenkamp -- thanks for sharing your use case / pain points here. I gather you have struggled to get Xarray+Dask to work effectively for your use case. I'm curious if that is right and if you can elaborate a bit more on what problems you ran into there. In principle, it should be possible to use Xarray and Dask together to stream data from one source to another (including an HDF5 file). A few specific comments below:
Great! This was one of the things I hoped to share with you here, so I'm glad you already found it.
We typically point folks to one of two options here.

**1. Use Dask**

If your workflow allows, this is typically a great place to start. This will stream data into your output netcdf file chunk-by-chunk:

```python
ds = xr.open_dataset(..., chunks='auto')
ds_out = ds.sel(...)
ds_out.to_netcdf(...)
```

Note 1: my example above uses the auto chunking option. You can always override this if you believe your application would perform better with custom chunk sizes.

Note 2: this should work with the

**2. Use Zarr (with or without dask)**

Zarr is a great format for incremental writes. There are a few ways to do this with Xarray today. The first is to append along a single dimension:

```python
ds_out.to_zarr(store, append_dim='time')
```

The second option is to use region writes. These will write to arbitrary blocks within a Zarr array:

```python
ds_out.to_zarr(store, region={'time': slice(100, 200)})
```

Have you tried some of these?
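To make the region-write pattern a bit more concrete, here is a minimal sketch of how the two steps can fit together when streaming along a time dimension. The file names, variable layout, and block size are hypothetical, not from your setup; `compute=False` writes only the store metadata so the array data can be filled in block by block afterwards:

```python
import xarray as xr

# Open lazily; the chunking along "time" here is illustrative.
ds = xr.open_dataset("input.nc", chunks={"time": 100})

# Step 1: lay out the store by writing metadata only (no array data).
ds.to_zarr("output.zarr", mode="w", compute=False)

# Step 2: stream one block of time steps at a time into the store.
step = 100
for start in range(0, ds.sizes["time"], step):
    stop = min(start + step, ds.sizes["time"])
    block = ds.isel(time=slice(start, stop))
    # Region writes only accept variables that span the region dimension;
    # anything without "time" was already written in step 1, so drop it here.
    block = block.drop_vars(
        [v for v in block.variables if "time" not in block[v].dims]
    )
    block.to_zarr("output.zarr", region={"time": slice(start, stop)})
```

Because the metadata is written up front, each iteration only loads and writes one block, so peak memory stays at roughly one block's worth of data.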
There is a discussion about whether indexing should be lazy or eager, but I don't see Zarr itself doing much more than that. We're certainly not entertaining a dependency on any library like Dask at this time.
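For context on the eager behaviour mentioned above: indexing a plain Zarr array reads the requested slab immediately and returns a NumPy array, with no deferred task graph involved. A minimal sketch, assuming the standard zarr-python array-creation API (the path, shape, and chunks are hypothetical):

```python
import numpy as np
import zarr

# Create a small chunked on-disk array.
z = zarr.open(
    "example.zarr", mode="w",
    shape=(10_000, 500), chunks=(1_000, 500), dtype="float32",
)
z[:1_000, :] = np.ones((1_000, 500), dtype="float32")

# Indexing is eager: only the touched chunks are read, and a plain
# numpy array comes back right away.
block = z[:100, :]
print(type(block), block.shape)  # <class 'numpy.ndarray'> (100, 500)
```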
---
Hi,
I was asked by @jhamman to open a discussion about the issues I've had with certain use-cases that seem to fall outside of xarray's scope, as well as other aspects of xarray I struggle with.
**My normal use-case**
I deal with large meteorological and hydrological datasets. My typical procedure for processing/analysing data is: read in a large dataset, slice out the section of the data I want, do some processing/analysis on it, then write it out to another file that a colleague (or I) will use for subsequent processing/analysis. I may also have a bunch of smaller inputs on which I run a numeric model, subsequently creating a large dataset that I need to write to a file.
Since the input/output data is typically larger than my RAM, I need to iterate through the large slice of the data I've selected. By default, xarray will gradually cache all the data it loads into the original object unless the user makes a copy of it first. So to iterate through smaller chunks of the larger slice in xarray, I need to figure out how to make the smaller slices (likely based on how the data has been chunked in the hdf5/netcdf4 file), and then chain `.copy().load().values` onto each slice to ensure xarray doesn't gradually add all of that data to the cache. I recently learned that I can completely turn off the cache via the `xr.open_dataset` function, which is great!

But the bigger issue is writing out data iteratively. Please correct me if I'm wrong, but I've found no way to iteratively write to a dataset in chunks like I can via h5py. xarray (like pandas) needs to have the entire dataset in memory to write it out to a file. I know you can add more datasets to an existing netcdf4 file, but you cannot iteratively write to a file the way you can in h5py, as I would intuitively do when assigning values to a numpy array. I do get the sense that this limitation is more due to the python-netcdf4 library than to xarray. This is why I switched to h5py: it's been designed in a way that fits my normal use-case. The main disadvantage is that h5py is built around the more generic hdf5 structure, while I (and my colleagues) really prefer the more stringent netcdf (CF conventions) structure.
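For reference, the h5py pattern I mean looks roughly like this; the dataset name, shapes, and chunk sizes below are made up for illustration:

```python
import h5py
import numpy as np

n_times, n_sites = 100_000, 500   # hypothetical overall shape
step = 1_000                      # time steps processed per iteration

with h5py.File("output.h5", "w") as f:
    # Pre-allocate the full dataset on disk; values are filled in incrementally.
    dset = f.create_dataset(
        "precip",
        shape=(n_times, n_sites),
        chunks=(step, n_sites),
        dtype="float32",
        compression="gzip",
    )
    for start in range(0, n_times, step):
        stop = min(start + step, n_times)
        # Stand-in for reading/processing one slab of real data.
        block = np.random.rand(stop - start, n_sites).astype("float32")
        # Assign like a numpy array; only this slab is ever held in RAM.
        dset[start:stop, :] = block
```

The key point is that `create_dataset` reserves the full shape up front, so each assignment writes straight to disk without the whole array ever existing in memory.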
I've been trying out zarr every now and then, but from discussions I've recently had with the developers, they seem to be going the route of the Dask mindset, which feels more opaque and less straightforward for my use-case.
I'd love some feedback to know if I'm doing something wrong, as I'm sure I don't know all the possible functionality of xarray or other python packages.