use-case issues #8786
---
Hi @mullenkamp -- thanks for sharing your use case / pain points here. I gather you have struggled to get Xarray+Dask to work effectively for your use case. I'm curious if that is right and if you can elaborate a bit more on what problems you ran into there. In principle, it should be possible to use Xarray and Dask together to stream data from one source to another (including an HDF5 file). A few specific comments below:
Great! This was one of the things I hoped to share with you here, so I'm glad you already found it.
We typically point folks to one of two options here.

**1. Use Dask**

If your workflow allows, this is typically a great place to start. This will stream data into your output netcdf file chunk-by-chunk:

```python
ds = xr.open_dataset(..., chunks='auto')
ds_out = ds.sel(...)
ds_out.to_netcdf(...)
```

Note 1: my example above uses the auto chunking option. You can always override this if you believe your application would perform better with custom chunk sizes.

Note 2: this should work with the

**2. Use Zarr (with or without dask)**

Zarr is a great format for incremental writes. There are a few ways to do this with Xarray today. The first is to append along a single dimension:

```python
ds_out.to_zarr(store, append_dim='time')
```

The second option is to use region writes. These will write to arbitrary blocks within a Zarr array:

```python
ds_out.to_zarr(store, region={'time': slice(100, 200)})
```

Have you tried some of these?
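To make the region-write pattern a bit more concrete, here is a minimal sketch of how the two steps can fit together when streaming along a time dimension. The file names, variable layout, and block size are hypothetical, not from your setup; `compute=False` writes only the store metadata so the array data can be filled in block by block afterwards:

```python
import xarray as xr

# Open lazily; the chunking along "time" here is illustrative.
ds = xr.open_dataset("input.nc", chunks={"time": 100})

# Step 1: lay out the store by writing metadata only (no array data).
ds.to_zarr("output.zarr", mode="w", compute=False)

# Step 2: stream one block of time steps at a time into the store.
step = 100
for start in range(0, ds.sizes["time"], step):
    stop = min(start + step, ds.sizes["time"])
    block = ds.isel(time=slice(start, stop))
    # Region writes only accept variables that span the region dimension;
    # anything without "time" was already written in step 1, so drop it here.
    block = block.drop_vars(
        [v for v in block.variables if "time" not in block[v].dims]
    )
    block.to_zarr("output.zarr", region={"time": slice(start, stop)})
```

Because the metadata is written up front, each iteration only loads and writes one block, so peak memory stays at roughly one block's worth of data.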
There is a discussion about whether indexing should be lazy or eager, but I don't see Zarr itself doing much more than that. We're certainly not entertaining a dependency on any library like Dask at this time.
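For context on the eager behaviour mentioned above: indexing a plain Zarr array reads the requested slab immediately and returns a NumPy array, with no deferred task graph involved. A minimal sketch, assuming the standard zarr-python array-creation API (the path, shape, and chunks are hypothetical):

```python
import numpy as np
import zarr

# Create a small chunked on-disk array.
z = zarr.open(
    "example.zarr", mode="w",
    shape=(10_000, 500), chunks=(1_000, 500), dtype="float32",
)
z[:1_000, :] = np.ones((1_000, 500), dtype="float32")

# Indexing is eager: only the touched chunks are read, and a plain
# numpy array comes back right away.
block = z[:100, :]
print(type(block), block.shape)  # <class 'numpy.ndarray'> (100, 500)
```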
---
Hi,
I was asked by @jhamman to open a discussion about the issues I've had with certain use-cases that seem to fall outside of xarray's scope, as well as other aspects of xarray I struggle with.
**My normal use-case**
I deal with large meteorological and hydrological datasets. My typical procedure for processing/analysing data is: read in a large dataset, slice out the section of the data I want, do some processing/analysis on it, then write it out to another file that a colleague (or I) will use for subsequent processing/analysis. I may also have a bunch of smaller inputs on which I run a numeric model, subsequently creating a large dataset that I need to write to a file.
Since the input/output data is typically larger than my RAM, I need to iterate through the large slice of the data I've selected. By default, xarray will gradually cache all the data it loads into the original object unless the user makes a copy of it first. So to iterate through smaller chunks of the larger slice in xarray, I need to figure out how to make the smaller slices (likely based on how the data has been chunked in the hdf5/netcdf4 file), and then chain `.copy().load().values` onto each slice to ensure xarray doesn't gradually add all of that data to the cache. I recently learned that I can completely turn off the cache via the `xr.open_dataset` function, which is great!

But the bigger issue is writing out data iteratively. Please correct me if I'm wrong, but I've found no way to iteratively write to a dataset in chunks like I can via h5py. xarray (like pandas) needs to have the entire dataset in memory to write it out to a file. I know you can add more datasets to an existing netcdf4 file, but you cannot iteratively write to a file the way you can in h5py, as I would intuitively do when assigning values to a numpy array. I do get the sense that this limitation is more due to the python-netcdf4 library than to xarray. This is why I switched to h5py: it's been designed in a way that fits my normal use-case. The main disadvantage is that h5py is built around the more generic hdf5 structure, while I (and my colleagues) really prefer the more stringent netcdf (CF conventions) structure.
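For reference, the h5py pattern I mean looks roughly like this; the dataset name, shapes, and chunk sizes below are made up for illustration:

```python
import h5py
import numpy as np

n_times, n_sites = 100_000, 500   # hypothetical overall shape
step = 1_000                      # time steps processed per iteration

with h5py.File("output.h5", "w") as f:
    # Pre-allocate the full dataset on disk; values are filled in incrementally.
    dset = f.create_dataset(
        "precip",
        shape=(n_times, n_sites),
        chunks=(step, n_sites),
        dtype="float32",
        compression="gzip",
    )
    for start in range(0, n_times, step):
        stop = min(start + step, n_times)
        # Stand-in for reading/processing one slab of real data.
        block = np.random.rand(stop - start, n_sites).astype("float32")
        # Assign like a numpy array; only this slab is ever held in RAM.
        dset[start:stop, :] = block
```

The key point is that `create_dataset` reserves the full shape up front, so each assignment writes straight to disk without the whole array ever existing in memory.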
I've been trying out zarr every now and then, but from discussions I've recently had with the developers, they seem to be going the route of the Dask mindset, which feels more opaque and less straightforward for my use-case.
I'd love some feedback to know if I'm doing something wrong, as I'm sure I don't know all the possible functionality of xarray or other python packages.