should we recommend / illustrate / discuss the use of .nc compression, if not in trajan core, at least in an example? #144

jerabaul29 · 2024-11-14T15:27:11Z

This issue is motivated by the following comparison: the .nc file is the output of .to_netcdf(), and the .zip file is that same .nc file zipped from my file explorer:

$ ls -lrth dataset_trajectories_to_use.*
-rw-rw-r-- 1 jeanr jeanr 8,5M nov.  14 15:54 dataset_trajectories_to_use.zip
-rw-rw-r-- 1 jeanr jeanr 100M nov.  14 16:00 dataset_trajectories_to_use.nc

The zip is roughly 12 times smaller (8.5M vs 100M), so clearly the .nc I had was not effectively compressed at all...

Should this be discussed in an example, and/or should we provide a .to_netcdf() wrapper with compression settings that are reasonable for the typical use and needs encountered in trajan? Or do you think this is outside the scope of trajan?

For example, since our case is trajectory-focused, I guess it could be realistic to compress each variable independently per trajectory, so that we get a good compression factor while accessing any variable for a single trajectory stays fast (i.e. we only need to read and decompress the chunk that contains this variable for the corresponding trajectory).
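To make the idea concrete, here is a minimal sketch of how such a per-trajectory encoding could be built. The function name `per_trajectory_encoding` and the dimension name `trajectory` are assumptions for illustration, not existing trajan API; `zlib`, `complevel` and `chunksizes` are the encoding keys xarray accepts for the netCDF4 backend. The sketch treats every non-scalar variable the same way.

```python
from typing import Any, Dict


def per_trajectory_encoding(
    ds, traj_dim: str = "trajectory", complevel: int = 5
) -> Dict[str, Dict[str, Any]]:
    """Build an `encoding` dict with one compressed chunk per trajectory.

    `ds` is expected to behave like an xarray.Dataset: `ds.data_vars` maps
    variable names to objects exposing `.dims` and `.shape`.
    `traj_dim` is an assumed dimension name; adapt it to the dataset.
    """
    encoding: Dict[str, Dict[str, Any]] = {}
    for name, var in ds.data_vars.items():
        if len(var.dims) == 0:
            # Scalar variables have nothing to chunk; keep default settings.
            continue
        # Chunk length 1 along the trajectory dimension, full length elsewhere.
        chunksizes = tuple(
            1 if dim == traj_dim else max(1, size)
            for dim, size in zip(var.dims, var.shape)
        )
        encoding[name] = {
            "zlib": True,
            "complevel": complevel,
            "chunksizes": chunksizes,
        }
    return encoding
```

With an xarray dataset `ds`, this would be used as `ds.to_netcdf("out.nc", encoding=per_trajectory_encoding(ds))`. Variables that do not have the trajectory dimension end up as a single compressed chunk.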


gauteh · 2024-11-14T15:55:38Z

Yes, this is an annoying thing with xarray. I like examples and maybe there is a good way to do it, and probably there is xarray documentation. I personally use this:

https://github.com/gauteh/plz/blob/15300e4237c7071a670b8b7e8e6b101b01cab9b6/plz/xr.py#L72

Then I can do:

da.to_netcdf(encoding=plz.xr.nc_cmp(da))

jerabaul29 · 2024-11-14T16:09:07Z

nice, yes this is exactly what I had in mind regarding the way to compress :) I can add an example about this! :)

The question is, do we want to have this "just as an example", or as a default in trajan given that trajan is trajectory-focused which fits naturally well (I would be surprised if anyone complains about "per trajectory" variable compression)? What do you think?

gauteh · 2024-11-14T17:34:30Z

It is a bit tricky to make general, and it will not be the right choice if Trajan is used to generate model output, at least intermediate output. Maybe if it can be made generic?

knutfrode · 2024-11-14T17:48:50Z

What about making a wrapper of to_netcdf() (i.e. ds.traj.to_netcdf()) that makes typical (e.g. "per trajectory") chunking/compression by default?

Btw, sometimes (e.g. for simulated datasets) selecting a subset of time could be as relevant as selecting subsets of trajectories. Thus we could have simple options to determine chunking size per dimension.
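A rough sketch of what "chunking size per dimension" could look like is given below. The function name `chunked_encoding` and the `chunks` argument are hypothetical; the idea is only that the caller passes a chunk length per dimension name, and any dimension not listed is kept as one chunk spanning its full length.

```python
from typing import Any, Dict, Mapping


def chunked_encoding(
    ds, chunks: Mapping[str, int], complevel: int = 5
) -> Dict[str, Dict[str, Any]]:
    """Build an `encoding` dict from per-dimension chunk lengths.

    `ds` is expected to behave like an xarray.Dataset (`ds.data_vars`
    mapping names to objects with `.dims` and `.shape`).
    `chunks` maps dimension names to the requested chunk length.
    """
    encoding: Dict[str, Dict[str, Any]] = {}
    for name, var in ds.data_vars.items():
        if len(var.dims) == 0:
            # Scalar variables have nothing to chunk; keep default settings.
            continue
        chunksizes = []
        for dim, size in zip(var.dims, var.shape):
            # Dimensions not listed in `chunks` get one chunk of full length.
            requested = chunks.get(dim, size)
            # Clip to the dimension length and never go below 1.
            chunksizes.append(max(1, min(requested, size)))
        encoding[name] = {
            "zlib": True,
            "complevel": complevel,
            "chunksizes": tuple(chunksizes),
        }
    return encoding
```

For observation-like data one might pass `chunks={"trajectory": 1}`, while for simulated data something like `chunks={"trajectory": 1000, "time": 24}` would favour reading time slices. The dimension names and sizes here are only examples.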

jerabaul29 · 2024-11-15T08:52:51Z

I like the idea of .traj.to_netcdf().

I guess we could have several ways to go forward with this:

- an arg compression_kind="obs" or compression_kind="model" in the function call?
- a way to discover automatically (based on a custom attribute? based on some heuristics about the dataset?) what kind of compression settings would be best?

What do you think? :)

gauteh · 2024-11-15T08:57:41Z

I think if we make this method, it should forward almost everything to xarray.Dataset.to_netcdf(), so that we also support everything they support. We could have an encoding='default' argument which tries to solve this; otherwise everything is forwarded without modification.

We should ideally also support to_zarr.
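For the netCDF case, the forwarding behaviour could be sketched roughly as follows. This is an illustration under assumptions, not the trajan implementation: the free function `to_netcdf` stands in for a possible `ds.traj.to_netcdf()` method, and `_default_encoding` is a placeholder that only switches on zlib compression; a real default would probably also set chunk sizes.

```python
from typing import Any, Dict, Union


def _default_encoding(ds, complevel: int = 5) -> Dict[str, Dict[str, Any]]:
    """Placeholder default: zlib-compress every non-scalar data variable."""
    return {
        name: {"zlib": True, "complevel": complevel}
        for name, var in ds.data_vars.items()
        if len(var.dims) > 0
    }


def to_netcdf(ds, path, encoding: Union[str, dict, None] = "default", **kwargs):
    """Write `ds` to `path`, forwarding all other arguments to `ds.to_netcdf`.

    encoding="default" -> compute an encoding with `_default_encoding`.
    any dict or None   -> passed through to `ds.to_netcdf` unchanged.
    """
    if isinstance(encoding, str):
        if encoding != "default":
            raise ValueError(f"unknown encoding preset: {encoding!r}")
        encoding = _default_encoding(ds)
    # Everything else (mode, format, engine, ...) is forwarded untouched.
    return ds.to_netcdf(path, encoding=encoding, **kwargs)
```

Calling `to_netcdf(ds, "out.nc")` would then compress by default, while `to_netcdf(ds, "out.nc", encoding=None)` or an explicit encoding dict behaves exactly like the plain xarray call.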

jerabaul29 · 2024-12-03T08:29:29Z

actually, sorry, I wrote "closes" in the PR a bit too fast; there is still the question of zarr, and/or whether this could be automated at some point :) re-opening :)

gauteh · 2024-12-03T08:32:26Z

I think we can experiment with a to_netcdf and a to_zarr method; we should try to stay as close as possible to xarray.

jerabaul29 mentioned this issue Nov 28, 2024

Add example of compression, chunking by trajectory #152

Merged

gauteh closed this as completed in #152 Dec 3, 2024

jerabaul29 reopened this Dec 3, 2024
