should we recommend / illustrate / discuss the use of .nc compression, if not in trajan core, at least in an example? #144

Open
jerabaul29 opened this issue Nov 14, 2024 · 8 comments · Fixed by #152

Comments

@jerabaul29
Collaborator

This issue is motivated by the following: the .nc file is what I obtained from .to_netcdf(), and the .zip file is the result of zipping that .nc file in my file explorer:

$ ls -lrth dataset_trajectories_to_use.*
-rw-rw-r-- 1 jeanr jeanr 8,5M nov.  14 15:54 dataset_trajectories_to_use.zip
-rw-rw-r-- 1 jeanr jeanr 100M nov.  14 16:00 dataset_trajectories_to_use.nc

Clearly the .nc file I had was not effectively compressed at all: zipping it shrinks it from 100M to 8,5M...

Should this be discussed in some example, and / or should we provide a "reasonable zipping for our typical use / needs as encountered in trajan" .to_netcdf() wrapper, or do you think this is outside the scope of trajan?

I guess, for example, that in our trajectory-focused case it could be realistic to compress each variable independently for each trajectory, so that we get a good compression factor while access to any variable for a single trajectory stays fast (i.e. we only need to read and uncompress the chunk that contains this variable for the corresponding trajectory).
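
A minimal sketch of what such per-trajectory compression could look like through the `encoding` argument of xarray's `to_netcdf()`. It assumes the dataset has a dimension named `trajectory` and that a netCDF4-capable backend is used; the helper name `per_trajectory_encoding` and the default compression level are hypothetical and not part of trajan:

```python
def per_trajectory_encoding(ds, traj_dim="trajectory", complevel=4):
    """Build a to_netcdf() encoding dict with one compressed chunk per trajectory.

    `ds` is expected to behave like an xarray.Dataset. The dimension name
    `traj_dim` and the default `complevel` are assumptions for illustration.
    """
    encoding = {}
    for name, var in ds.data_vars.items():
        # Only plain numeric/boolean, non-empty, non-scalar variables are
        # handled here; anything else (strings, datetimes, ...) is left to
        # xarray's defaults.
        if not var.dims or 0 in var.shape or var.dtype.kind not in "iufb":
            continue
        # Chunk length 1 along the trajectory dimension, full length elsewhere.
        chunks = tuple(
            1 if dim == traj_dim else size
            for dim, size in zip(var.dims, var.shape)
        )
        encoding[name] = {
            "zlib": True,
            "complevel": complevel,
            "shuffle": True,
            "chunksizes": chunks,
        }
    return encoding


# Usage, assuming `ds` is an xarray.Dataset:
# ds.to_netcdf("trajectories.nc", encoding=per_trajectory_encoding(ds))
```

With this layout, reading one variable for one trajectory only touches a single compressed chunk, at the cost of slower access when slicing across many trajectories.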

@gauteh
Member

gauteh commented Nov 14, 2024

Yes, this is an annoying thing with xarray. I like examples and maybe there is a good way to do it, and probably there is xarray documentation. I personally use this:

https://github.com/gauteh/plz/blob/15300e4237c7071a670b8b7e8e6b101b01cab9b6/plz/xr.py#L72

Then I can do:

da.to_netcdf(encoding=plz.xr.nc_cmp(da))

@jerabaul29
Collaborator Author

nice, yes this is exactly what I had in mind regarding the way to compress :) I can add an example about this! :)

The question is, do we want to have this "just as an example", or as a default in trajan, given that trajan is trajectory-focused, which fits naturally well (I would be surprised if anyone complained about "per trajectory" variable compression)? What do you think?

@gauteh
Member

gauteh commented Nov 14, 2024

It is a bit tricky to make general, and it will not be the right choice if Trajan is used to generate model output, at least intermediate output. Can it be made generic?

@knutfrode
Contributor

What about making a wrapper of to_netcdf() (i.e. ds.traj.to_netcdf()) that makes typical (e.g. "per trajectory") chunking/compression by default?

Btw, sometimes (e.g. for simulated datasets) selecting a subset of time could be as relevant as selecting subsets of trajectories. Thus we could have simple options to determine chunking size per dimension.
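
A rough sketch of how such per-dimension chunking options could be expressed, assuming the chunk lengths are passed as a plain dict keyed by dimension name. The helper name `chunked_encoding`, the dimension names in the usage comments, and the default compression level are all hypothetical:

```python
def chunked_encoding(ds, chunks, complevel=4):
    """Build a to_netcdf() encoding dict from per-dimension chunk lengths.

    `ds` is expected to behave like an xarray.Dataset. `chunks` maps dimension
    names to the desired chunk length; dimensions not listed are kept whole.
    """
    encoding = {}
    for name, var in ds.data_vars.items():
        # Only plain numeric/boolean, non-empty, non-scalar variables are
        # handled; everything else is left to xarray's defaults.
        if not var.dims or 0 in var.shape or var.dtype.kind not in "iufb":
            continue
        # Clip each requested chunk length to the actual dimension size.
        chunksizes = tuple(
            min(chunks.get(dim, size), size)
            for dim, size in zip(var.dims, var.shape)
        )
        encoding[name] = {
            "zlib": True,
            "complevel": complevel,
            "chunksizes": chunksizes,
        }
    return encoding


# Observation-like data: one trajectory per chunk.
# ds.to_netcdf("obs.nc", encoding=chunked_encoding(ds, {"trajectory": 1}))
# Simulation-like data: also chunk along time, so time subsets stay cheap.
# ds.to_netcdf("sim.nc", encoding=chunked_encoding(ds, {"trajectory": 100, "time": 24}))
```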

@jerabaul29
Collaborator Author

I like the idea of .traj.to_netcdf() .

I guess we could have several ways to go forward with this:

  • an arg compression_kind="obs" or compression_kind="model" in the function call?
  • a way to discover automatically (based on a custom attribute? based on some heuristics about the dataset?) what kind of compression settings would be best?

What do you think? :)

@gauteh
Member

gauteh commented Nov 15, 2024

I think if we make this method, it should forward almost everything to xarray.Dataset.to_netcdf(), so that we also support everything they support. Either we have an encoding='default' argument which tries to solve this, or everything is forwarded without modification.
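
A minimal sketch of such a forwarding wrapper, assuming the sentinel string `"default"` is what triggers the built-in compression settings. The names `traj_to_netcdf` and `default_encoding`, and the compression settings themselves, are hypothetical and only illustrate the idea:

```python
def default_encoding(ds, complevel=4):
    """Hypothetical default: zlib-compress every numeric/boolean data variable."""
    return {
        name: {"zlib": True, "complevel": complevel}
        for name, var in ds.data_vars.items()
        if var.dtype.kind in "iufb"
    }


def traj_to_netcdf(ds, *args, encoding="default", **kwargs):
    """Forward to ds.to_netcdf(), only filling in `encoding` when asked to.

    `ds` is expected to behave like an xarray.Dataset. Any other value of
    `encoding` (a dict, or None) is passed through unchanged, so everything
    the underlying to_netcdf() supports remains available.
    """
    if isinstance(encoding, str) and encoding == "default":
        encoding = default_encoding(ds)
    # `encoding` is always passed by keyword here, so callers should not also
    # pass it positionally through *args.
    return ds.to_netcdf(*args, encoding=encoding, **kwargs)


# Usage, assuming `ds` is an xarray.Dataset:
# traj_to_netcdf(ds, "out.nc")                      # default compression
# traj_to_netcdf(ds, "out.nc", encoding=None)       # plain xarray behaviour
# traj_to_netcdf(ds, "out.nc", encoding={"lon": {"zlib": False}})
```

Keeping the wrapper this thin means the behaviour stays identical to xarray's as soon as the caller supplies their own `encoding`.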

We should ideally also support to_zarr.

@jerabaul29
Collaborator Author

actually, sorry I wrote "closes" in the PR a bit too fast; there is still the question with zarr, and / or if this could be made automated at some point :) . re-opening :) .

@jerabaul29 jerabaul29 reopened this Dec 3, 2024
@gauteh
Member

gauteh commented Dec 3, 2024

I think we can experiment with a to_netcdf and a to_zarr method; we should try to stay as close as possible to xarray.

3 participants