Add example of compression, chunking by trajectory #152

Merged · 4 commits · Dec 3, 2024
66 changes: 66 additions & 0 deletions examples/example_compress.py
"""
Examples of compressing data when saving to .nc
==============================================================================
"""

# %%

import xarray as xr
from trajan.readers.omb import read_omb_csv
from pathlib import Path
import os

# %%

path_to_test_data = Path.cwd().parent / "tests" / "test_data" / "csv" / "omb_large.csv"
xr_buoys = read_omb_csv(path_to_test_data)

# %%

# by default, to_netcdf does not perform any compression
xr_buoys.to_netcdf("no_compression.nc")

# on my machine, this is around 33MB
print(f"size no compression: {round(os.stat('no_compression.nc').st_size/(pow(1024,2)), 2)} MB")

# %%

# compression can be enabled by passing an explicit per-variable encoding.
# note that the best way to compress may depend on your dataset, the access
# pattern you want to be fastest, etc. - be aware of memory layout and
# performance!

# a simple compression scheme, on a per-trajectory basis: each trajectory is
# stored as a single chunk. This makes it fast to retrieve one full
# trajectory, but slow to retrieve e.g. the 5th point of all trajectories.

# choose the encoding chunking - this may be application dependent, here
# chunk trajectory as a whole
def generate_chunksize(var):
    """Chunk sizes for `var`: one chunk per trajectory."""
    dims = xr_buoys[var].dims
    shape = list(xr_buoys[var].shape)

    # store each trajectory as its own chunk; a variable without a
    # trajectory dimension keeps its full shape as a single chunk
    if "trajectory" in dims:
        shape[dims.index("trajectory")] = 1

    return tuple(shape)


# set the encoding for each variable
encoding = {
    var: {"zlib": True, "complevel": 5, "chunksizes": generate_chunksize(var)}
    for var in xr_buoys.data_vars
}

# the encoding looks like:
for var in encoding:
print(f"{var}: {encoding[var] = }")
print("")

# save, this time with compression
xr_buoys.to_netcdf("trajectory_compression.nc", encoding=encoding)

# on my machine, this is around 5.6MB
print(f"size with compression: {round(os.stat('trajectory_compression.nc').st_size/(pow(1024,2)), 2)} MB")

# %%
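As a counterpoint to the per-trajectory chunking above, here is a minimal sketch of the inverse layout: one chunk per observation index across all trajectories, which makes "the 5th point of every trajectory" a single-chunk read at the cost of slower whole-trajectory reads. The helper name `chunk_per_obs` and the `"obs"` dimension name are illustrative assumptions; it operates on plain `(dims, shape)` pairs so it can be tried without loading a dataset.

```python
# Hypothetical counterpart to generate_chunksize: reduce the "obs"
# (observation) dimension to 1 instead of the trajectory dimension.
# Each chunk then holds one observation index across all trajectories.

def chunk_per_obs(dims, shape):
    """Chunk sizes for a variable: one chunk per observation index."""
    shape = list(shape)

    # a variable without an obs dimension keeps its full shape as one chunk
    if "obs" in dims:
        shape[dims.index("obs")] = 1

    return tuple(shape)


# example: a (trajectory, obs) variable with 10 trajectories of 500 points
print(chunk_per_obs(("trajectory", "obs"), (10, 500)))  # -> (10, 1)
```

Such chunk sizes would be passed through the same `encoding` dict as above; which layout wins depends entirely on the dominant access pattern.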