
Commit

deploy: 2a4422e
pl-marasco committed Nov 3, 2023
1 parent a5bfadf commit 0d3b50f
Showing 7 changed files with 1,673 additions and 1,946 deletions.
32 changes: 15 additions & 17 deletions _sources/part3/chunking_introduction.ipynb
@@ -15,11 +15,10 @@
"source": [
"## Authors & Contributors\n",
"### Authors\n",
"- Tina Odaka, UMR-LOPS Ifremer (France), [@tinaok](https://github.com/tinaok)\n",
"- Tina Odaka, Ifremer (France), [@tinaok](https://github.com/tinaok)\n",
"- Pier Lorenzo Marasco, Ispra (Italy), [@pl-marasco](https://github.com/pl-marasco)\n",
"\n",
"### Contributors\n",
"- Alejandro Coca-Castro, The Alan Turing Institure, [acocac](https://github.com/acocac)\n",
"- Anne Fouilloux, Simula Research Laboratory (Norway), [@annefou](https://github.com/annefou)\n",
"- Guillaume Eynard-Bontemps, CNES (France), [@guillaumeeb](https://github.com/guillaumeeb)\n",
"\n"
@@ -31,7 +30,7 @@
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"<i class=\"fa-question-circle fa\" style=\"font-size: 22px;color:#666;\"></i><b>Overview</b>\n",
"<i class=\"fa-question-circle fa\" style=\"font-size: 22px;color:#666;\"></i> Overview\n",
" <br>\n",
" <br>\n",
" <b>Questions</b>\n",
@@ -57,7 +56,7 @@
"\n",
"When dealing with large data files or collections, it's often impossible to load all the data you want to analyze into a single computer's RAM at once. This is a situation where the Pangeo ecosystem can help you a lot. Xarray offers the possibility to work lazily on data __chunks__, which means pieces of an entire dataset. By reading a dataset in __chunks__ we can process our data piece by piece on a single computer and even on a distributed computing cluster using Dask (Cloud or HPC for instance).\n",
"\n",
"How we will process these 'chunks' in a parallel environment will be discussed in [the Scaling with Dask](./scaling_dask.ipynb). The concept of __chunk__ will be explained here.\n",
"How we will process these 'chunks' in a parallel environment will be discussed in [dask_introduction](./dask_introduction.ipynb). The concept of __chunk__ will be explained here.\n",
"\n",
"When we process our data piece by piece, it's easier to have our input or ouput data also saved in __chunks__. [Zarr](https://zarr.readthedocs.io/en/stable/) is the reference library in the Pangeo ecosystem to save our Xarray multidimentional datasets in __chunks__.\n",
"\n",
Expand Down Expand Up @@ -378,7 +377,7 @@
"source": [
"` test.data` is the backend array Python representation of Xarray's Data Array, [__Dask Array__](https://docs.dask.org/en/stable/array.html) when using chunking, Numpy by default.\n",
"\n",
"We will introduce Dask arrays and Dask graphs visualization in the next section [Scaling with Dask](./scaling_dask.ipynb)."
"We will introduce Dask arrays and Dask graphs visualization in the next section [dask_introduction](./dask_introduction.ipynb)."
]
},
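As a small illustration (assuming a dataset opened with a `chunks=` argument and a variable called `nobs`, as in the Zarr listing further below; both names are assumptions here):

```python
import xarray as xr

ds = xr.open_dataset("c_gls_NDVI_sample.nc", chunks={"lat": 256, "lon": 256})
test = ds["nobs"]

print(type(test.data))        # dask.array.core.Array when chunked, numpy.ndarray otherwise
print(test.chunks)            # chunk sizes along each dimension
print(test.data.npartitions)  # total number of chunks

# .compute() materialises the lazy Dask array as an in-memory NumPy array.
values = test.data.compute()
```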
{
@@ -548,7 +547,7 @@
"\n",
"- Every chunk of a Zarr dataset is stored as a single file (see x.y files in `ls -al test.zarr/nobs`)\n",
"- Each Data array in a Zarr dataset has a two unique files containing metadata:\n",
" - .zattrs for dataset or dataarray general metadata\n",
" - .zattrs for dataset or dataarray general metadatas\n",
" - .zarray indicating how the dataarray is chunked, and where to find them on disk or other storage.\n",
" \n",
"Zarr can be considered as an Analysis Ready, cloud optimized data (ARCO) file format, discussed in [data_discovery](./data_discovery.ipynb) section."
@@ -561,13 +560,13 @@
"source": [
"## Opening multiple NetCDF files and Kerchunk\n",
"\n",
"As shown in the [Data discovery](./data_discovery.ipynb) chapter, when we have several files to read at once, we need to use Xarray `open_mfdataset`. When using `open_mfdataset` with NetCDF files, each NetCDF file is considered as 'one chunk' by default as seen above.\n",
"As shown in the [Data discovery](./data_discovery.ipynb) chapter, when we have several files to read at once, we need to use Xarray `open_mfdataset`. When using `open_mfdataset` with NetCDF files, each NetCDF file is considerd as 'one chunk' by default as seen above.\n",
"\n",
"When calling `open_mfdataset`, Xarray also needs to analyse each NetCDF file to get metadata and tried to build a coherent dataset from them. Thus, it performs multiple operations, like concatenate the coordinate, checking compatibility, etc. This can be time consuming ,especially when dealing with object storage or you have more than thousands of files. And this has to be repeated every time, even if we use exactly the same set of input files for different analysis.\n",
"When calling `open_mfdataset`, Xarray also needs to analyse each NetCDF file to get metadatas and tried to build a coherent dataset from them. Thus, it performs multiple operations, like concartenate the coordinate, checking compatibility, etc. This can be time consuming ,especially when dealing with object storage or you have more than thousands of files. And this has to be repeated every time, even if we use exactly the same set of input files for different analysis.\n",
"\n",
"[Kerchunk library](https://fsspec.github.io/kerchunk/) can build virtual Zarr Dataset over NetCDF files which enables efficient access to the data from traditional file systems or cloud object storage.\n",
"\n",
"And that is not the only optimization kerchunk brings to the Pangeo ecosystem."
"And that is not the only optimisation kerchunk brings to pangeo ecosystem."
]
},
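For reference, a typical `open_mfdataset` call looks like the sketch below (the file pattern is a placeholder, not the notebook's actual data):

```python
import glob
import xarray as xr

# A hypothetical collection of NetCDF files, one per date.
files = sorted(glob.glob("c_gls_NDVI-LTS_*.nc"))

# Each input file becomes one Dask chunk by default; parallel=True lets Dask
# open the files and read their metadata concurrently.
ds = xr.open_mfdataset(files, combine="by_coords", parallel=True)
print(ds.chunks)
```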
{
@@ -577,7 +576,7 @@
"source": [
"### Exploiting native file chunks for reading datasets\n",
"\n",
"As already mentioned, many data formats (for instance [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format), [netCDF4](https://unidata.github.io/netcdf4-python/) with HDF5 backend, [geoTIFF](https://en.wikipedia.org/wiki/GeoTIFF)) have chunk capabilities. Chunks are defined at the creation of each file. Let's call them '__native file chunks__' to distinguish that from '__Dask chunks__'. These native file chunks can be retrieved and used when opening and accessing the files. This will allow to significantly reduce the amount of IOs, bandwidth, and memory usage when analyzing Data Variables.\n",
"As already mentioned, many data formats (for instance [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format), [netCDF4](https://unidata.github.io/netcdf4-python/) with HDF5 backend, [geoTIFF](https://en.wikipedia.org/wiki/GeoTIFF)) have chunk capabilities. Chunks are defined at the creation of each file. Let's call them '__native file chunks__' to distinguish that from '__Dask chunks__'. These native file chunks can be retrieved and used when opening and accessing the files. This will allow to significantly reduce the amount of IOs, bandwith, and memory usage when analyzing Data Variables.\n",
"\n",
"[kerchunk library](https://fsspec.github.io/kerchunk/) can extract native file chunk layout and metadata from each file and combine them into one virtual Zarr dataset."
]
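Native file chunks can be inspected directly, for example with `h5py` or through the variable encoding reported by Xarray (the file and variable names below are assumptions):

```python
import h5py
import xarray as xr

path = "c_gls_NDVI-LTS_1999-2019-0101.nc"   # hypothetical netCDF4/HDF5 file

# netCDF4 files use HDF5 underneath, so h5py exposes the native chunk shape
# chosen when the file was written.
with h5py.File(path, "r") as f:
    print(f["nobs"].chunks)

# Xarray reports the same layout in the variable encoding.
ds = xr.open_dataset(path)
print(ds["nobs"].encoding.get("chunksizes"))
```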
@@ -590,7 +589,8 @@
"### Extract chunk information\n",
"\n",
"We extract native file chunk information from each NetCDF file using `kerchunk.hdf`.\n",
"Let's start with a single file."
"Let's start with a single file.\n",
"\n"
]
},
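A sketch of the single-file case (the file name is a placeholder):

```python
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

url = "c_gls_NDVI-LTS_1999-2019-0101.nc"

# Scan the HDF5 layout of the file and translate it into Zarr-style reference
# metadata: a plain Python dictionary describing every native chunk.
with fsspec.open(url, "rb") as f:
    chunk_info = SingleHdf5ToZarr(f, url, inline_threshold=100).translate()
```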
{
@@ -643,7 +643,7 @@
"source": [
"Let's have a look at `chunk_info`. It is a Python dictionary so we can use `pprint` to print it nicely.\n",
"\n",
"Content is a bit complicated, but it's only metadata in Zarr format indicating what's in the original file, and where the chunks of the file are located (bytes offset). You can try to un comment next line to inspect the content."
"Content is a bit complicated, but it's only metadata in Zarr format indicating what's in the original file, and where the chunks of the file are located (bytes offset). You can try to un comment next line to inspect the content. "
]
},
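For example, assuming `chunk_info` was produced as in the sketch above:

```python
from pprint import pprint

# The "refs" mapping points each Zarr key either to inlined metadata or to a
# (file, byte offset, length) triplet locating a native chunk.
pprint(list(chunk_info["refs"].items())[:5])
```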
{
@@ -732,7 +732,7 @@
"id": "1bb8ac14",
"metadata": {},
"source": [
"Let's first collect the chunk information for each file."
"Let us first collect the chunk information for each file."
]
},
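A sketch of that loop, and of combining the per-file references with `kerchunk.combine.MultiZarrToZarr` (the file pattern and the concatenation dimension are assumptions):

```python
import glob
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

files = sorted(glob.glob("c_gls_NDVI-LTS_1999-2019-*.nc"))

# Extract the native chunk references of every file...
singles = []
for url in files:
    with fsspec.open(url, "rb") as f:
        singles.append(SingleHdf5ToZarr(f, url, inline_threshold=100).translate())

# ...then merge them into one virtual Zarr dataset, concatenating along time.
combined = MultiZarrToZarr(singles, concat_dims=["time"]).translate()
```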
{
@@ -932,7 +932,6 @@
"metadata": {},
"source": [
"The catalog (json file we created) can be shared on the cloud (or GitHub, etc.) and anyone can load it from there too.\n",
"\n",
"This approach allows anyone to easily access LTS data and select the Area of Interest for their own study."
]
},
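Saving the combined references is then just a matter of dumping the dictionary to JSON (assuming `combined` from the sketch above; the output name mirrors the catalogue published below):

```python
import json

# This small JSON file is the whole "catalogue"; it can be copied to object
# storage, GitHub, etc. and opened from there by anyone.
with open("c_gls_NDVI-LTS_1999-2019.json", "w") as f:
    json.dump(combined, f)
```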
@@ -942,8 +941,7 @@
"metadata": {},
"source": [
"We have prepared json file based on 36 netcdf file, and published it online as catalogue=\"https://object-store.cloud.muni.cz/swift/v1/foss4g-catalogue/c_gls_NDVI-LTS_1999-2019.json\"\n",
"\n",
"Let's try to load it."
"We can try to load it.\n"
]
},
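One way to open such a catalogue is through fsspec's reference filesystem, sketched below with the published URL (the exact keyword arguments may differ from the notebook's own code cell):

```python
import xarray as xr

catalogue = (
    "https://object-store.cloud.muni.cz/swift/v1/"
    "foss4g-catalogue/c_gls_NDVI-LTS_1999-2019.json"
)

# The "reference" filesystem reads the JSON catalogue and exposes the 36
# NetCDF files behind it as a single virtual Zarr store.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": catalogue,
            "remote_protocol": "https",
        },
    },
)
print(ds)
```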
{
@@ -993,7 +991,7 @@
"id": "ca50a427-f7ca-497d-b4bf-359b68c56f07",
"metadata": {},
"source": [
"We will use this catalogue in [the Scaling with Dask](./scaling_dask.ipynb) chapter."
"We will use this catalogue in [dask_introduction](./dask_introduction.ipynb) chapter. "
]
},
{