From e7b915121fd478600c2f0e48c1ec9abf7f81f945 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Alejandro=20=C2=A9?= Date: Thu, 2 Nov 2023 16:10:18 +0000 Subject: [PATCH 1/3] bold overview --- tutorial/part3/chunking_introduction.ipynb | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/tutorial/part3/chunking_introduction.ipynb b/tutorial/part3/chunking_introduction.ipynb index 3837dba..d7571c8 100644 --- a/tutorial/part3/chunking_introduction.ipynb +++ b/tutorial/part3/chunking_introduction.ipynb @@ -15,10 +15,11 @@ "source": [ "## Authors & Contributors\n", "### Authors\n", - "- Tina Odaka, Ifremer (France), [@tinaok](https://github.com/tinaok)\n", + "- Tina Odaka, UMR-LOPS Ifremer (France), [@tinaok](https://github.com/tinaok)\n", "- Pier Lorenzo Marasco, Ispra (Italy), [@pl-marasco](https://github.com/pl-marasco)\n", "\n", "### Contributors\n", + "- Alejandro Coca-Castro, The Alan Turing Institute, [@acocac](https://github.com/acocac)\n", "- Anne Fouilloux, Simula Research Laboratory (Norway), [@annefou](https://github.com/annefou)\n", "- Guillaume Eynard-Bontemps, CNES (France), [@guillaumeeb](https://github.com/guillaumeeb)\n", "\n" ] }, @@ -30,7 +31,7 @@ "metadata": {}, "source": [ "
\n", - " Overview\n", + "Overview\n", "
\n", "
\n", " Questions\n", From 2c3a4600ea0f30b60eb6b39aa3d84744f1610e0f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Alejandro=20=C2=A9?= Date: Thu, 2 Nov 2023 16:18:37 +0000 Subject: [PATCH 2/3] fix typos --- tutorial/part3/chunking_introduction.ipynb | 21 +++++++++++---------- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/tutorial/part3/chunking_introduction.ipynb b/tutorial/part3/chunking_introduction.ipynb index d7571c8..2e57419 100644 --- a/tutorial/part3/chunking_introduction.ipynb +++ b/tutorial/part3/chunking_introduction.ipynb @@ -548,7 +548,7 @@ "\n", "- Every chunk of a Zarr dataset is stored as a single file (see x.y files in `ls -al test.zarr/nobs`)\n", "- Each Data array in a Zarr dataset has a two unique files containing metadata:\n", - " - .zattrs for dataset or dataarray general metadatas\n", + " - .zattrs for dataset or dataarray general metadata\n", " - .zarray indicating how the dataarray is chunked, and where to find them on disk or other storage.\n", " \n", "Zarr can be considered as an Analysis Ready, cloud optimized data (ARCO) file format, discussed in [data_discovery](./data_discovery.ipynb) section." @@ -561,13 +561,13 @@ "source": [ "## Opening multiple NetCDF files and Kerchunk\n", "\n", - "As shown in the [Data discovery](./data_discovery.ipynb) chapter, when we have several files to read at once, we need to use Xarray `open_mfdataset`. When using `open_mfdataset` with NetCDF files, each NetCDF file is considerd as 'one chunk' by default as seen above.\n", + "As shown in the [Data discovery](./data_discovery.ipynb) chapter, when we have several files to read at once, we need to use Xarray `open_mfdataset`. When using `open_mfdataset` with NetCDF files, each NetCDF file is considered as 'one chunk' by default as seen above.\n", "\n", - "When calling `open_mfdataset`, Xarray also needs to analyse each NetCDF file to get metadatas and tried to build a coherent dataset from them. Thus, it performs multiple operations, like concartenate the coordinate, checking compatibility, etc. This can be time consuming ,especially when dealing with object storage or you have more than thousands of files. And this has to be repeated every time, even if we use exactly the same set of input files for different analysis.\n", + "When calling `open_mfdataset`, Xarray also needs to analyse each NetCDF file to get metadata and tried to build a coherent dataset from them. Thus, it performs multiple operations, like concatenate the coordinate, checking compatibility, etc. This can be time consuming ,especially when dealing with object storage or you have more than thousands of files. And this has to be repeated every time, even if we use exactly the same set of input files for different analysis.\n", "\n", "[Kerchunk library](https://fsspec.github.io/kerchunk/) can build virtual Zarr Dataset over NetCDF files which enables efficient access to the data from traditional file systems or cloud object storage.\n", "\n", - "And that is not the only optimisation kerchunk brings to pangeo ecosystem." + "And that is not the only optimization kerchunk brings to the Pangeo ecosystem." ] }, { @@ -577,7 +577,7 @@ "source": [ "### Exploiting native file chunks for reading datasets\n", "\n", - "As already mentioned, many data formats (for instance [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format), [netCDF4](https://unidata.github.io/netcdf4-python/) with HDF5 backend, [geoTIFF](https://en.wikipedia.org/wiki/GeoTIFF)) have chunk capabilities. 
Chunks are defined at the creation of each file. Let's call them '__native file chunks__' to distinguish that from '__Dask chunks__'. These native file chunks can be retrieved and used when opening and accessing the files. This will allow to significantly reduce the amount of IOs, bandwith, and memory usage when analyzing Data Variables.\n", + "As already mentioned, many data formats (for instance [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format), [netCDF4](https://unidata.github.io/netcdf4-python/) with HDF5 backend, [geoTIFF](https://en.wikipedia.org/wiki/GeoTIFF)) have chunk capabilities. Chunks are defined at the creation of each file. Let's call them '__native file chunks__' to distinguish that from '__Dask chunks__'. These native file chunks can be retrieved and used when opening and accessing the files. This will allow to significantly reduce the amount of IOs, bandwidth, and memory usage when analyzing Data Variables.\n", "\n", "[kerchunk library](https://fsspec.github.io/kerchunk/) can extract native file chunk layout and metadata from each file and combine them into one virtual Zarr dataset." ] @@ -590,8 +590,7 @@ "### Extract chunk information\n", "\n", "We extract native file chunk information from each NetCDF file using `kerchunk.hdf`.\n", - "Let's start with a single file.\n", - "\n" + "Let's start with a single file." ] }, { @@ -644,7 +643,7 @@ "source": [ "Let's have a look at `chunk_info`. It is a Python dictionary so we can use `pprint` to print it nicely.\n", "\n", - "Content is a bit complicated, but it's only metadata in Zarr format indicating what's in the original file, and where the chunks of the file are located (bytes offset). You can try to un comment next line to inspect the content. " + "Content is a bit complicated, but it's only metadata in Zarr format indicating what's in the original file, and where the chunks of the file are located (bytes offset). You can try to un comment next line to inspect the content." ] }, { @@ -733,7 +732,7 @@ "id": "1bb8ac14", "metadata": {}, "source": [ - "Let us first collect the chunk information for each file." + "Let's first collect the chunk information for each file." ] }, { @@ -933,6 +932,7 @@ "metadata": {}, "source": [ "The catalog (json file we created) can be shared on the cloud (or GitHub, etc.) and anyone can load it from there too.\n", + "\n", "This approach allows anyone to easily access LTS data and select the Area of Interest for their own study." ] }, @@ -942,7 +942,8 @@ "metadata": {}, "source": [ "We have prepared json file based on 36 netcdf file, and published it online as catalogue=\"https://object-store.cloud.muni.cz/swift/v1/foss4g-catalogue/c_gls_NDVI-LTS_1999-2019.json\"\n", - "We can try to load it.\n" + "\n", + "Let's try to load it." ] }, { From 451211e2cfb9609c5a5288beee563e1f6169cb8f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Alejandro=20=C2=A9?= Date: Thu, 2 Nov 2023 16:22:48 +0000 Subject: [PATCH 3/3] change dask crossreference --- tutorial/part3/chunking_introduction.ipynb | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/tutorial/part3/chunking_introduction.ipynb b/tutorial/part3/chunking_introduction.ipynb index 2e57419..b0f59ac 100644 --- a/tutorial/part3/chunking_introduction.ipynb +++ b/tutorial/part3/chunking_introduction.ipynb @@ -57,7 +57,7 @@ "\n", "When dealing with large data files or collections, it's often impossible to load all the data you want to analyze into a single computer's RAM at once. 
This is a situation where the Pangeo ecosystem can help you a lot. Xarray offers the possibility to work lazily on data __chunks__, which means pieces of an entire dataset. By reading a dataset in __chunks__ we can process our data piece by piece on a single computer and even on a distributed computing cluster using Dask (Cloud or HPC for instance).\n", "\n", - "How we will process these 'chunks' in a parallel environment will be discussed in [dask_introduction](./dask_introduction.ipynb). The concept of __chunk__ will be explained here.\n", + "How we will process these 'chunks' in a parallel environment will be discussed in [the Scaling with Dask](./scaling_dask.ipynb). The concept of __chunk__ will be explained here.\n", "\n", "When we process our data piece by piece, it's easier to have our input or ouput data also saved in __chunks__. [Zarr](https://zarr.readthedocs.io/en/stable/) is the reference library in the Pangeo ecosystem to save our Xarray multidimentional datasets in __chunks__.\n", "\n", @@ -378,7 +378,7 @@ "source": [ "` test.data` is the backend array Python representation of Xarray's Data Array, [__Dask Array__](https://docs.dask.org/en/stable/array.html) when using chunking, Numpy by default.\n", "\n", - "We will introduce Dask arrays and Dask graphs visualization in the next section [dask_introduction](./dask_introduction.ipynb)." + "We will introduce Dask arrays and Dask graphs visualization in the next section [Scaling with Dask](./scaling_dask.ipynb)." ] }, { @@ -993,7 +993,7 @@ "id": "ca50a427-f7ca-497d-b4bf-359b68c56f07", "metadata": {}, "source": [ - "We will use this catalogue in [dask_introduction](./dask_introduction.ipynb) chapter. " + "We will use this catalogue in [the Scaling with Dask](./scaling_dask.ipynb) chapter." ] }, {
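The chunked, lazy reading and the Zarr output that the patched text describes can be sketched roughly as follows. This is a minimal illustration, not code from the notebook: the file name, variable name (`NDVI`) and chunk sizes are assumptions.

```python
import xarray as xr

# Hypothetical NetCDF file and chunk sizes, for illustration only.
# Passing `chunks=` makes Xarray wrap each variable in a Dask array,
# so data is read lazily, one chunk at a time.
ds = xr.open_dataset(
    "c_gls_NDVI_sample.nc",
    chunks={"time": 1, "lat": 2000, "lon": 2000},
)

print(ds["NDVI"].data)    # a dask.array, not an in-memory numpy array
print(ds["NDVI"].chunks)  # the Dask chunk layout

# Writing to Zarr stores each chunk as its own file/object,
# alongside the .zattrs / .zarray metadata discussed above.
for var in ds.data_vars:
    ds[var].encoding.pop("chunks", None)  # avoid clashes with the NetCDF native chunk encoding
ds.to_zarr("test.zarr", mode="w")
```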
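The Kerchunk workflow referred to in patch 2 (extract native chunk information per file, combine the references, share them as a JSON catalogue, then open the virtual dataset) could look roughly like the sketch below. It is an untested outline under assumptions: the glob pattern, `concat_dims`/`identical_dims` and output name are placeholders, not values from the notebook.

```python
import glob
import json

import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

# Hypothetical local copies of the NetCDF files.
files = sorted(glob.glob("data/c_gls_NDVI-LTS_*.nc"))

# 1. Extract the native HDF5/NetCDF4 chunk layout of each file as Zarr-style references.
singles = []
for path in files:
    with fsspec.open(path, mode="rb") as f:
        singles.append(SingleHdf5ToZarr(f, path, inline_threshold=100).translate())

# 2. Combine the per-file references into one virtual Zarr dataset.
#    The dimension names here are assumptions about the file layout.
mzz = MultiZarrToZarr(singles, concat_dims=["time"], identical_dims=["lat", "lon"])
combined = mzz.translate()

# 3. Save the references as a JSON "catalogue" that can be shared and reused,
#    so the NetCDF files never need to be re-scanned.
with open("combined.json", "w") as f:
    json.dump(combined, f)

# 4. Open the virtual dataset through fsspec's reference filesystem.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": "combined.json"},
    },
)
print(ds)
```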