A blog summarising my work is available at https://medium.com/pangeo/accessing-netcdf-and-grib-file-collections-as-cloud-native-virtual-datasets-using-kerchunk-625a2d0a9191
This document will be used to record the planning and progress of the Kerchunk GSOC project discussed in fsspec/kerchunk#166
Student: Peter Marsh Mentors: Rich Signell, Martin Durant
The work I have done over the past few months has ranged from helping to troubleshoot errors raised by the community and exploring ways to optimise and speed up the existing code base, to creating convenience functions for common workflows and a new tutorial covering the full range of functionality offered by Kerchunk. I have also written an article highlighting the functionality of Kerchunk, and produced two tutorials during my time as a GSOC contributor.
May 20 - June 12: Community Bonding Period
June 13 - July 25: Phase 1 (July 29 Evaluation Due)
July 25 - September 4: Phase 2 (September 5 - September 12 Submit final Evaluations)
- Added to ESIP and IOOS Slack to communicate with mentors
- Added to and configured the ESIP qhub JupyterLab environment
- Read through kerchunk gitter chat history to get up to speed
- Reproduced Kerchunk example notebooks, and updated notebooks which did not work on qhub
- Made my first pull request with the updated examples!
- Configured nbstripout to remove the outputs of Jupyter notebooks when committing to GitHub. This allows for easy diffing, where otherwise cell outputs make it impossible to see the actual changes in code
- Updated Goes16 example to the newer Kerchunk spec, to understand how the Kerchunk spec has changed from using `xarray_open_dataset()` to generate metadata to using native `h5py` and `zarr` methods
- Made my first (very small) comment on an issue in the kerchunk repo, helping a user solve a json vs ujson confusion
- Attended the GSOC 2022 virtual summit
- Configured a workflow to create a kerchunk sidecar file for the ERA5 public dataset on AWS.
- While working with the ERA5 dataset, encountered the issue of dealing with fill_value and set up a method to solve this using the postprocess argument, although I am still working on this to find a potentially neater solution.
- Opened issue fsspec/kerchunk#176 regarding the need to run consolidate again after postprocess when writing output to json
- Finalised ERA5-pds workflow and opened virtual dataset using kerchunk sidecar file
- Set up a notebook to explore Kerchunk's handling of fill_value
- Made pull request fsspec/kerchunk#180 to run postprocess before consolidate, which has now been merged into kerchunk repo
- Made comment fsspec/kerchunk#177 (comment) in the fill_value vs _FillValue saga, to clarify that xarray only considers fill_value when opening from zarr
- Explored methods to speed up `MultiZarrToZarr` by running in parallel and opened fsspec/kerchunk#182 regarding this
- Made initial investigation into automatically converting NCL XML virtual datasets into kerchunk datasets (#7)
- Set up a workflow to produce a kerchunk virtual dataset for the NWM ensemble. (#8) (https://discourse.pangeo.io/t/efficient-access-of-ensemble-data-on-aws/2530/7)
- Confirmed fsspec/kerchunk#183 significantly speeds up the combine step in #5 (comment)
- Created a quick tutorial solving a user's issue using the coo_map method in fsspec/kerchunk#184 (comment)
- Expanded the ERA5 sidecar files to a large selection of variables
- Discovered it is possible to add variables to an existing kerchunk sidecar file by simply using the python update dictionary method https://nbviewer.org/gist/peterm790/5015b90bb858fcd8ba922c5f764adf4d
- Set up a tutorial of how to open a kerchunk file mapping the ERA5 dataset and how to construct a simple sidecar file for the dataset https://nbviewer.org/gist/peterm790/23bb7a1484e576fa943e0b7e6c69d2e5
- Set up https://github.com/peterm790/ERA5_Kerchunk_tutorial which contains a simple tutorial to generate a sidecar file for ERA5 as well as an extended tutorial which runs through a number of different examples of using `MultiZarrToZarr.combine`
- Expanded the prose and descriptions of the tutorials
- Renamed the original Kerchunk tutorial to quick start and added the ERA5 tutorial to the kerchunk docs in fsspec/kerchunk#193
- Had an initial go at adding a convenience function to merge variables to existing datasets in fsspec/kerchunk#196
- Modified the merge_vars convenience function, which is now merged into main
- Configured a docker image containing a pangeo python environment https://ghcr.io/peterm790/pangeo and a minimal kerchunk environment https://ghcr.io/peterm790/kerchunk utilising https://github.com/iameskild/repo2registry for use with kbatch and cronjob scripts, which means kerchunk files can now be updated daily.
- Set up an example script to create and open LiveOcean forecast data. Discussions on this are now tracked at #6
- Updated HRRR to utilise the new scan_grib module, and updated the case study in the Kerchunk docs to match this fsspec/kerchunk#206
- Investigated modifying combine to accommodate a list-of-lists input from scan_grib, but decided the user should instead write each grib message as an individual json, which the above case study now reflects. Still to check whether changing this slightly to use `fs.cat()` could provide a speed-up: https://github.com/fsspec/kerchunk/compare/main...peterm790:kerchunk:grib2_combine
- Created a script to generate a dashboard from the ERA5 kerchunked data. This however is not running very smoothly, and I suspect this may be to do with the very small chunk sizes in the original NetCDF files.
- Updated the ERA5 tutorial to instead be native reStructuredText and no longer rely on pandoc. This is now in a new PR fsspec/kerchunk#208 and the older PR has been closed.
- Experimented with a way to open the range of HRRR grib messages in an xarray datatree. This works to some extent but definitely still needs some work. https://gist.github.com/peterm790/b844fe0410d399f9ad8658377c744149
- Modified LiveOcean forecast reference update script to utilise etags to monitor file changes. #6 (comment)
- Meeting with Eskild from Quansight to understand how kbatch works and troubleshoot.
- Spent some time understanding Kubernetes and trying to configure K9s
- New tutorial fsspec/kerchunk#208 now merged.
- Updated HRRR case study in docs to point to new gist: https://nbviewer.org/gist/peterm790/92eb1df3d58ba41d3411f8a840be2452
- Had a meeting with Parker MacCready to troubleshoot implementing the LiveOcean kerchunk workflow.
- Set up a gist to investigate to what extent `combine` can be sped up and continued discussion in fsspec/kerchunk#200 (comment)
- This led to the `fs.cat` implementation being merged: fsspec/kerchunk#213
- Had an initial go at configuring `combine` to run in parallel, here simply as a convenience function that takes the same arguments as `MultiZarrToZarr`: https://github.com/fsspec/kerchunk/compare/main...peterm790:kerchunk:dask_convenience_function
- PR to Pangeo forge to fix Kerchunk reference argument error, merged to main pangeo-forge/pangeo-forge-recipes#399
- Set up tutorial to open HRRR forecast as a datatree using Kerchunk: https://nbviewer.org/gist/peterm790/2439b1fe5fc781a9cc40281c9855affe
- Completed a final article highlighting the functionality of kerchunk as well as describing the outcomes of the ERA5 and HRRR tutorials I set up. This article will soon be published on the Pangeo Medium site.
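
As a rough illustration of the dictionary-update approach noted in the log above for adding variables to an existing sidecar file, here is a minimal sketch. It assumes the version-1 reference layout (a "refs" dictionary whose chunk keys map to a URL, byte offset and byte length). The helper names `make_refs` and `add_variables`, the bucket URLs and the byte ranges are all made up for illustration; this is not the merge_vars function that was merged into kerchunk.

```python
import json


def make_refs(var_name, url):
    """Build a tiny, made-up version-1 reference set holding one variable."""
    zarray = {
        "zarr_format": 2,
        "shape": [2],
        "chunks": [1],
        "dtype": "<f4",
        "compressor": None,
        "filters": None,
        "fill_value": None,
        "order": "C",
    }
    return {
        "version": 1,
        "refs": {
            ".zgroup": json.dumps({"zarr_format": 2}),
            f"{var_name}/.zarray": json.dumps(zarray),
            f"{var_name}/.zattrs": json.dumps({"_ARRAY_DIMENSIONS": ["time"]}),
            # Each chunk key maps to [url, byte offset, byte length].
            f"{var_name}/0": [url, 0, 4],
            f"{var_name}/1": [url, 4, 4],
        },
    }


def add_variables(existing, new):
    """Return a copy of `existing` with the references from `new` added."""
    # Copy the refs first so the original reference set is left untouched.
    merged = {"version": existing["version"], "refs": dict(existing["refs"])}
    # Keys shared by both sets (here only ".zgroup") are overwritten by `new`.
    merged["refs"].update(new["refs"])
    return merged


# Hypothetical sidecar contents for two single-variable datasets.
temperature = make_refs("t2m", "s3://example-bucket/t2m.nc")
pressure = make_refs("msl", "s3://example-bucket/msl.nc")
merged = add_variables(temperature, pressure)

# List the variables now present in the merged reference set.
variables = sorted({key.split("/")[0] for key in merged["refs"] if "/" in key})
print(variables)
```

This only works cleanly when the two reference sets describe variables on the same coordinates, since any shared keys are simply replaced by the values from the second set.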