Kerchunk Google Summer of Code 2022

A blog summarising my work is available at https://medium.com/pangeo/accessing-netcdf-and-grib-file-collections-as-cloud-native-virtual-datasets-using-kerchunk-625a2d0a9191

This document will be used to record the planning and progress of the Kerchunk GSOC project discussed in fsspec/kerchunk#166

Student: Peter Marsh

Mentors: Rich Signell, Martin Durant

The work I have done over the past few months has ranged from helping to troubleshoot errors raised by the community and exploring ways to optimise and speed up the existing code base, to creating convenience functions for common workflows and a new tutorial covering the full range of functionality offered by Kerchunk. I have also written an article highlighting the functionality of Kerchunk, and provided two tutorials that I created during my time as a GSOC contributor.

GSOC Schedule

May 20 - June 12: Community Bonding Period

June 13 - July 25: Phase 1 (July 29: Evaluation Due)

July 25 - September 4: Phase 2 (September 5 - September 12: Submit Final Evaluations)

GSOC Progress Monitor

Week 1 (25 May - 1 June)

  • Added to ESIP and IOOS Slack to communicate with mentors
  • Added to and configured the ESIP qhub JupyterLab environment
  • Read through kerchunk gitter chat history to get up to speed
  • Reproduced Kerchunk example notebooks, and updated notebooks which did not work on qhub
  • Made my first pull request with the updated examples!

Week 2 (1 June - 8 June)

  • Configured nbstripout to remove the outputs of Jupyter notebooks when committing to GitHub. This allows for easy diffing, where otherwise cell outputs make it impossible to see the actual changes in code
  • Updated the GOES16 example to the newer Kerchunk spec, to understand how Kerchunk has changed from using xarray.open_dataset() to generate metadata to using native h5py and zarr methods
  • Made my first (very small) comment on an issue in the kerchunk repo helping a user solve a json vs ujson confusion
  • Attended the GSOC 2022 virtual summit
  • Configured a workflow to create a Kerchunk sidecar file for the ERA5 public dataset on AWS
  • While working with the ERA5 dataset, encountered the issue of dealing with fill_value and set up a method to solve this using the postprocess argument, although I am still working on this to find a potentially neater solution
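
As an illustration of the postprocess idea mentioned above, here is a minimal sketch. It assumes the postprocess callable receives the flat references dict (with each `.zarray` entry stored as a JSON string) and returns it. The function name `clear_fill_values` and the choice of `None` as the replacement are assumptions for illustration, not necessarily what the ERA5 workflow used.

```python
import json


def clear_fill_values(refs, new_fill_value=None):
    """Rewrite fill_value in every .zarray entry of a references dict.

    Hypothetical helper: `refs` is assumed to map keys such as
    "t2m/.zarray" to JSON strings and chunk keys to [url, offset, length].
    """
    for key in list(refs):
        if key.endswith(".zarray"):
            meta = json.loads(refs[key])
            meta["fill_value"] = new_fill_value
            refs[key] = json.dumps(meta)
    return refs


# Example references dict with one array and one chunk (made-up values).
example = {
    "t2m/.zarray": json.dumps({"shape": [2], "fill_value": 9999.0}),
    "t2m/0": ["s3://bucket/file.nc", 0, 8],
}
fixed = clear_fill_values(example)
print(fixed["t2m/.zarray"])
```

Only the array metadata is touched; chunk references pass through unchanged. Whether `None` or some other value is appropriate depends on the dataset.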

Week 3 (8 June - 15 June)

  • Opened issue fsspec/kerchunk#176 regarding the need to run consolidate again after postprocess when writing output to json
  • Finalised ERA5-pds workflow and opened virtual dataset using kerchunk sidecar file
  • Set up a notebook to explore Kerchunk's handling of fill_value

Week 4 (15 - 22 June)

  • Made pull request fsspec/kerchunk#180 to run postprocess before consolidate, which has now been merged into the Kerchunk repo
  • Made comment fsspec/kerchunk#177 (comment) in the fill_value vs _FillValue saga, to clarify that xarray only considers fill_value when opening from zarr
  • Explored methods to speed up MultiZarrToZarr by running in parallel and opened fsspec/kerchunk#182 regarding this
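
One general pattern for this kind of speed-up is to combine the inputs in batches, run the batches concurrently, and then combine the partial results. The sketch below shows only that pattern and is not necessarily the approach discussed in the issue. The `combine` function is a hypothetical stand-in that just merges dicts; in a real workflow it would presumably wrap the actual combine step.

```python
from concurrent.futures import ThreadPoolExecutor


def batched(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def combine(batch):
    """Hypothetical stand-in for the real combine step: merges dicts."""
    merged = {}
    for refs in batch:
        merged.update(refs)
    return merged


def combine_in_batches(inputs, batch_size=2, max_workers=4):
    """Combine each batch concurrently, then combine the partial results."""
    batches = batched(inputs, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial = list(pool.map(combine, batches))
    return combine(partial)


inputs = [{"a/0": 1}, {"a/1": 2}, {"a/2": 3}]
print(combine_in_batches(inputs))
```

Threads suit I/O-bound work such as reading remote files; a CPU-bound combine step would need a process pool or a distributed scheduler instead.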

Week 5 (22 - 29 June)

Week 6 (29 - 6 July)

Week 7 (6 - 13 July)

Week 8 (13 - 20 July)

  • Expanded the prose and descriptions of the tutorials
  • Renamed the original Kerchunk tutorial to quick start and added the ERA5 tutorial to the kerchunk docs in fsspec/kerchunk#193
  • Had an initial go at adding a convenience function to merge variables into existing datasets in fsspec/kerchunk#196

Week 9 (20 - 27 July)

Week 10 (27 - 03 August)

  • Updated HRRR to utilise the new scan_grib module, and updated the case study in the Kerchunk docs to match (fsspec/kerchunk#206)
  • Investigated modifying combine to accommodate a list-of-lists input from scan_grib, but decided the user should instead write each GRIB message as an individual JSON file, which the above case study now reflects. Still to check whether changing this slightly to use fs.cat() could provide a speed-up. https://github.com/fsspec/kerchunk/compare/main...peterm790:kerchunk:grib2_combine
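
The "one JSON per GRIB message" idea can be sketched roughly as follows, assuming that scanning one GRIB file yields a list with one reference dict per message. The helper name `write_message_refs` and the file naming scheme are hypothetical, and the scan itself is left out so the example stays self-contained.

```python
import json
from pathlib import Path


def write_message_refs(messages, out_dir, stem):
    """Write each per-message reference dict to its own JSON file.

    Hypothetical helper: `messages` is assumed to be a list of
    reference dicts, one per GRIB message. Returns the written paths.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, refs in enumerate(messages):
        path = out_dir / f"{stem}_message{i:03d}.json"
        path.write_text(json.dumps(refs))
        paths.append(str(path))
    return paths
```

The returned flat list of paths can then be passed to a combine step as ordinary single-reference inputs, which avoids any need for nested-list handling.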

Week 11 (03 - 10 August)

  • Created a script to generate a dashboard from the ERA5 kerchunked data. However, this is not running very smoothly, which I suspect may be due to the very small chunk sizes in the original NetCDF files.
  • Updated the ERA5 tutorial to be native reStructuredText so that it no longer relies on pandoc. This is now in a new PR, fsspec/kerchunk#208, and the older PR has been closed.
  • Experimented with a way to open the range of HRRR grib messages in an xarray datatree. This works to some extent but definitely still needs some work. https://gist.github.com/peterm790/b844fe0410d399f9ad8658377c744149
  • Modified LiveOcean forecast reference update script to utilise etags to monitor file changes. #6 (comment)
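
The ETag comparison behind this kind of update script can be sketched as below. This is a minimal illustration rather than the LiveOcean script itself: the function name `changed_files` is hypothetical, and how the ETags are fetched is left out because it depends on the storage backend.

```python
def changed_files(previous, current):
    """Return paths whose ETag is new or differs from the stored one.

    Both arguments are assumed to be dicts mapping file path -> ETag.
    """
    return sorted(
        path for path, etag in current.items() if previous.get(path) != etag
    )


previous = {"f001.nc": "abc", "f002.nc": "def"}
current = {"f001.nc": "abc", "f002.nc": "xyz", "f003.nc": "123"}
print(changed_files(previous, current))
```

Only the files returned here would need their references regenerated; unchanged files can reuse the references from the previous run.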

Week 12 (10 - 17 August)

Week 13 (17 - 24 August)

Week 14 (24 - 31 August)

Week 15 (31 - 07 September)

  • Completed a final article highlighting the functionality of Kerchunk and describing the outcomes of the ERA5 and HRRR tutorials I set up. This article will soon be published on the Pangeo Medium site.