weather-mv will ingest data into BQ from Zarr much faster. #357
Conversation
Found a couple of errors with loading a Zarr dataset into BigQuery.
Overall LGTM.
@dabhicusp / @DarshanSP19, could you please provide the benchmarking results for the current main branch as compared to this specific branch?
```python
paths
| 'OpenChunks' >> xbeam.DatasetToChunks(ds, chunks)
| 'ExtractRows' >> beam.FlatMapTuple(self.chunks_to_rows)
| 'Window' >> beam.WindowInto(window.FixedWindows(60))
```
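For context, here is a minimal stdlib-only sketch of what the `'ExtractRows'` step does conceptually: flattening one chunk of per-variable values into one BQ-style row per coordinate. The function and data shapes here are hypothetical stand-ins; the real transform operates on `xr.Dataset` chunks produced by xarray-beam.

```python
from typing import Dict, Iterator, List


def chunks_to_rows(times: List[str],
                   chunk: Dict[str, List[float]]) -> Iterator[dict]:
    """Flatten a chunk of per-variable value lists into one BigQuery-style
    row dict per coordinate. Hypothetical stand-in for the pipeline's
    'ExtractRows' step, which works on xr.Dataset chunks."""
    for i, time in enumerate(times):
        row = {'time': time}
        for name, values in chunk.items():
            row[name] = values[i]
        yield row


# Two timesteps, two variables -> two rows.
rows = list(chunks_to_rows(
    ['2020-01-01T00:00', '2020-01-01T01:00'],
    {'temperature': [280.1, 281.4], 'humidity': [0.52, 0.49]},
))
```

Each yielded row is a flat dict, which is the shape `WriteToBigQuery` expects downstream.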
LGTM @alxmrs.
Data after running the pipelines for the mentioned cases.
These changes are considerably faster than before for Zarr batch ingestion.
`weather-mv bq`'s previous Zarr ingestion system only used one worker. This PR uses Xarray-Beam for Zarr ingestion, in order to distribute `xr.Dataset` chunks across Beam workers. This improves ingestion into BQ.

Outstanding issues: I can't find a way to incrementally load rows into BQ from Zarr. While I've used windowing on fixed intervals to break up a large ingestion job into smaller parts, it seems like the actual writing to BQ gets stuck in a reshuffle step within the `WriteToBigQuery` transform. In this PR or a future PR, let's try to find a way to incrementally write rows to BQ once they've been processed, instead of having to wait for the entire dataset to be processed. CC: @dabhicusp.
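To illustrate the windowing-on-fixed-intervals idea, here is a small stdlib-only sketch of how fixed windows bucket element timestamps. The bucketing rule (`start = timestamp - timestamp % size`, assuming a zero offset) mirrors the semantics of Beam's `window.FixedWindows(60)`; the helper names are illustrative, not from the PR.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def assign_fixed_window(ts: float, size: float = 60.0) -> Tuple[float, float]:
    """Return the [start, end) fixed window containing `ts`, using the
    zero-offset fixed-window rule: start = ts - (ts % size)."""
    start = ts - (ts % size)
    return start, start + size


def group_by_window(timestamps: List[float],
                    size: float = 60.0) -> Dict[Tuple[float, float], List[float]]:
    """Bucket timestamps into their fixed windows, roughly as a windowed
    grouping would before a per-window write fires."""
    buckets: Dict[Tuple[float, float], List[float]] = defaultdict(list)
    for ts in timestamps:
        buckets[assign_fixed_window(ts, size)].append(ts)
    return dict(buckets)


# 0, 30, 59.9 share one window; 60.0 starts the next; 125.0 the next.
groups = group_by_window([0.0, 30.0, 59.9, 60.0, 125.0])
```

In the actual pipeline, the intent is that each such window's rows could be written to BQ as soon as the window closes, rather than waiting on the whole dataset; the observed blocker is the reshuffle inside `WriteToBigQuery`.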