An example of simple batch and streaming queries, implemented with PySpark (the Python API of Apache Spark™) on a lightning flash dataset collected from NOAA's GLM.
Uses the Apache Parquet file format as the storage backend and Dagster Software-Defined Assets to orchestrate the batch/stream processing pipeline.
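As a quick taste of the batch side, here is a minimal PySpark sketch that aggregates flashes per hour from a Parquet dataset. The path and column names (`timestamp`, `flash_energy`) are illustrative assumptions, not the project's actual schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical example: path and column names are assumptions, not the real schema.
spark = SparkSession.builder.appName("glm-batch-example").getOrCreate()

# Batch query: read the Parquet sink and aggregate flash counts/energy per hour.
flashes = spark.read.parquet("data/flashes.parquet")  # assumed path
hourly = (
    flashes
    .withColumn("hour", F.date_trunc("hour", F.col("timestamp").cast("timestamp")))
    .groupBy("hour")
    .agg(
        F.count("*").alias("flash_count"),
        F.avg("flash_energy").alias("avg_flash_energy"),  # assumed column
    )
    .orderBy("hour")
)
hourly.show(truncate=False)
```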
Blog post: Lightning Streams: PySpark Batch & Streaming Queries
*Dagster + PySpark + Parquet*
First, make sure you have the requirements installed; they can be installed from the project directory via pip:
pip install .  # <= python3.11
Run the following command to start the Dagster orchestration framework:
dagster dev  # start the Dagster daemon and Dagit UI
The Dagster daemon is required for scheduling; from the Dagit UI you can run and monitor the data assets.
ETL pipeline data assets (a minimal Dagster sketch follows this list):

- `Source`: extracts NOAA GOES-R GLM file datasets from an AWS S3 bucket.
- `Transformations`: transforms the dataset into a time series CSV.
- `Sink`: loads the dataset into persistent storage.
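The following is a minimal sketch of how a source → transformation → sink chain can be expressed with Dagster Software-Defined Assets; the asset names and bodies are placeholders, not the project's actual implementation.

```python
import pandas as pd
from dagster import Definitions, asset

# Hypothetical sketch of the ETL asset group; names and logic are placeholders.

@asset
def source() -> pd.DataFrame:
    """Extract: pull GLM records (stubbed here with an in-memory frame)."""
    return pd.DataFrame({"flash_energy": [1.2e-14, 3.4e-14], "flash_lat": [29.7, 30.1]})

@asset
def transformations(source: pd.DataFrame) -> pd.DataFrame:
    """Transform: reshape the extracted records into a time-series-friendly frame."""
    return source.reset_index(names="event_id")

@asset
def sink(transformations: pd.DataFrame) -> None:
    """Load: persist the transformed frame (CSV here as a stand-in for the real backend)."""
    transformations.to_csv("glm_timeseries.csv", index=False)

defs = Definitions(assets=[source, transformations, sink])
```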
The `Sink` loading process has been refactored to use PySpark (batch and Structured Streaming queries) with Parquet as the storage backend.
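Below is a minimal sketch of what a Structured Streaming version of the sink might look like, watching a directory for newly arriving CSVs and appending them to a Parquet sink; the paths, schema, and trigger interval are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

# Hypothetical sketch: paths and schema are assumptions, not the project's actual layout.
spark = SparkSession.builder.appName("glm-stream-example").getOrCreate()

schema = StructType([
    StructField("timestamp", StringType()),
    StructField("flash_lat", DoubleType()),
    StructField("flash_lon", DoubleType()),
    StructField("flash_energy", DoubleType()),
])

# Structured Streaming query: watch a landing directory and append new rows to Parquet.
query = (
    spark.readStream.schema(schema)
    .option("header", "true")
    .csv("data/incoming_csv/")                        # assumed landing directory
    .writeStream.format("parquet")
    .option("path", "data/flashes.parquet")           # assumed sink path
    .option("checkpointLocation", "data/_checkpoints/flashes")
    .trigger(processingTime="30 seconds")
    .start()
)
query.awaitTermination()
```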
*ETL data asset group*
Blog post: Exploratory Data Analysis with Lightning Streaming Pipeline
*Materializing the Lightning clustering pipeline*
Ingests the data needed for a specified time window (start and end dates).
- `ingestor`: composed of the `extract`, `transform`, and `load` data assets.
  - `extract`: downloads NOAA GOES-R GLM netCDF files from an AWS S3 bucket.
  - `transform`: converts the GLM netCDF files into time and geo series CSVs.
  - `load`: loads the CSVs into a local persistent backend, DuckDB (a sketch of this step follows the list).
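As referenced above, here is a minimal sketch of the load step: it takes the start/end window as plain parameters and appends rows from a transformed CSV into a local DuckDB file. The file names, table name, and column layout are assumptions, not the project's actual code.

```python
import duckdb

# Hypothetical sketch of the load step; file names, table name, and columns are assumptions.
def load_csv_to_duckdb(csv_path: str, start: str, end: str,
                       db_path: str = "glm.duckdb") -> None:
    """Append rows from a transformed GLM CSV into a local DuckDB table,
    keeping only events inside the requested start/end time window."""
    con = duckdb.connect(db_path)
    # Create the table on first run, copying the CSV's schema but no rows.
    con.execute(
        f"CREATE TABLE IF NOT EXISTS flashes AS "
        f"SELECT * FROM read_csv_auto('{csv_path}') WHERE 1 = 0"
    )
    # Append only the rows that fall inside the requested window.
    con.execute(
        f"INSERT INTO flashes "
        f"SELECT * FROM read_csv_auto('{csv_path}') "
        f"WHERE timestamp BETWEEN ? AND ?",
        [start, end],
    )
    con.close()


# Example: ingest a single day's window.
load_csv_to_duckdb("glm_timeseries.csv", "2023-07-01", "2023-07-02")
```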
Groups the ingested data using the K-Means clustering algorithm.
*Visual of the clustering process*
- `preprocessor`: prepares the data for the clustering model; cleans and normalizes the data.
- `kmeans_cluster`: fits the data to an implementation of the k-means clustering algorithm (a minimal sketch of the clustering and evaluation steps follows this list).
- `silhouette_evaluator`: evaluates the choice of 'k' clusters by calculating the silhouette coefficient for each k in a defined range.
- `elbow_evaluator`: evaluates the choice of 'k' clusters by calculating the sum of squared distances for each k in a defined range.
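To make the evaluation steps concrete, here is a minimal sketch that fits k-means over a range of k and computes both the silhouette coefficient and the inertia used for the elbow method. It uses scikit-learn on synthetic, normalized lat/lon points as an assumed setup, not the project's actual model code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Hypothetical sketch: random points stand in for the real flash lat/lon features.
rng = np.random.default_rng(42)
points = rng.normal(loc=[[29.5, -95.0]], scale=0.5, size=(500, 2))

# preprocessor: clean/normalize the features before clustering.
features = StandardScaler().fit_transform(points)

# Evaluate the choice of k over a defined range with both criteria.
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    silhouette = silhouette_score(features, model.labels_)  # silhouette_evaluator
    inertia = model.inertia_                                # elbow_evaluator (sum of squared distances)
    print(f"k={k}  silhouette={silhouette:.3f}  inertia={inertia:.1f}")
```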
*Clustering analysis data assets*
*Lightning clustering map*
Use the following command to run tests:
pytest