Avoid the regret of building Jupyter notebooks that end up in production: pipelines all the way down for the win!
- Give Data Scientists a better way to start production data apps.
- Give Data Engineers better material for productionalization.
It's easy to implement ideas in Jupyter, but it's difficult to implement Jupyter notebooks in production. Production data apps need supportable and resilient data pipeline foundations. Pipeline_gen helps you start your project with pipelines in mind.
The benefits depend on who you are:
- Just starting out in Data Science? Want to make a good impression on Data Engineers during your technical interviews? You'll set yourself apart by having at least briefly considered the impact your future efforts may have on their weekends.
- Seasoned data scientist? Hopefully you're already developing your EDA and model code with pipelines in mind. Pipeline_gen guides (and gently forces) you to develop code that can be more easily scaled up by the data engineering team.
- Seasoned data engineer? Pipeline_gen shouldn't be the tool for your production pipelines, but if provided to your data scientists, it will guide them toward producing more deployable and scalable code, making your future self happier :)
- Full-stack data scientist (aka data scientist and data engineer)? You can productionalize this in limited situations, and it may hold you over until your employer can hire a real data engineer.
- Separate the orchestration from the data processing: Airflow by default expects to run code in the Airflow Python environment itself. It's better to perform all data processing in a Docker image designed and built for that purpose. Airflow's DockerOperator conveniently pairs the orchestrator with the pipeline environment (see the DockerOperator sketch after this list).
- Customize sparingly: do not add anything that is not critical to a basic data pipeline. The code should deviate as little as possible from the most basic implementation of Airflow.
- Steer the user toward good data pipelining practices such as atomicity and idempotency (see the idempotency sketch after this list).
- Keep the orchestrator out of the way: Airflow is an extremely powerful and flexible tool, but it can be daunting to new users who haven't considered the data engineering implications of deploying data science work.
- Make code development easy with Jupyter: the application spins up a local Jupyter environment powered by the pipeline.
- Develop with Ops in mind: use the local Airflow to deploy, monitor, and troubleshoot your pipeline from inception.
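
As a rough sketch of the "separate orchestration from processing" principle, a task can be handed to a container built from the pipeline image instead of running on the Airflow Python itself. The import path below assumes an Airflow 1.10-style install to match the example DAG later in this README, and `process_data.py` is a hypothetical entry point baked into the image, not something this project ships:

```python
from datetime import datetime

from airflow import DAG
# Import path assumption: Airflow 1.10-style. On Airflow 2.x this operator
# lives in airflow.providers.docker.operators.docker instead.
from airflow.operators.docker_operator import DockerOperator

dag = DAG(
    "dockerized_pipeline_sketch",
    description="Sketch: run processing inside the pipeline image",
    schedule_interval=None,
    start_date=datetime(2023, 4, 16),
    catchup=False,
)

process_data = DockerOperator(
    task_id="process_data",
    # The image built in the setup steps below.
    image="yourusername/airflow-orchestration-environment",
    # Hypothetical script inside that image.
    command="python process_data.py",
    # Assumes the host's Docker socket is reachable from the Airflow containers.
    docker_url="unix://var/run/docker.sock",
    dag=dag,
)
```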
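
And as a sketch of what atomicity and idempotency look like inside a task (the function name, schema, and paths here are hypothetical, purely for illustration): write each run's output to a partition derived from the run date, build it in a temp file, and rename it into place.

```python
import csv
import os
from pathlib import Path


def extract_orders(execution_date, records):
    """Hypothetical task body illustrating two pipeline habits:

    - Idempotent: re-running the same execution_date rewrites the same
      partition instead of appending duplicate rows.
    - Atomic: output is built in a temp file and renamed into place, so a
      failed run never leaves a half-written partition behind.
    """
    out_dir = Path("/data/orders") / f"ds={execution_date}"  # one partition per run date
    out_dir.mkdir(parents=True, exist_ok=True)
    final_path = out_dir / "orders.csv"
    tmp_path = out_dir / "orders.csv.tmp"

    with tmp_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["order_id", "amount"])
        writer.writeheader()
        writer.writerows(records)

    os.replace(tmp_path, final_path)  # atomic rename on the same filesystem
    return final_path
```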
It's not easy to extract, refine, and transform raw data into production ML models. It's much worse to keep models updated as data evolves over time. It is much easier to start with the pipeline orchestration in mind. You don't want to be in a position of having to rerun a Jupyter notebook to refresh production ML models.
I'd love to know if you have a better way to train models with Airflow that doesn't take this approach!
Airflow has a world of capability that is not used in this project, and it can be deployed far more reliably than what is offered here; we're using only a small fraction of its capabilities. Importantly, this is by design. This project represents the step between a simple Jupyter EDA and a high-availability, high-throughput enterprise data pipeline.
To use this template, you'll need Docker, Docker Compose, and Git installed.
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/airflow-orchestration-environment.git
  cd airflow-orchestration-environment
  ```
- Build the Docker image:

  ```bash
  docker build -t yourusername/airflow-orchestration-environment .
  ```
- Start the Airflow environment:

  ```bash
  docker-compose up -d
  ```

  This command will start the Airflow web server, scheduler, and all necessary components in separate containers.
- Access the Airflow web interface: open your browser and navigate to http://localhost:8080. You should see the Airflow web interface with no pipelines.
- Create a new Python script in the `dags` folder following the naming convention `your_dag_name.py`.
- Define your data pipeline using Airflow's Directed Acyclic Graph (DAG) and task objects. For example:

  ```python
  from datetime import datetime, timedelta

  from airflow import DAG
  from airflow.operators.python_operator import PythonOperator


  def print_hello():
      print("Hello from your first task!")


  default_args = {
      "owner": "airflow",
      "depends_on_past": False,
      "email_on_failure": False,
      "email_on_retry": False,
      "retries": 1,
      "retry_delay": timedelta(minutes=5),
  }

  dag = DAG(
      "hello_world",
      default_args=default_args,
      description="A simple hello world DAG",
      schedule_interval=timedelta(days=1),
      start_date=datetime(2023, 4, 16),
      catchup=False,
  )

  t1 = PythonOperator(
      task_id="print_hello",
      python_callable=print_hello,
      dag=dag,
  )
  ```
- Save the file, and the new pipeline will automatically appear in the Airflow web interface.
To update an existing pipeline, simply modify the corresponding Python script in the `dags` folder and save your changes. Airflow will automatically update the pipeline in the web interface.
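
For example, to grow the `hello_world` DAG above, you could add a second task to the same file and chain it after the first; Airflow will pick up the change on its next scan of the `dags` folder. The `print_goodbye` task below is a hypothetical addition, just to show the shape of the edit:

```python
# Added to the same dags/ script as the hello_world example above.
def print_goodbye():
    print("Goodbye from your second task!")


t2 = PythonOperator(
    task_id="print_goodbye",
    python_callable=print_goodbye,
    dag=dag,
)

# Run print_hello first, then print_goodbye.
t1 >> t2
```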
To modify the app (add packages, update port numbers, etc.), simply make the changes (protip: make only one change at a time), then restart the app:

```bash
docker-compose down; ./clean_docker.sh; ./build_pipeline_image.sh; docker-compose up
```
We'd like to add tooling that will help monitor and deploy models, such as TensorBoard, Weights & Biases, and perhaps Prometheus and Grafana.