This repository contains resources for creating production-grade ML inference processors. Models are expected to be hosted on the EOTDL as Q2+ models (ONNX models + STAC metadata with the MLM extension). The following features are included:
- CPU/GPU inference
- Docker
- Kubernetes
- Auto-scaling
- Load testing
- Batch & Online processing
- Monitoring & Alerting
- Data drift detection
Future features will include:
- Security & Safety
- Testing
To run the default API with Docker, you can use the following command:
```bash
# cpu
docker run -p 8000:80 -e EOTDL_API_KEY=<eotdl_api_key> earthpulseit/ml-inference

# gpu
docker run --gpus all -p 8000:80 -e EOTDL_API_KEY=<eotdl_api_key> earthpulseit/ml-inference-gpu
```
You can get your EOTDL API key for free by signing up at EOTDL and creating a new token in your profile.
You can also use the sample `k8s` manifests to deploy the API to a Kubernetes cluster.
```bash
kubectl apply -f k8s/deployment.yaml
```
By default, requests to the API are processed sequentially. You can change this behavior by setting the `BATCH_SIZE` and `BATCH_TIMEOUT` environment variables.

- `BATCH_SIZE`: Maximum number of requests to process in a single batch.
- `BATCH_TIMEOUT`: Maximum time (in seconds) to wait before processing an incomplete batch.
```bash
# cpu
docker run -p 8000:80 -e EOTDL_API_KEY=<eotdl_api_key> -e BATCH_SIZE=<batch_size> -e BATCH_TIMEOUT=<batch_timeout> earthpulseit/ml-inference

# gpu
docker run --gpus all -p 8000:80 -e EOTDL_API_KEY=<eotdl_api_key> -e BATCH_SIZE=<batch_size> -e BATCH_TIMEOUT=<batch_timeout> earthpulseit/ml-inference-gpu
```
Batching is particularly useful when requests arrive faster than your hardware can run inference on them one at a time.
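Conceptually, the batching behaves like the following sketch (a simplified illustration, not the repository's actual code): each request is placed on a queue together with a future, and a background worker collects up to `BATCH_SIZE` requests, or whatever has arrived when `BATCH_TIMEOUT` expires, and runs a single forward pass for the whole batch.

```python
import asyncio
import os

# Mirror the environment variables described above (defaults assumed for illustration).
BATCH_SIZE = int(os.getenv("BATCH_SIZE", "1"))
BATCH_TIMEOUT = float(os.getenv("BATCH_TIMEOUT", "0.1"))

queue: asyncio.Queue = asyncio.Queue()


async def batch_worker(run_inference):
    """Group queued requests into batches and resolve each request's future."""
    while True:
        batch = [await queue.get()]  # wait for the first request
        loop = asyncio.get_running_loop()
        deadline = loop.time() + BATCH_TIMEOUT
        # Keep collecting until the batch is full or the timeout expires.
        while len(batch) < BATCH_SIZE:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
            except asyncio.TimeoutError:
                break
        inputs = [payload for payload, _ in batch]
        outputs = run_inference(inputs)  # one forward pass for the whole batch
        for (_, future), output in zip(batch, outputs):
            future.set_result(output)  # hand each result back to its request
```

Each request handler would then put `(payload, future)` on the queue and `await` the future to receive its own result.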
To monitor the API, you can use Prometheus and Grafana. For this, we recommend using the `docker-compose.monitoring.yaml` file or the corresponding `k8s` deployments.
```bash
docker compose -f docker-compose.monitoring.yaml -f docker-compose.cpu.yaml up
```
In Grafana:
- Add Prometheus as a data source (URL: http://prometheus:9090)
- Import dashboard for FastAPI monitoring (you can start with dashboard ID 18739 from Grafana's dashboard marketplace)
This setup will give you:
- Basic metrics like request count, latency, and status codes
- System metrics like CPU and memory usage
- Custom metrics that you can add later
- Visualization and alerting capabilities through Grafana
You can further customize the monitoring by:
- Adding custom metrics in your FastAPI code
- Creating custom Grafana dashboards
- Setting up alerts in Grafana
- Adding more Prometheus exporters for system metrics
Some custom metrics included are:
- `model_counter`: Number of models requested
- `model_error_counter`: Number of errors in model inference
- `model_inference_duration`: Time spent processing inference requests
- `model_inference_batch_size`: Number of images in the batch
- `model_inference_timeout`: Number of inference requests that timed out
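As an illustration of how metrics like these can be defined and updated with the `prometheus_client` library (a sketch only; the repository's actual metric definitions and labels may differ):

```python
from prometheus_client import Counter, Histogram

# Illustrative definitions mirroring the metrics listed above.
model_counter = Counter(
    "model_counter", "Number of models requested", ["model"]
)
model_error_counter = Counter(
    "model_error_counter", "Number of errors in model inference", ["model"]
)
model_inference_duration = Histogram(
    "model_inference_duration", "Time spent processing inference requests", ["model"]
)


def predict(model_name, batch, run_model):
    """Wrap an inference call with the counters and histogram defined above."""
    model_counter.labels(model=model_name).inc()
    with model_inference_duration.labels(model=model_name).time():
        try:
            return run_model(batch)  # run_model is a placeholder for the actual inference call
        except Exception:
            model_error_counter.labels(model=model_name).inc()
            raise
```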
You can set alerts in Grafana for these metrics by going to `Alerting > Alert Rules` in the Grafana dashboard. Some examples are:

- `rate(model_inference_errors_total[1m]) > 10`: Alert if the error rate exceeds 10 errors per second, averaged over the last minute
- `histogram_quantile(0.95, rate(model_inference_duration_seconds_bucket[1m])) > 10`: Alert if the 95th percentile of inference duration exceeds 10 seconds
- You can add more alerts using `PromQL` queries.
TODO: It may be interesting to create a custom dashboard with the included metrics and share it through Grafana's dashboard marketplace.
Alternatively, you can use `AlertManager` to send notifications via email, Slack, etc. (out of the scope of this repository).
Data drift detection is an important monitoring practice in ML systems that helps identify when the statistical properties of your production data differ significantly from the training data. This difference can lead to model performance degradation over time. The main types of drift are:
- Feature drift: Changes in the input data distribution (e.g., image pixel values, color distributions)
- Label drift: Changes in the target variable distribution
- Concept drift: Changes in the relationship between features and target
Common causes include seasonal changes, changes in data collection methods, population shifts, hardware or sensor changes, and data quality issues.
You can use the `DRIFT_DETECTION` environment variable to enable drift detection. This will add a `DriftDetector` for each model. By default, the drift detector monitors the input size over a given number of requests and reports mean values to Prometheus (which can be visualized in Grafana and used to set alerts). Feel free to modify the `src/drift.py` file to monitor other metrics or to implement a different drift detection algorithm.
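For reference, a detector of this kind can be sketched roughly as follows (an illustration only, assuming a Prometheus gauge named `model_input_size_mean`; the actual implementation lives in `src/drift.py`):

```python
from collections import deque

from prometheus_client import Gauge

# Assumed metric name; the repository may track different statistics.
INPUT_SIZE_MEAN = Gauge(
    "model_input_size_mean", "Rolling mean of the input size per model", ["model"]
)


class DriftDetector:
    """Track a rolling mean of the input size and report it to Prometheus."""

    def __init__(self, model_name, window=100):
        self.model_name = model_name
        self.sizes = deque(maxlen=window)  # keep only the last `window` requests

    def update(self, input_size):
        self.sizes.append(input_size)
        mean = sum(self.sizes) / len(self.sizes)
        INPUT_SIZE_MEAN.labels(model=self.model_name).set(mean)
```

A sustained shift of this mean away from the values observed during training is a simple signal that the input distribution is changing.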
This repository offers functionality for creating production-grade APIs to perform inference on ML models.
To develop the API, run:

```bash
# cpu support
docker-compose -f docker-compose.cpu.yaml up

# gpu support
docker-compose -f docker-compose.gpu.yaml up
```
You can try the API with the interactive documentation at http://localhost:8000/docs.
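For scripted access, a request along these lines should work (the endpoint path and payload here are assumptions for illustration; check the interactive docs for the actual route and schema):

```python
import requests

# Hypothetical endpoint and payload; see http://localhost:8000/docs for the real schema.
response = requests.post(
    "http://localhost:8000/predict",  # assumed route
    json={"model": "<eotdl_model_name>", "image": "<image_data_or_url>"},
    timeout=60,
)
response.raise_for_status()
print(response.json())
```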
Build the docker image:
```bash
# cpu
docker build -t <username>/<image-name>:<tag> api

# gpu
docker build -t <username>/<image-name>:<tag> -f api/Dockerfile.gpu api
```

Use your Docker Hub username and a tag for the image.

Push to Docker Hub:

```bash
docker push <username>/<image-name>:<tag>
```

You will need to log in to Docker Hub with your credentials before pushing the image.
You can run the image with:
```bash
# cpu
docker run -p 8000:8000 <username>/<image-name>:<tag>

# gpu
docker run --gpus all -p 8000:8000 <username>/<image-name>:<tag>
```
Start minikube:

```bash
minikube start
```

Add the metrics server if you want to use autoscaling:

```bash
minikube addons enable metrics-server
```

Create a config map with your environment variables:

```bash
kubectl create configmap ml-inference-config --from-env-file=.env
```

Change the image name in the deployment manifest to the one you pushed, then deploy the API to the cluster:

```bash
kubectl apply -f k8s/deployment.yaml
```

Port-forward to access the API:

```bash
kubectl port-forward service/ml-inference-service 8000:80
```

Get the API logs:

```bash
kubectl logs -f deployment/ml-inference
```
You can autoscale your API with the following command:

```bash
kubectl apply -f k8s/hpa.yaml
```

Modify the manifest to fit your needs.

You can test the autoscaling with a load test using `locust`.
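A minimal locustfile for such a load test could look like this (the endpoint and payload are assumptions; adapt them to the API's actual schema):

```python
from locust import HttpUser, task, between


class InferenceUser(HttpUser):
    # Each simulated user waits 0.5-2 seconds between requests.
    wait_time = between(0.5, 2)

    @task
    def predict(self):
        # Hypothetical endpoint and payload; adjust to the real API schema.
        self.client.post("/predict", json={"model": "<eotdl_model_name>"})
```

Run it with `locust -f locustfile.py --host http://localhost:8000`, ramp up users from the Locust web UI, and watch the autoscaler react with `kubectl get hpa -w`.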
You can monitor the API with Prometheus and Grafana.
```bash
kubectl apply -f k8s/prometheus-config.yaml
kubectl apply -f k8s/monitoring.yaml
```

Port-forward to access Prometheus and Grafana:

```bash
kubectl port-forward service/prometheus-service 9090:9090
kubectl port-forward service/grafana-service 3000:3000
```

Connect Prometheus to Grafana using http://prometheus-service:9090 as the Prometheus data source URL.
Minikube GPU support is limited; following https://minikube.sigs.k8s.io/docs/tutorials/nvidia/ does not seem to work.

- Deployment: install the NVIDIA device plugin on the k8s nodes and add `resources: limits: nvidia.com/gpu: 1` to the deployment manifest. You cannot use more GPUs than are available on the nodes.
- Autoscaling: expose GPU usage as a custom metric (Prometheus + node exporter + Prometheus adapter) and use it in the HPA manifest.
To run the API in a cloud Kubernetes cluster, follow the same steps as for minikube, taking the following into account:

- Change the service type to `LoadBalancer` or `ClusterIP` in the service section of the deployment manifest.
- Use a cloud provider that supports GPU nodes (if you want to use the GPU version).
- Use `ingress.yaml` to expose the API to the internet instead of port forwarding.