Awesome AI Infrastructure

A curated list of awesome tools, frameworks, platforms, and resources for building scalable and efficient AI infrastructure, including distributed training, model serving, MLOps, and deployment.

Distributed Training

  • Horovod - A distributed deep learning training framework for TensorFlow, Keras, and PyTorch.
  • Ray - A framework for building scalable distributed applications, including distributed AI and reinforcement learning.
  • PyTorch Distributed - Tools and libraries for distributed training in PyTorch.
  • DeepSpeed - A deep learning optimization library for distributed training and inference, including ZeRO memory optimizations for large models.
  • MPI for Machine Learning - Using the Message Passing Interface (MPI) standard for distributed machine learning.
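
Most of the tools above support data-parallel training, where each worker computes gradients on its own data shard and the workers then average them with an all-reduce. The sketch below simulates that averaging step in a single process with plain Python lists. It does not use the Horovod, PyTorch, or MPI APIs, and the names `allreduce_mean` and `sgd_step` are hypothetical, chosen only for illustration.

```python
def allreduce_mean(per_worker_grads):
    """Average gradients element-wise across workers.

    In a real cluster this is a collective operation; here it is
    simulated in one process. Every worker would receive the same result.
    """
    n_workers = len(per_worker_grads)
    summed = [sum(vals) for vals in zip(*per_worker_grads)]
    return [s / n_workers for s in summed]


def sgd_step(params, grads, lr):
    """Apply one plain SGD update with the averaged gradients."""
    return [p - lr * g for p, g in zip(params, grads)]


# Two simulated workers, each holding gradients for the same two parameters.
worker_grads = [
    [1.0, 2.0],  # gradients from worker 0's data shard
    [3.0, 4.0],  # gradients from worker 1's data shard
]
avg_grads = allreduce_mean(worker_grads)
new_params = sgd_step([10.0, 10.0], avg_grads, lr=0.5)
```

Because every worker applies the same averaged gradient, the model replicas stay identical after each step, which is what lets the frameworks scale by adding workers.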

Model Serving and Deployment

  • TensorFlow Serving - A flexible, high-performance serving system for machine learning models.
  • TorchServe - A model serving framework for PyTorch, providing fast and efficient model deployment.
  • NVIDIA Triton Inference Server - A scalable model serving platform supporting multiple frameworks.
  • ONNX Runtime - A cross-platform, high-performance scoring engine for serving ONNX models.
  • Seldon Core - An open-source platform for deploying and monitoring machine learning models on Kubernetes.
  • KServe (formerly KFServing) - A Kubernetes-based model serving platform that originated in the Kubeflow project.

MLOps and Automation

  • MLflow - An open-source platform for managing the end-to-end machine learning lifecycle.
  • Kubeflow - A platform for orchestrating machine learning workflows on Kubernetes.
  • DVC (Data Version Control) - A tool for version control and reproducibility in machine learning projects.
  • ZenML - An extensible MLOps framework for creating portable, production-ready machine learning pipelines.
  • Apache Airflow - A platform for orchestrating complex workflows, commonly used in machine learning pipelines.
  • Metaflow - A human-centric framework for building and managing real-life data science projects, developed by Netflix.

Data Management

  • Delta Lake - An open-source storage layer that brings reliability to data lakes.
  • Apache Hudi - A data management framework that simplifies incremental data processing and streaming analytics.
  • Feast - An open-source feature store for managing and serving machine learning features.
  • Great Expectations - A tool for data validation and testing in machine learning workflows.
  • lakeFS - An open-source data versioning platform that brings Git-like branches and commits to data lakes.

Optimization Tools

  • NVIDIA TensorRT - A high-performance deep learning inference optimizer and runtime.
  • Apache TVM - A deep learning compiler stack for optimizing models on various hardware backends.
  • Intel OpenVINO - A toolkit for optimizing and deploying AI inference on Intel hardware.
  • OctoML - An AI model optimization platform for efficient deployment on edge and cloud.
  • Quantization Aware Training (QAT) - A technique that simulates low-precision arithmetic during training so that models keep their accuracy after quantization.
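
As a rough illustration of the quantization idea behind several of these tools, the sketch below applies "fake quantization": a value is rounded onto an int8 grid and immediately mapped back to float, so the rounding and clipping error becomes visible. It is a minimal pure-Python sketch, not the API of TensorRT, TVM, or OpenVINO. The function name `fake_quantize` and the chosen `scale` are assumptions made for the example.

```python
def fake_quantize(x, scale, zero_point=0, qmin=-128, qmax=127):
    """Quantize x to a signed 8-bit grid, then dequantize back to float."""
    q = round(x / scale) + zero_point   # map to the integer grid
    q = max(qmin, min(qmax, q))         # clip to the representable range
    return (q - zero_point) * scale     # map back to float


# Hypothetical weights and scale, chosen only for illustration.
weights = [0.50, -1.27, 0.004, 3.0]
scale = 0.01
simulated = [fake_quantize(w, scale) for w in weights]
errors = [abs(w - s) for w, s in zip(weights, simulated)]
```

With this scale, values inside the int8 range are reproduced to within half a step, the small weight `0.004` collapses to zero, and `3.0` is clipped to the largest representable value. Quantization-aware training exposes the model to exactly this kind of error during training.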

Infrastructure as Code

  • Terraform - A tool for building, changing, and versioning infrastructure safely and efficiently.
  • Pulumi - Infrastructure as code for deploying and managing cloud infrastructure using programming languages.
  • Ansible - An open-source automation tool for provisioning and managing infrastructure.
  • AWS CloudFormation - A service for automating AWS resource deployment and management.
  • Google Deployment Manager - An infrastructure management tool for Google Cloud Platform.

Cloud Platforms

  • Amazon SageMaker - A comprehensive platform for building, training, and deploying machine learning models on AWS.
  • Google AI Platform (now Vertex AI) - Google Cloud’s integrated environment for AI development and deployment.
  • Azure Machine Learning - A cloud-based platform for training, deploying, and managing machine learning models.
  • IBM Watson Studio - A suite of tools for data science, machine learning, and AI model development.
  • Paperspace Gradient - A cloud platform for developing, training, and deploying machine learning models.

Learning Resources

Books

  • Machine Learning Engineering by Andriy Burkov - A book on building scalable machine learning infrastructure.
  • Building Machine Learning Powered Applications by Emmanuel Ameisen - A guide to building robust ML applications in production.
  • Designing Data-Intensive Applications by Martin Kleppmann - A comprehensive guide to building scalable and reliable data systems.
  • Introducing MLOps by Mark Treveil and the Dataiku Team - A book on best practices for MLOps and model deployment.
  • Reliable Machine Learning by Cathy Chen et al. - A book on creating resilient machine learning infrastructure.

Community

Contribute

Contributions are welcome!

License

CC0