Replies: 1 comment
-
Great effort. Thank you for creating such an amazing repo and for this detailed introduction. I will start sharing my best practices for TPUv4-8, TPUv4-32, and TPUv4-64 once the 0.1dev release is fixed, so that our examples stay consistent.
-
In this thread, we aim to discuss and document the best practices for sharding configurations on TPUs and GPUs, specifically tailored for libraries like EasyDeL, which leverages sharding methods such as Data Parallel (DP), Fully Sharded Data Parallel (FSDP), Tensor Parallel (TP), and Sequence Parallel (SP). EasyDeL also allows for custom sharding methods to be defined via axis annotations.
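To make the terminology concrete, here is a minimal sketch of how a four-axis (dp, fsdp, tp, sp) device mesh and per-tensor axis annotations can be expressed with plain JAX sharding primitives. This is not EasyDeL's configuration API; the axis sizes, tensor shapes, and PartitionSpec layouts below are illustrative assumptions only.

```python
# Sketch only: plain JAX mesh + PartitionSpec annotations for the dp/fsdp/tp/sp
# axes discussed above. Axis sizes and layouts are made-up examples, not
# EasyDeL defaults.
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

n = jax.device_count()

# Put every available device on the fsdp axis; dp, tp, and sp stay at size 1.
mesh = Mesh(
    mesh_utils.create_device_mesh((1, n, 1, 1)),
    axis_names=("dp", "fsdp", "tp", "sp"),
)

# Weight matrix: rows sharded across fsdp, columns across tp.
hidden = 1024 * mesh.shape["fsdp"] * mesh.shape["tp"]
w = jax.device_put(
    jnp.zeros((hidden, hidden), jnp.float32),
    NamedSharding(mesh, P("fsdp", "tp")),
)

# Activations: batch sharded across dp+fsdp, sequence across sp.
batch = 4 * mesh.shape["dp"] * mesh.shape["fsdp"]
seqlen = 512 * mesh.shape["sp"]
x = jax.device_put(
    jnp.zeros((batch, seqlen, hidden), jnp.float32),
    NamedSharding(mesh, P(("dp", "fsdp"), "sp", None)),
)

print(w.sharding)
print(x.sharding)
```

The same axis names can then be reused inside a jitted train step (for example via jax.lax.with_sharding_constraint) so the compiler keeps intermediates laid out the way the configuration intends.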
Importance of Sharding Configurations
Sharding configurations can significantly impact training speed, efficiency, and model performance, especially on hardware like TPUs and GPUs. By fine-tuning the sharding strategies, developers and researchers can maximize throughput, reduce memory overhead, and balance workloads across multiple devices.
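As a rough illustration of how the memory side of this trade-off can be quantified before committing to a layout, the hedged sketch below places a single parameter tensor under two different mesh layouts (FSDP-heavy vs. TP-heavy) and reports the resulting per-device shard size. The tensor shape, mesh splits, and helper name are invented for illustration.

```python
# Sketch: compare per-device bytes for one tensor under two mesh layouts.
# Shapes, mesh splits, and the helper name are illustrative assumptions.
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P


def shard_bytes_per_device(mesh_shape, spec, array_shape):
    """Place a zero tensor under the given mesh/spec and return one shard's size in bytes."""
    mesh = Mesh(
        mesh_utils.create_device_mesh(mesh_shape),
        axis_names=("dp", "fsdp", "tp", "sp"),
    )
    arr = jax.device_put(
        jnp.zeros(array_shape, jnp.float32),
        NamedSharding(mesh, spec),
    )
    shard = arr.addressable_shards[0].data
    return shard.size * shard.dtype.itemsize


n = jax.device_count()
shape = (1024 * n, 1024 * n)  # divisible along either axis by the device count
print("fsdp-heavy:", shard_bytes_per_device((1, n, 1, 1), P("fsdp", None), shape))
print("tp-heavy:  ", shard_bytes_per_device((1, 1, n, 1), P(None, "tp"), shape))
```

Both layouts cut parameter memory by the same factor here; the real difference shows up in communication cost (all-gathers for FSDP versus collectives inside the matmuls for TP), which is exactly the kind of measurement worth sharing in this thread. jax.debug.visualize_array_sharding is also handy for eyeballing a layout.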
Sharding Methods Overview
TPU and GPU Sharding Observations
Below are some insights into how different sharding methods perform on TPUs and GPUs based on experimentation and metrics:
TPU Observations
GPU Observations
Best Practices
TPU Specific:
GPU Specific:
Proposed Next Steps
Feel free to share your configurations, metrics, and observations!