Implementing Pipeline Parallelism with LLaMA Models and Utilizing deepspeed for Execution? #7
Comments
Hi, thanks for your interest. As for why I do not use the wrappers: the wrappers implemented in the ChatGLM repo require the original model as a parameter to initialize the pipe layers. I understand that this way you do not need to first convert the HF weights into the DeepSpeed format, but you then have to load the complete weights on every rank, which will cause OOM (in CPU memory; imagine loading a 70B model with 8 processes, you end up with 8 copies). My implementation instead works like this: (1) use the pre-trained config to initialize each specific layer, (2) load the corresponding weights from disk (this is managed by DeepSpeed), so that only one complete copy of the weights is needed. For the launch, yes, simply call …
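In code, the config-first construction looks roughly like the sketch below. The layer class and sizes are simplified stand-ins, not the actual wrappers from llama_ds_mp_wrap.py; the point is only that LayerSpec builds a layer from its constructor arguments instead of from an already-loaded model.

```python
# Minimal sketch, run under the deepspeed launcher (e.g. `deepspeed --num_gpus 2 this_script.py`).
# BlockPipe is a simplified stand-in; the real wrappers (embedding, decoder blocks,
# norm, lm_head) also handle attention masks, rotary embeddings, etc.
import deepspeed
import torch
import torch.nn as nn
from deepspeed.pipe import PipelineModule, LayerSpec


class BlockPipe(nn.Module):
    """Stand-in for one transformer block, built from config values only."""

    def __init__(self, hidden_size):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states):
        return torch.relu(self.proj(hidden_states))


deepspeed.init_distributed()

hidden_size, num_layers = 4096, 32

# LayerSpec defers construction: each rank only instantiates the layers assigned to
# its own pipeline stage, so no rank ever materializes the full model in CPU memory;
# in this approach the matching weight shard for that stage is then loaded from disk.
layer_specs = [LayerSpec(BlockPipe, hidden_size) for _ in range(num_layers)]

model = PipelineModule(layers=layer_specs, num_stages=2)
```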
Thank you for your response. The ChatGLM wrapping implementation is equivalent to loading all parameters first (each process loads the full set of parameters into memory) and then allocating them to the GPUs (e.g., layers 0-3 to GPU0 and layers 4-7 to GPU1). This approach may be fine for smaller models, but for particularly large models (70B) one should use your approach, which first configures which layers need which parameters and then has DeepSpeed load the corresponding parameters from disk directly into the corresponding layers (i.e., into GPU memory). Is my understanding correct?
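In code, what I mean by the wrapper-style approach is roughly this (illustrative only, not the actual ChatGLM code, and ignoring the extra inputs the decoder layers need in a real pipeline):

```python
# Rough illustration of the wrapper-style construction: the full model is built on
# every rank first, then its modules are handed to the pipeline.
import deepspeed
from deepspeed.pipe import PipelineModule
from transformers import AutoModelForCausalLM

deepspeed.init_distributed()  # run under the deepspeed launcher

# Every rank executes this line, so with 8 processes there are 8 full CPU copies
# of the weights before any partitioning happens -- the OOM risk mentioned above.
full_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# The already-built modules are then passed as pipeline layers, and DeepSpeed places,
# e.g., the first half of the blocks on stage 0 and the second half on stage 1.
layers = [
    full_model.model.embed_tokens,
    *full_model.model.layers,
    full_model.model.norm,
    full_model.lm_head,
]
model = PipelineModule(layers=layers, num_stages=2)
```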
Yes. The two implementations should not differ much. But I remember that I also ran into some problems while using his implementation; I raised an issue about it previously. I think it may be because of the different internal implementations of LLaMA and ChatGLM. For simplicity, maybe you can just use my approach and only change the …
Hi,
I'm currently experimenting with fine-tuning some small LLaMA models (LLaMA2-7B) and I'm interested in using pipeline parallelism. However, there are few examples available for reference. I also looked into the ChatGLM pipeline-parallelism repository that you listed (chatglm finetuning) and tried to implement the pipeline layers similarly to ChatGLM3, but unfortunately I ran into several failures.
I noticed that you have a set of wrappers in llama_ds_mp_wrap.py that resemble the reference implementation, but it seems they are not used in the end. Is this approach not feasible?
Consequently, I turned back to LLaMA and attempted to put together a runnable demo with pipeline parallelism. I would like to know whether using the deepspeed command to run trainer_base_ds_mp.py is the correct way to execute this code.
Thanks in advance.