LLM optimization documentation fixes and updates. #28212
Conversation
@kblaszczak-intel, @tsavina, can you take a look?
- For instance the 7 billion parameter Llama 2 model can be reduced
- from about 25GB to 4GB using 4-bit weight compression.
+ For instance the 8 billion parameter Llama 3 model can be reduced
+ from about 16.1 GB to 4.8 GB using 4-bit weight quantization on top of bfloat16 model.
Suggested change:
- from about 16.1 GB to 4.8 GB using 4-bit weight quantization on top of bfloat16 model.
+ from about 16.1 GB to 4.8 GB using 4-bit weight quantization on top of a bfloat16 model.
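The figures in the updated text can be sanity-checked with quick arithmetic. This is a sketch, not part of the PR; it assumes decimal GB and attributes the gap between the raw 4.0 GB and the quoted 4.8 GB to quantization overhead (per-group scales/zero points and layers kept at higher precision).

```python
# Back-of-the-envelope check of the sizes quoted for Llama 3 8B.
params = 8e9  # 8 billion parameters

bf16_gb = params * 2 / 1e9    # bfloat16 = 2 bytes per parameter
int4_gb = params * 0.5 / 1e9  # 4-bit = 0.5 bytes per parameter

print(f"bfloat16: ~{bf16_gb:.1f} GB")  # ~16.0 GB, close to the quoted 16.1 GB
print(f"int4:     ~{int4_gb:.1f} GB")  # ~4.0 GB before quantization overhead
```

The remaining ~0.8 GB in the quoted 4.8 GB figure is consistent with per-group quantization metadata and a fraction of layers left uncompressed.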
compression may result in more accuracy reduction than with larger models.
Therefore, weight compression is recommended for use with LLMs only.
LLMs and other GenAI models that require
Suggested change:
- LLMs and other GenAI models that require
+ LLMs and other generative AI models that require
NNCF allows stacking the supported optimization methods. For example, AWQ, Scale Estimation
and GPTQ methods can be enabled all together to achieve better accuracy.
Suggested change:
- NNCF allows stacking the supported optimization methods. For example, AWQ, Scale Estimation
- and GPTQ methods can be enabled all together to achieve better accuracy.
+ NNCF enables you to stack the supported optimization methods. For example, AWQ, Scale Estimation
+ and GPTQ methods may be enabled all together to achieve better accuracy.
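The stacking described above can be sketched in code. This is an illustrative sketch, not part of the PR: the `nncf.compress_weights` parameter names (`awq`, `scale_estimation`, `gptq`, `mode`, `ratio`, `group_size`, `dataset`) follow the NNCF API as the author understands it and should be verified against your installed NNCF version; `ov_model` and `calibration_dataset` are placeholder inputs.

```python
def compress_llm_weights(ov_model, calibration_dataset):
    """Sketch: stack AWQ, Scale Estimation and GPTQ on top of 4-bit
    weight compression. Assumes the nncf.compress_weights keyword
    arguments shown below; check your NNCF version before use."""
    import nncf  # imported lazily so the sketch reads without NNCF installed

    return nncf.compress_weights(
        ov_model,
        mode=nncf.CompressWeightsMode.INT4_SYM,  # 4-bit symmetric weights
        ratio=1.0,                 # compress all eligible layers to 4-bit
        group_size=128,            # per-group quantization granularity
        awq=True,                  # activation-aware weight rescaling
        scale_estimation=True,     # refine scales on calibration data
        gptq=True,                 # GPTQ error-compensating weight updates
        dataset=calibration_dataset,  # required by the data-aware methods
    )
```

All three data-aware methods share the same calibration dataset, which is why they compose: each refines the same 4-bit quantization rather than re-quantizing the model.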
No description provided.