[wip] Entmax loss #3
Open
bpopeters wants to merge 26 commits into main from entmax-loss
Conversation
This pull request adds support for entmax loss for training GPT models. The loss is selected through the `--loss_function` argument, which supports the following values: 'cross_entropy' (the default), 'entmax15', 'sparsemax', and 'entmax_bisect'. 'entmax15' and 'sparsemax' use an additional `--entmax-topk` argument, which sensibly defaults to 512. If using 'entmax_bisect', the alpha can be specified with `--entmax-alpha` (defaulting to 1.5) and the number of bisection iterations with `--entmax-n-iter` (defaulting to 30). Note that these flags work only for GPT models without pipeline parallelism (supporting other models should be easy, although I doubt anyone is interested right now; I don't know what would be required for pipeline parallelism).

I've run some quick tests with entmax15 on artemis with a very small (3-layer, 128-dim) model on {1, 2, 4} GPUs. Performance (training speed) is quite a bit worse than with cross entropy, but I believe this is (at least partially) an artifact of how small the model was: the output layer and loss computation probably dominate the runtime in a way that they would not with a more reasonably sized model. However, my attempts to train bigger models have been unsuccessful because memory usage is shockingly high (not just with entmax loss, but also with cross entropy).
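For reference, here is a rough sketch of how these arguments could be registered in the Megatron-style argument parser. The function name, group title, and help strings are illustrative assumptions; only the flag names and defaults are the ones described above.

```python
import argparse

# Illustrative sketch only: the function name, group title, and help strings
# are assumptions. The flag names and defaults match the PR description.
def _add_loss_args(parser):
    group = parser.add_argument_group(title='loss function')
    group.add_argument('--loss_function', type=str, default='cross_entropy',
                       choices=['cross_entropy', 'entmax15', 'sparsemax', 'entmax_bisect'],
                       help='Loss used for GPT training.')
    group.add_argument('--entmax-topk', type=int, default=512,
                       help='Top-k truncation used by the entmax15/sparsemax losses.')
    group.add_argument('--entmax-alpha', type=float, default=1.5,
                       help='Alpha value for entmax_bisect.')
    group.add_argument('--entmax-n-iter', type=int, default=30,
                       help='Number of bisection iterations for entmax_bisect.')
    return parser

# Example: parse the flags described in this PR.
args = _add_loss_args(argparse.ArgumentParser()).parse_args(
    ['--loss_function', 'entmax15', '--entmax-topk', '512'])
```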
Note also that entmax loss does not currently support sequence-parallel loss computation. I'm not sure if this is relevant for our case (meaning, scaling up to 1B parameter models). However, it shouldn't be difficult to implement if we need to.
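For context, a minimal sketch of how the selected loss might be turned into a callable over (logits, targets). It assumes the `entmax` package's loss modules (`Entmax15Loss`, `SparsemaxLoss`, `EntmaxBisectLoss`); the constructor keywords shown are assumptions about that API and are not necessarily how this PR wires things up.

```python
import torch.nn.functional as F

def build_loss_fn(args):
    """Map --loss_function to a callable taking (logits, targets).

    Sketch only: assumes the `entmax` package's loss modules; the keyword
    arguments below (k, alpha, n_iter) are assumptions about their API.
    """
    if args.loss_function == 'cross_entropy':
        return F.cross_entropy
    from entmax import Entmax15Loss, SparsemaxLoss, EntmaxBisectLoss
    if args.loss_function == 'entmax15':
        return Entmax15Loss(k=args.entmax_topk)
    if args.loss_function == 'sparsemax':
        return SparsemaxLoss(k=args.entmax_topk)
    if args.loss_function == 'entmax_bisect':
        return EntmaxBisectLoss(alpha=args.entmax_alpha, n_iter=args.entmax_n_iter)
    raise ValueError(f'unknown loss function: {args.loss_function}')
```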
Before merging, we should probably think more about these performance issues.