When using `s3tokenizer` with PyTorch's `DistributedSampler` in a distributed inference setup, the output files contain duplicate keys, and the total number of keys is always a multiple of `world_size`.

This happens because `DistributedSampler` by default (`drop_last=False`) pads the dataset by repeating samples so that the total size is evenly divisible by `world_size`. The padding is visible in the `DistributedSampler` source, where extra indices are appended whenever `len(indices) % num_replicas != 0`, so certain samples appear on more than one rank.
```python
if not self.drop_last:
    # add extra samples to make it evenly divisible
    padding_size = self.total_size - len(indices)
    if padding_size <= len(indices):
        indices += indices[:padding_size]
    else:
        indices += (indices * math.ceil(padding_size / len(indices)))[
            :padding_size
        ]
```