Output Contains Duplicate Keys When Using DistributedSampler in Distributed Inference #13

yfyeung · 2024-11-05T09:30:36Z

When using s3tokenizer with PyTorch's DistributedSampler in a distributed inference setup, the output files contain duplicate keys, and the total number of keys is always a multiple of world_size.

This happens because PyTorch's DistributedSampler, with the default drop_last=False, pads the index list by repeating samples so that the total number of samples is evenly divisible by world_size (num_replicas). The padding is done in the DistributedSampler source code: when len(indices) % num_replicas != 0, indices from the start of the list are appended again, so those samples are assigned to more than one rank and their keys are written more than once:

        if not self.drop_last:
            # add extra samples to make it evenly divisible
            padding_size = self.total_size - len(indices)
            if padding_size <= len(indices):
                indices += indices[:padding_size]
            else:
                indices += (indices * math.ceil(padding_size / len(indices)))[
                    :padding_size
                ]
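To make the effect concrete, here is a minimal pure-Python sketch that mimics the index assignment above for the shuffle=False, drop_last=False case. The helper name `padded_rank_indices` and the sizes (10 samples, 4 ranks) are made up for illustration; it is not the PyTorch implementation itself, only a reproduction of the padding logic quoted above.

```python
import math
from collections import Counter


def padded_rank_indices(dataset_len, num_replicas, rank):
    """Mimic DistributedSampler's per-rank indices (shuffle=False, drop_last=False).

    Hypothetical helper for illustration only.
    """
    indices = list(range(dataset_len))
    # Each rank gets ceil(N / world_size) samples, so the total is padded up.
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas

    # Same padding logic as the snippet quoted above.
    padding_size = total_size - len(indices)
    if padding_size <= len(indices):
        indices += indices[:padding_size]
    else:
        indices += (indices * math.ceil(padding_size / len(indices)))[:padding_size]

    # Each rank takes a strided slice of the padded list.
    return indices[rank:total_size:num_replicas]


world_size = 4
all_indices = [
    i for r in range(world_size) for i in padded_rank_indices(10, world_size, r)
]
duplicates = sorted(i for i, c in Counter(all_indices).items() if c > 1)
print(len(all_indices), duplicates)  # 12 [0, 1]
```

With 10 samples on 4 ranks, 12 indices are handed out in total and samples 0 and 1 are each processed twice, which matches the symptom: the key count is a multiple of world_size and some keys are duplicated.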

xingchensong · 2024-11-05T13:55:27Z

A PR is welcome.
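One possible direction for such a fix, sketched under assumptions: shard the indices across ranks without padding, so every sample is assigned to exactly one rank. The class name `UnpaddedShardSampler` is hypothetical and is not part of s3tokenizer or PyTorch. It is written as a plain iterable of indices, on the assumption that it is passed as the `sampler` argument of `torch.utils.data.DataLoader`, which is documented to accept a Sampler or an iterable.

```python
class UnpaddedShardSampler:
    """Hypothetical sampler: strided sharding without padding.

    Rank r yields indices r, r + world_size, r + 2 * world_size, ...
    Shards may differ in length by one, but no index is repeated.
    """

    def __init__(self, dataset_len, num_replicas, rank):
        if not 0 <= rank < num_replicas:
            raise ValueError(f"rank must be in [0, {num_replicas}), got {rank}")
        self.dataset_len = dataset_len
        self.num_replicas = num_replicas
        self.rank = rank

    def _shard(self):
        return range(self.rank, self.dataset_len, self.num_replicas)

    def __iter__(self):
        return iter(self._shard())

    def __len__(self):
        return len(self._shard())


# 10 samples on 4 ranks: every index appears exactly once across all shards.
shards = [list(UnpaddedShardSampler(10, 4, r)) for r in range(4)]
covered = sorted(i for shard in shards for i in shard)
print(shards)
print(covered == list(range(10)))
```

Because the shards can have unequal lengths, ranks may run a different number of batches. That is usually fine for inference where each rank writes its own output, but it can stall if the loop contains collective operations that expect every rank to take the same number of steps. An alternative that leaves the sampler untouched is to drop repeated keys when the per-rank output files are merged.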
