ContrastiveDataset iterator and ContrastiveDistillationDataset bug. #578

Open
DemirTonchev opened this issue Dec 21, 2024 · 0 comments · May be fixed by #579

DemirTonchev commented Dec 21, 2024

(updated)

ContrastiveDataset

ContrastiveDataset by default generates and builds all of the positive and negative examples internally in lists. Then, in trainer.py,

dataset = Dataset.from_list(list(data_sampler))

the from_list method is used to create the training dataset. This is fine with smaller datasets, which I recognize is the whole point of the SetFit method, yet if you want to build a bigger dataset for whatever reason, this can exhaust your RAM (as I have experienced). It would be best if the dataset could be built from a generator with Dataset.from_generator. While the current __iter__ method is correct, the internal data it uses is eagerly generated in __init__. It would be best if the generator were truly "lazy".
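The lazy approach I am suggesting could look roughly like the following. This is a minimal sketch, not the actual setfit code: iter_pairs is a hypothetical helper, and the labeling rule (1.0 for same-label pairs, 0.0 otherwise) is only illustrative. The point is that nothing is materialized until the generator is consumed, so Dataset.from_generator could stream pairs instead of holding them all in lists.

```python
from itertools import combinations


def iter_pairs(sentences, labels):
    """Lazily yield sentence pairs one at a time (hypothetical sketch).

    Unlike building lists in __init__, no pair exists until the
    consumer asks for it, so memory stays flat regardless of how
    many pairs the combinations produce.
    """
    for (s1, l1), (s2, l2) in combinations(zip(sentences, labels), 2):
        yield {
            "sentence_1": s1,
            "sentence_2": s2,
            "label": 1.0 if l1 == l2 else 0.0,
        }


# The trainer could then use something like
#   Dataset.from_generator(lambda: iter_pairs(sentences, labels))
# instead of Dataset.from_list(list(data_sampler)).
pairs = iter_pairs(["a", "b", "c"], [0, 0, 1])
first = next(pairs)  # only now is the first pair actually built
```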

Another question I am not sure about: what is the reason that replacement is set to True by default? Although there are some cases where this makes sense, the matrix picture in the sampling guide shows the diagonal as empty, i.e. I would not expect the library to do the opposite by default (https://huggingface.co/docs/setfit/v1.1.0/en/conceptual_guides/sampling_strategies). There is also no way to control this.

def shuffle_combinations(iterable: Iterable, replacement: bool = True) -> Generator:

With this setting we get pairs such as
("The movie was awesome", "The movie was awesome") as a positive pair in the training process.
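The effect of the replacement flag can be illustrated with itertools (a minimal stand-in, not the library's actual shuffle_combinations implementation): with replacement the diagonal of the pair matrix is included, so every sentence is paired with itself.

```python
from itertools import combinations, combinations_with_replacement

sentences = ["The movie was awesome", "Great film", "Loved it"]

# replacement=True (the current default) includes the diagonal,
# producing identical-sentence pairs like the one quoted above:
with_repl = list(combinations_with_replacement(sentences, 2))

# replacement=False would match the empty diagonal shown in the
# sampling guide's matrix picture:
without_repl = list(combinations(sentences, 2))
```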

ContrastiveDistillationDataset

ContrastiveDistillationDataset inherits from ContrastiveDataset but sets its own

self.sentence_labels = list(enumerate(self.sentences))

Looking at the code, you would expect the pairs to be generated using this self.sentence_labels, but what actually happens is that super().__init__ runs first, and there it uses the sentence_labels from ContrastiveDataset

self.sentence_labels = list(zip(self.sentences, self.labels))

and this results in the pairs being created by the function defined in ContrastiveDistillationDataset
def generate_pairs(self) -> None:

but with the sentence_labels defined by ContrastiveDataset.
As a consequence, every generated pair selects the first-row, first-column element of the cos_sim_matrix as its label.
This is what you would expect by looking at the code:

{"sentence_1": text_one, "sentence_2": text_two, "label": self.cos_sim_matrix[id_one][id_two]}

But because the __init__ method runs with the sentence_labels generated by the parent, you always get:

{"sentence_1": text_one, "sentence_2": text_two, "label": self.cos_sim_matrix[0][0]}

because the parent sets the label of every sentence to 0. Only after __init__ do you set self.sentence_labels = list(enumerate(self.sentences)), but this is never used since the pairs have already been generated. It is a very nasty and easily overlooked bug, and it unfortunately makes ContrastiveDistillationDataset unusable.
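The initialization-order trap can be reproduced with a stripped-down pair of classes (hypothetical names that only mirror the structure of the report, not the actual setfit code): the parent's __init__ generates the pairs before the child's reassignment of sentence_labels ever runs.

```python
class Parent:
    """Stands in for ContrastiveDataset in this sketch."""

    def __init__(self, sentences, labels):
        self.sentences = sentences
        self.labels = labels
        # The parent builds sentence_labels and immediately
        # generates the pairs from them...
        self.sentence_labels = list(zip(self.sentences, self.labels))
        self.pairs = self.generate_pairs()

    def generate_pairs(self):
        return [(a, b) for a, b in self.sentence_labels]


class Child(Parent):
    """Stands in for ContrastiveDistillationDataset in this sketch."""

    def __init__(self, sentences, labels):
        super().__init__(sentences, labels)  # pairs are generated in here
        # ...so this reassignment comes too late to affect self.pairs:
        self.sentence_labels = list(enumerate(self.sentences))


c = Child(["a", "b"], [0, 0])
# c.pairs was built from the parent's zip(sentences, labels),
# not from the child's enumerate(sentences).
```

A common fix for this pattern is to either delay pair generation until after the subclass has finished its own setup, or to set the subclass-specific attributes before calling super().__init__.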

@DemirTonchev DemirTonchev changed the title ContrastiveDataset generates pair of same sentence by default. ContrastiveDataset iterator and ContrastiveDistillationDataset bug. Dec 25, 2024