ContrastiveDataset iterator and ContrastiveDistillationDataset bug. #578

Open
DemirTonchev opened this issue Dec 21, 2024 · 0 comments · May be fixed by #579

DemirTonchev commented Dec 21, 2024

(updated)

ContrastiveDataset

ContrastiveDataset by default generates and builds all of the positive and negative examples internally in lists. Then, in trainer.py,

dataset = Dataset.from_list(list(data_sampler))

the from_list method is used to create the training dataset. This is fine with smaller datasets, which I recognize is the whole point of the SetFit method, yet if you want to build a bigger dataset for whatever reason, this can exhaust your RAM (as I have experienced). It would be best if the dataset could be built from a generator with Dataset.from_generator. While the current __iter__ method is correct, the internal data it uses is eagerly generated in __init__. It would be best if the generator were truly "lazy".
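The lazy approach I am suggesting could look roughly like the following. This is a minimal sketch, not the actual setfit code: iter_pairs is a hypothetical helper, and the labeling rule (1.0 for same-label pairs, 0.0 otherwise) is only illustrative. The point is that nothing is materialized until the generator is consumed, so Dataset.from_generator could stream pairs instead of holding them all in lists.

```python
from itertools import combinations


def iter_pairs(sentences, labels):
    """Lazily yield sentence pairs one at a time (hypothetical sketch).

    Unlike building lists in __init__, no pair exists until the
    consumer asks for it, so memory stays flat regardless of how
    many pairs the combinations produce.
    """
    for (s1, l1), (s2, l2) in combinations(zip(sentences, labels), 2):
        yield {
            "sentence_1": s1,
            "sentence_2": s2,
            "label": 1.0 if l1 == l2 else 0.0,
        }


# The trainer could then use something like
#   Dataset.from_generator(lambda: iter_pairs(sentences, labels))
# instead of Dataset.from_list(list(data_sampler)).
pairs = iter_pairs(["a", "b", "c"], [0, 0, 1])
first = next(pairs)  # only now is the first pair actually built
```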

Another question I am not sure about: what is the reason that replacement is set to True by default? Although there are some cases where this makes sense, the matrix picture in the sampling guide shows the diagonal as empty, i.e. I would not expect the library to do the opposite by default (https://huggingface.co/docs/setfit/v1.1.0/en/conceptual_guides/sampling_strategies). There is also no way to control this.

def shuffle_combinations(iterable: Iterable, replacement: bool = True) -> Generator:

With this setting we get pairs such as
("The movie was awesome", "The movie was awesome") as a positive pair in the training process.
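The effect of the replacement flag can be illustrated with itertools (a minimal stand-in, not the library's actual shuffle_combinations implementation): with replacement the diagonal of the pair matrix is included, so every sentence is paired with itself.

```python
from itertools import combinations, combinations_with_replacement

sentences = ["The movie was awesome", "Great film", "Loved it"]

# replacement=True (the current default) includes the diagonal,
# producing identical-sentence pairs like the one quoted above:
with_repl = list(combinations_with_replacement(sentences, 2))

# replacement=False would match the empty diagonal shown in the
# sampling guide's matrix picture:
without_repl = list(combinations(sentences, 2))
```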

ContrastiveDistillationDataset

ContrastiveDistillationDataset inherits from ContrastiveDataset but sets its own

self.sentence_labels = list(enumerate(self.sentences))

Looking at the code, you would expect the pairs to be generated using this self.sentence_labels, but what actually happens is that super().__init__ runs first, and there it uses the sentence_labels from ContrastiveDataset

self.sentence_labels = list(zip(self.sentences, self.labels))

and this results in the pairs being created by the function defined in ContrastiveDistillationDataset
def generate_pairs(self) -> None:

but with the sentence_labels defined by ContrastiveDataset.
As a consequence, every generated pair selects the first-row, first-column element of the cos_sim_matrix as its label.
This is what you would expect by looking at the code:

{"sentence_1": text_one, "sentence_2": text_two, "label": self.cos_sim_matrix[id_one][id_two]}

But because the __init__ method runs with the sentence_labels generated by the parent, you always get:

{"sentence_1": text_one, "sentence_2": text_two, "label": self.cos_sim_matrix[0][0]}

because the parent sets the label of every sentence to 0. Only after __init__ do you set self.sentence_labels = list(enumerate(self.sentences)), but this is never used since the pairs have already been generated. It is a very nasty and easily overlooked bug, and it unfortunately makes ContrastiveDistillationDataset unusable.
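The initialization-order trap can be reproduced with a stripped-down pair of classes (hypothetical names that only mirror the structure of the report, not the actual setfit code): the parent's __init__ generates the pairs before the child's reassignment of sentence_labels ever runs.

```python
class Parent:
    """Stands in for ContrastiveDataset in this sketch."""

    def __init__(self, sentences, labels):
        self.sentences = sentences
        self.labels = labels
        # The parent builds sentence_labels and immediately
        # generates the pairs from them...
        self.sentence_labels = list(zip(self.sentences, self.labels))
        self.pairs = self.generate_pairs()

    def generate_pairs(self):
        return [(a, b) for a, b in self.sentence_labels]


class Child(Parent):
    """Stands in for ContrastiveDistillationDataset in this sketch."""

    def __init__(self, sentences, labels):
        super().__init__(sentences, labels)  # pairs are generated in here
        # ...so this reassignment comes too late to affect self.pairs:
        self.sentence_labels = list(enumerate(self.sentences))


c = Child(["a", "b"], [0, 0])
# c.pairs was built from the parent's zip(sentences, labels),
# not from the child's enumerate(sentences).
```

A common fix for this pattern is to either delay pair generation until after the subclass has finished its own setup, or to set the subclass-specific attributes before calling super().__init__.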

@DemirTonchev DemirTonchev changed the title ContrastiveDataset generates pair of same sentence by default. ContrastiveDataset iterator and ContrastiveDistillationDataset bug. Dec 25, 2024