DemirTonchev changed the title from "ContrastiveDataset generates pair of same sentence by default." to "ContrastiveDataset iterator and ContrastiveDistillationDataset bug." on Dec 25, 2024 (updated)
ContrastiveDataset
ContrastiveDataset eagerly builds all positive and negative pairs and stores them in internal lists. Then, in trainer.py
setfit/src/setfit/trainer.py
Line 605 in 146c7c9
the from_list method is used to create the training dataset. That is fine for small datasets, which I recognize is the whole point of the SetFit method, but if you want to build a bigger dataset for whatever reason, this can exhaust your RAM (as I have experienced). It would be better to build the dataset from a generator with Dataset.from_generator. The current __iter__ method is correct, but the internal data it iterates over is generated eagerly in __init__. It would be best if the generator were truly "lazy".

Another question: why is replacement set to True by default? There are some cases where this makes sense, but the matrix picture in the sampling guide shows the diagonal as empty, so I would not expect the library to do the opposite by default (https://huggingface.co/docs/setfit/v1.1.0/en/conceptual_guides/sampling_strategies). There is also no way to control it.
setfit/src/setfit/sampler.py
Line 15 in 146c7c9
With this setting we get pairs such as
("The movie was awesome", "The movie was awesome"), i.e. a sentence paired with itself, as a positive pair in the training process.
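To make both requests above concrete, here is a minimal sketch of what a lazy pair generator with a controllable replacement flag could look like. It is not setfit's code: iter_positive_pairs and pair_examples are hypothetical helpers, the column names "sentence_1", "sentence_2" and "label" are assumptions, and only positive pairs are covered to keep it short.

```python
from itertools import combinations, combinations_with_replacement
from typing import Dict, Iterator, List, Tuple, Union


def iter_positive_pairs(
    sentences: List[str], labels: List[int], replacement: bool = False
) -> Iterator[Tuple[str, str, float]]:
    """Lazily yield (sentence_a, sentence_b, 1.0) for sentences sharing a label.

    Hypothetical helper, not part of setfit. Nothing is materialized up front:
    index pairs are produced one at a time by itertools.
    """
    # With replacement, the diagonal (i == j) is included, so a sentence can
    # be paired with itself. Without it, only i < j is visited.
    index_pairs = combinations_with_replacement if replacement else combinations
    for i, j in index_pairs(range(len(sentences)), 2):
        if labels[i] == labels[j]:
            yield sentences[i], sentences[j], 1.0


def pair_examples(
    sentences: List[str], labels: List[int], replacement: bool = False
) -> Iterator[Dict[str, Union[str, float]]]:
    """Yield dict rows, the shape a generator-backed dataset would consume."""
    for first, second, label in iter_positive_pairs(sentences, labels, replacement):
        # Column names are assumptions made for this sketch.
        yield {"sentence_1": first, "sentence_2": second, "label": label}


sentences = ["The movie was awesome", "Great film", "Terrible plot"]
labels = [1, 1, 0]

without_diagonal = list(iter_positive_pairs(sentences, labels, replacement=False))
with_diagonal = list(iter_positive_pairs(sentences, labels, replacement=True))

# With the datasets library installed, the rows could be streamed with:
# Dataset.from_generator(
#     pair_examples, gen_kwargs={"sentences": sentences, "labels": labels}
# )
```

In this toy run, without_diagonal holds only the ("The movie was awesome", "Great film") pair, while with_diagonal also contains every sentence paired with itself, which is the behaviour questioned above.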
ContrastiveDistillationDataset
ContrastiveDistillationDataset inherits from ContrastiveDataset but sets its own sentence_labels:
setfit/src/setfit/sampler.py
Line 170 in 146c7c9
Looking at the code, you would expect the pairs to be generated using this self.sentence_labels. What actually happens is that super().__init__ runs first, and there it uses the sentence_labels from ContrastiveDataset:
setfit/src/setfit/sampler.py
Line 63 in 146c7c9
So the pairs are created by the function defined in ContrastiveDistillationDataset
setfit/src/setfit/sampler.py
Line 178 in 146c7c9
but with the sentence_labels defined by ContrastiveDataset.
Looking at the code, you expect each pair's label to be read from cos_matrix at the positions of the two sentences in the pair. But because the init method runs with the sentence_labels generated by the parent, every pair gets the first-row, first-column element of cos_matrix as its label, because the labels in the parent are set to 0 for each sentence. Only after the init do you set
self.sentence_labels = list(enumerate(self.sentences))
but this is never used, as the pairs have already been generated. It is a very nasty and easily overlooked bug, and unfortunately it makes ContrastiveDistillationDataset unusable.
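The initialization-order problem can be reproduced with a small toy example. This is not the setfit source: PairDataset, BuggyDistillationDataset and FixedDistillationDataset are hypothetical stand-ins that only mirror the pattern described above, and the fix shown (handing the row indices to the parent constructor) is one possible option, not necessarily the one the maintainers would choose.

```python
from itertools import combinations
from typing import List, Tuple

Pair = Tuple[str, str, float]


class PairDataset:
    """Toy parent: builds its pairs inside __init__, like the parent class above."""

    def __init__(self, sentences: List[str], labels: List[int]) -> None:
        self.sentences = sentences
        self.sentence_labels = list(zip(sentences, labels))
        # Dynamic dispatch: this calls the subclass override if there is one.
        self.pairs = self.generate_pairs()

    def generate_pairs(self) -> List[Pair]:
        return [
            (a, b, float(label_a == label_b))
            for (a, label_a), (b, label_b) in combinations(self.sentence_labels, 2)
        ]


class BuggyDistillationDataset(PairDataset):
    def __init__(self, sentences: List[str], cos_matrix: List[List[float]]) -> None:
        self.cos_matrix = cos_matrix
        # The parent receives an all-zero label list and generates the pairs.
        super().__init__(sentences, labels=[0] * len(sentences))
        # Too late: self.pairs already exists, so this assignment has no effect.
        self.sentence_labels = [(s, i) for i, s in enumerate(sentences)]

    def generate_pairs(self) -> List[Pair]:
        # Labels are used as row/column indices into the similarity matrix.
        return [
            (a, b, self.cos_matrix[idx_a][idx_b])
            for (a, idx_a), (b, idx_b) in combinations(self.sentence_labels, 2)
        ]


class FixedDistillationDataset(PairDataset):
    def __init__(self, sentences: List[str], cos_matrix: List[List[float]]) -> None:
        self.cos_matrix = cos_matrix
        # Pass the row indices as labels so they exist before pairs are built.
        super().__init__(sentences, labels=list(range(len(sentences))))

    def generate_pairs(self) -> List[Pair]:
        return [
            (a, b, self.cos_matrix[idx_a][idx_b])
            for (a, idx_a), (b, idx_b) in combinations(self.sentence_labels, 2)
        ]


sentences = ["a", "b", "c"]
cos_matrix = [
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.4],
    [0.7, 0.4, 1.0],
]

buggy_labels = [label for _, _, label in BuggyDistillationDataset(sentences, cos_matrix).pairs]
fixed_labels = [label for _, _, label in FixedDistillationDataset(sentences, cos_matrix).pairs]
```

In the buggy variant every label is cos_matrix[0][0], while the fixed variant picks up the off-diagonal similarities for each pair.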