This is the official implementation of Leveraging Ensemble Diversity for Robust Self-Training in the Presence of Sample Selection Bias, AISTATS 2024.
We provide the implementation of the
We provide the following implementations.
Sample selection bias (SSB) occurs when data labeling is subject to constraints resulting in a distribution mismatch between labeled and unlabeled data. We illustrate below the two types of labeling considered in our paper:
- IID: The usual uniform labeling that verifies the i.i.d. assumption;
- SSB: Distribution shift between labeled and unlabeled data.
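To make the difference between the two labeling schemes concrete, here is a minimal NumPy sketch. It is not the labeling procedure of the paper or of tsim/datasets: the function `select_labeled_indices`, its `strength` parameter, and the choice of biasing on the first feature are hypothetical and serve only as an illustration.

```python
import numpy as np


def select_labeled_indices(x, n_labeled, biased, strength=2.0, seed=0):
    """Pick the indices of the samples that receive a label.

    biased=False: uniform sampling without replacement (IID labeling).
    biased=True: sampling weights grow with the first feature, a
    hypothetical bias mechanism chosen only for illustration.
    """
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    if biased:
        score = x[:, 0]
        # Subtract the maximum before exponentiating for numerical stability.
        weights = np.exp(strength * (score - score.max()))
        probs = weights / weights.sum()
    else:
        probs = np.full(n, 1.0 / n)
    return rng.choice(n, size=n_labeled, replace=False, p=probs)


# Synthetic two-dimensional data, used only for the demonstration.
x = np.random.default_rng(0).normal(size=(5000, 2))
idx_iid = select_labeled_indices(x, 200, biased=False)
idx_ssb = select_labeled_indices(x, 200, biased=True)

print("mean of feature 0, all data:   ", x[:, 0].mean())
print("mean of feature 0, IID labeled:", x[idx_iid, 0].mean())
print("mean of feature 0, SSB labeled:", x[idx_ssb, 0].mean())
```

Under IID labeling, the labeled subset has roughly the same statistics as the full data, whereas under the biased scheme the labeled samples concentrate in one region of the input space.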
We provide the PyTorch implementation of the
- Backpropagation of the diversity loss only influences the ensemble, not the projection layers;
- In practice, we use $M=5$ heads, resulting in lightweight and fast training;
- Compatible with any SSL method using neural networks as backbones.
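As a rough illustration of the quantity the ensemble is built around, the sketch below computes an agreement score between heads in NumPy. It assumes the $\mathcal{T}$-similarity of an input $x$ is the average pairwise dot product between the heads' softmax outputs, $\frac{1}{M(M-1)} \sum_{m \neq k} h_m(x)^\top h_k(x)$, which is our reading of the paper. The function name `t_similarity` and the array layout are hypothetical, and the code in tsim/models/ remains the reference.

```python
import numpy as np


def t_similarity(probs):
    """Average pairwise dot product between the heads' predictions.

    probs: array of shape (M, n, K) holding the softmax outputs of
    M heads for n samples and K classes.
    Returns an array of shape (n,) with values in [0, 1].
    """
    n_heads = probs.shape[0]
    summed = probs.sum(axis=0)                   # (n, K), sum over heads
    all_pairs = (summed ** 2).sum(axis=1)        # all (m, k) pairs, m == k included
    self_pairs = (probs ** 2).sum(axis=(0, 2))   # the m == k terms only
    return (all_pairs - self_pairs) / (n_heads * (n_heads - 1))


# Two heads, two samples, two classes.
probs = np.array([
    [[1.0, 0.0], [1.0, 0.0]],  # head 1
    [[1.0, 0.0], [0.0, 1.0]],  # head 2
])
sim = t_similarity(probs)
print(sim)  # 1.0 where the heads agree, 0.0 where they disagree
```

A diversity term can then be obtained by averaging this score over unlabeled samples. How it is weighted against the supervised loss (the `gamma` parameter in the demo) is defined by the repository code, not by this sketch.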
Using pip:
pip install git+https://github.com/ambroiseodt/tsim.git#egg=tsim
Or cloning:
git clone https://github.com/ambroiseodt/tsim.git
We provide demos in notebooks/ to get familiar with the implementation and reproduce the figures of the paper:
- plot_intro_figure.ipynb: Overview of the method (Figure 1)
- plot_sample_selection_bias.ipynb: Visualization of the sample selection bias (Figure 3)
- plot_calibration.ipynb: $\mathcal{T}$-similarity corrects overconfidence of the softmax (Figure 6)
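Overconfidence of a confidence measure is commonly quantified with the expected calibration error (ECE). The sketch below is the standard binned estimator, not necessarily the exact computation used in the notebooks. The function name and the `n_bins` default are our own choices, and `confidences` can hold either the maximum softmax probability or the $\mathcal{T}$-similarity of each sample.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |accuracy - confidence| per bin.

    confidences: values in [0, 1], one per sample.
    correct: 1 if the prediction is right, 0 otherwise, one per sample.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin i holds confidences in (edges[i], edges[i + 1]]; 0.0 falls in bin 0.
    bin_ids = np.digitize(confidences, edges, right=True) - 1
    bin_ids = np.clip(bin_ids, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece


# Four predictions made with 95% confidence, only three of them correct.
ece = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 1, 0])
print(f"ECE = {ece:.2f}")
```

A lower value means that the confidence scores track the actual accuracy more closely.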
The code below (in demo.ipynb) gives an example of how to train the architecture introduced above:
import sys
sys.path.append("..")
from tsim.datasets.read_dataset import RealDataSet
from tsim.models.diverse_ensemble import DiverseEnsembleMLP
dataset_name = "mnist"
gamma = 1
n_classifiers = 5
seed = 0
nb_lab_samples_per_class = 10
test_size = 0.25
num_epochs = 5
n_iters = 100
selection_bias = True
# Data split
dataset = RealDataSet(dataset_name=dataset_name, seed=seed)
# Fraction of the training data to label (nb_lab_samples_per_class per class on average)
num_classes = len(list(set(dataset.y)))
ratio = num_classes / ((1 - test_size) * len(dataset.y))
lab_size = nb_lab_samples_per_class * ratio
# Split
x_l, x_u, y_l, y_u, x_test, y_test, n_classes = dataset.get_split(
    test_size=test_size, lab_size=lab_size, selection_bias=selection_bias
)
# Define base classifier
base_classifier = DiverseEnsembleMLP(
    num_epochs=num_epochs,
    gamma=gamma,
    n_iters=n_iters,
    n_classifiers=n_classifiers,
    device="cpu",
    verbose=False,
    random_state=seed,
)
# Train
base_classifier.fit(x_l, y_l, x_u)
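The computation of lab_size in the demo can be checked by hand. The snippet below redoes the arithmetic with a hypothetical dataset size of 60,000 samples (the actual size depends on the dataset loaded by RealDataSet) and assumes that lab_size is interpreted as a fraction of the training portion, as the formula in the demo suggests.

```python
n_samples = 60_000             # hypothetical dataset size, for illustration only
test_size = 0.25
num_classes = 10
nb_lab_samples_per_class = 10

n_train = (1 - test_size) * n_samples         # samples left after the test split
ratio = num_classes / n_train                 # fraction giving one label per class
lab_size = nb_lab_samples_per_class * ratio   # fraction of training data to label
n_labeled = lab_size * n_train                # total number of labeled samples

print(f"lab_size = {lab_size:.5f}, labeled samples = {n_labeled:.0f}")
```

With these numbers, the labeled set contains 100 samples in total, that is, 10 per class on average.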
This package consists of several key modules:
- notebooks/: Contains the notebooks to reproduce the figures from the paper;
- data/: Contains the datasets used in our experiments;
- tsim/datasets: Contains the functions to load datasets and perform the labeling procedure;
- tsim/models/: Contains all the functions to train diverse ensembles with the $\mathcal{T}$-similarity.
Warning
The code is still in development and we will add the following components very soon:
- Visualization of ECE for softmax and $\mathcal{T}$-similarity (Figure 5)
- Self-training algorithms
- Extended requirements.txt
To get started with the
git clone https://github.com/ambroiseodt/tsim.git
pip install -e .[dev]
Please make sure you have Python 3.8 or newer installed.
This project is licensed under the MIT License. See the LICENSE file for more details.
If you use our code in your research, please cite:
@InProceedings{pmlr-v238-odonnat24a,
  title = {Leveraging Ensemble Diversity for Robust Self-Training in the Presence of Sample Selection Bias},
  author = {Odonnat, Ambroise and Feofanov, Vasilii and Redko, Ievgen},
  booktitle = {Proceedings of The 27th International Conference on Artificial Intelligence and Statistics},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v238/odonnat24a/odonnat24a.pdf},
  url = {https://proceedings.mlr.press/v238/odonnat24a.html},
}