Evaluating predictive patterns of antigen-specific B cells by single-cell transcriptome and antibody repertoire sequencing
This repository contains the code to reproduce the analysis by L. Erlach et al.
The field of antibody discovery typically involves extensive experimental screening of B cells from immunized animals. Machine learning (ML)-guided prediction of antigen-specific B cells could accelerate this process, but requires sufficient training data with antigen-specificity labeling. Here, we introduce a dataset of single-cell transcriptome and antibody repertoire sequencing of B cells from immunized mice, which are labeled as antigen-specific or non-specific through experimental selections. We identify gene expression patterns associated with antigen-specificity by differential gene expression analysis and assess their antibody sequence diversity. Subsequently, we benchmark various ML models, both linear and non-linear, trained on different combinations of gene expression and antibody repertoire features. Additionally, we assess transfer learning using features from general and antibody-specific protein language models (PLMs). Our findings show that gene expression-based models outperform sequence-based models for antigen-specificity predictions, highlighting a promising avenue for computational-guided antibody discovery.
conda env create -f environment_scabpred.yml
conda activate abpred
The raw sequencing data is deposited in SRA under the BioProject number: PRJNA1124428.
- The sequencing files were processed with cellranger (v5.0.0), and the scripts for the alignment of the files are in
scSeq_preprocess/1_Cellranger_alignment_GEX.sh
and scSeq_preprocess/2_Cellranger_alignment_VDJ.sh
- The preprocessed single-cell sequencing data was further processed in R, mainly using the Platypus and Seurat packages. The analysis, including the differential expression analysis, is in
scSeq_preprocess/
- Preprocessing of the datasets for ML model evaluations is in
notebooks/ML_preprocess/
Features for the ML model evaluations were generated in the Jupyter notebooks in notebooks/ML_preprocess/
- Gene expression data was processed in
003_GEX_dataprep.ipynb
- Antibody sequencing data was processed in
001_VDJ_OVA_seq_preprocessing.ipynb
and 001.2_VDJ_RBD_seq_preprocessing.ipynb
- The PLM embeddings were generated with the notebooks in
notebooks/ESM_embed/001_Generate_ESM_embeddings.ipynb,
notebooks/ESM_embed/002_Extract_CDR3Embeddings.ipynb,
notebooks/Antiberty_embed/Embed_seqs_antiberty.ipynb
and notebooks/ESM3_embed/Embed_seqs_ESM3.ipynb
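As a rough illustration of how a CDR3-level embedding can be derived from per-residue PLM output, the sketch below locates the CDR3 inside the full variable-region sequence and mean-pools the per-residue vectors over that span. This is an assumption about the general technique, not a description of what the notebooks above actually do. The function names `find_cdr3_span` and `mean_pool_span`, the `token_offset` parameter, and the toy sequence and embeddings are all hypothetical.

```python
from typing import List, Sequence, Tuple


def find_cdr3_span(full_seq: str, cdr3: str) -> Tuple[int, int]:
    """Return the [start, end) residue indices of cdr3 within full_seq."""
    start = full_seq.find(cdr3)
    if start < 0:
        raise ValueError("CDR3 not found in the full sequence")
    return start, start + len(cdr3)


def mean_pool_span(
    residue_embeddings: Sequence[Sequence[float]],
    start: int,
    end: int,
    token_offset: int = 0,
) -> List[float]:
    """Average per-residue embedding vectors over rows [start, end).

    token_offset shifts the span when the model output contains leading
    special tokens (some PLM tokenizers prepend a start token, in which
    case an offset of 1 would be needed). Hypothetical helper.
    """
    span = residue_embeddings[start + token_offset : end + token_offset]
    if not span:
        raise ValueError("Empty span; check indices and token_offset")
    dim = len(span[0])
    return [sum(row[d] for row in span) / len(span) for d in range(dim)]


if __name__ == "__main__":
    # Toy inputs for illustration only: a short made-up sequence and
    # 2-dimensional "embeddings" standing in for real PLM output.
    toy_seq = "QVQLCARDYWGQG"
    toy_cdr3 = "CARDYW"
    toy_embeddings = [[float(i), float(2 * i)] for i in range(len(toy_seq))]

    start, end = find_cdr3_span(toy_seq, toy_cdr3)
    print(mean_pool_span(toy_embeddings, start, end))
```

Mean pooling yields a fixed-length vector regardless of CDR3 length, which is convenient for downstream classifiers; other pooling choices are equally possible.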
Scripts for training and evaluating the ML models are in
src/Spec_classification/Specificity_classification_script.py
src/GEX/Spec_classification/Specificity_classification_script.py
These scripts can be executed as shown in the example below.
./Specificity_classification_script.py --config path_to_config --simsplit_thresh 0.05 --chaintype VH_VL --outpath path_to_save_results
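For launching the script from Python (for example, to queue several runs), a minimal sketch follows. It only assembles the four flags shown in the example above; the helper name `build_command` is hypothetical, and the placeholder paths must be replaced with real ones. Any other options the script may accept are not covered here.

```python
import shlex
from typing import List


def build_command(
    script: str,
    config: str,
    simsplit_thresh: float,
    chaintype: str,
    outpath: str,
) -> List[str]:
    """Assemble the argument list for one classification run.

    Uses only the flags documented in the README example.
    """
    return [
        script,
        "--config", config,
        "--simsplit_thresh", str(simsplit_thresh),
        "--chaintype", chaintype,
        "--outpath", outpath,
    ]


if __name__ == "__main__":
    cmd = build_command(
        script="./Specificity_classification_script.py",
        config="path_to_config",          # placeholder from the README
        simsplit_thresh=0.05,
        chaintype="VH_VL",
        outpath="path_to_save_results",   # placeholder from the README
    )
    # Print the command for inspection. To execute it, pass `cmd` to
    # subprocess.run(cmd, check=True) once the placeholders are real paths.
    print(shlex.join(cmd))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when paths contain spaces.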
Visualization and summarization of the model evaluation results are performed with the Jupyter notebooks notebooks/model_evaluations/Metrics_visualization.ipynb
and notebooks/model_interpretation/LogReg_Koefficient_analysis.ipynb.
The latter also contains the biological analysis of the LogReg models.
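A common first step when interpreting a logistic regression model is to rank features by the magnitude of their coefficients. The sketch below shows that idea in isolation. It is a generic illustration, not the analysis implemented in the notebook; the helper name `rank_features` and the placeholder feature names and values are hypothetical.

```python
from typing import List, Sequence, Tuple


def rank_features(
    names: Sequence[str],
    coefficients: Sequence[float],
    top_n: int = 10,
) -> List[Tuple[str, float]]:
    """Return the top_n (name, coefficient) pairs sorted by |coefficient|.

    The sign is kept so that positive and negative associations with the
    predicted class remain distinguishable.
    """
    if len(names) != len(coefficients):
        raise ValueError("names and coefficients must have the same length")
    paired = sorted(
        zip(names, coefficients),
        key=lambda pair: abs(pair[1]),
        reverse=True,
    )
    return paired[:top_n]


if __name__ == "__main__":
    # Placeholder feature names and coefficient values for illustration.
    names = ["gene_A", "gene_B", "gene_C"]
    coefficients = [0.2, -1.5, 0.7]
    for name, coef in rank_features(names, coefficients, top_n=2):
        print(f"{name}\t{coef:+.2f}")
```

Coefficient magnitudes are only comparable across features when the inputs are on a common scale, so this kind of ranking presupposes standardized or similarly normalized features.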