scSemiProfiler: Advancing Large-scale Single-cell Studies through Semi-profiling with Deep Generative Models and Active Learning

scSemiProfiler is an innovative computational tool that combines deep generative models and active learning to economically generate single-cell data for biological studies. It supports two main application scenarios: semi-profiling, which uses deep generative learning and active learning to generate a single-cell cohort with 1/10 to 1/3 sequencing cost, and single-cell level deconvolution, which generates single-cell data from bulk data and single-cell references. For more insights, check out our manuscript on Nature Communications, and please consider citing it if you find our method beneficial.

Explore comprehensive details, including API references, usage examples, and tutorials (in Jupyter notebook format), in our full documentation and the README below.

Updates:

New Single-Cell Level Deconvolution Pipeline: A simplified pipeline has been added to scSemiProfiler for generating single-cell data from bulk RNA-seq profiles using a single-cell reference sample. See the Application Scenarios section for details.
Global Mode Functions: New global mode functions "inspect_data" and "global_stop_checking" have been introduced. For details, use print(scSemiProfiler.utils.inspect_data.__doc__) and print(scSemiProfiler.utils.global_stop_checking.__doc__).

Application Scenarios

1. Single-Cell Level Deconvolution

This process allows users to deconvolute bulk RNA-seq data from a target sample into single-cell data, using a single-cell reference sample as a guide. Users need to provide bulk data for both the target and reference samples. The single-cell reference can be derived from real sequencing data or any similar online dataset. Once the pipeline is completed, single-cell data for the target sample is generated and can be used for cell type annotation. This includes de novo annotation or utilizing a classifier trained on the reference data. For further guidance, please refer to the deconvolution_example.ipynb.

2. Semi-Profile a Cohort

With bulk data for a cohort, select a few representative samples using active learning for real single-cell sequencing and computationally generate single-cell data for the rest target samples. Getting single-cell data using less than 1/3 cost. Example in example.ipynb.

Methods Overview

scSemiProfiler Overview: scSemiProfiler offers a cost-effective AI-generated alternative to real-profiled single-cell data with high fidelity.

a, Curating bulk and reference single-cell data: Bulk sequencing is performed across the entire cohort. The single-cell reference data can either be provided by the user (e.g., a public reference dataset) or obtained from selected representative samples within the cohort under study. Representative samples can be chosen based on clustering analysis of the bulk data (the global mode of scSemiProfiler) or by using the active learning module.

b, In silico inference of target single-cell data from bulk profiles: For each target sample, a deep generative model first learns the distribution of the reference single-cell data, generating reconstructions of the reference cells. Subsequently, the bulk information of the target sample is incorporated into the cell generation process via fine-tuning, producing single-cell data that matches the target bulk. This AI-powered semi-profiling framework significantly reduces single-cell profiling costs for large cohorts (e.g., a 66.3% savings in the example COVID-19 study). Cost estimates are based on rates from the McGill Genome Centre and costpercell as of 2023.

c, High fidelity between cost-effective AI-generated semi-profiled and ground-truth single-cell data: Left: UMAP visualization shows that the inferred target sample’s single cells (red), generated based on reference cells (blue), closely resemble the real-profiled ground truth of the target sample (red; unknown to the model). Middle: UMAP visualizations compare the real-profiled and semi-profiled COVID-19 cohort with 124 samples, which are similar in terms of cell distribution and cell types (indicated by colors, which are consistent with the legends on the right). Right: Stacked bar plots indicate that the semi-profiled cohort has nearly identical cell type proportions across disease conditions compared to the real-profiled ground truth.

Prerequisites

First, install Anaconda. You can find specific instructions for different operating systems here.

Second, create a new conda environment and activate it:

conda create -n semiprofiler python=3.9

conda activate semiprofiler

Finally, install the version of PyTorch compatible with your devices by following the instructions on the official website.

Installation

There are 2 options to install scSemiProfiler.

Option 1: Install from download directory
download scSemiProfiler from this repository, go to the downloaded scSemiProfiler package root directory, and use the pip tool to install
```
 pip install .
```

Option 2: Install from Github:

 pip install --upgrade https://github.com/mcgilldinglab/scSemiProfiler/zipball/main

Credits and Acknowledgements

scSemiProfiler was developed by Jingtao Wang, Gregory Fonseca, and Jun Ding at McGill University, with support from the Canadian Institutes of Health Research (CIHR), Fonds de recherche du Québec – Santé (FRQS), and the Natural Sciences and Engineering Research Council of Canada (NSERC). Additional funding was provided by the Meakins-Christie Chair in Respiratory Research. This work is part of the Human Cell Atlas (HCA) publication bundle (HCA-8).

Contacts

Please don't hesitate to contact us if you have any questions and we will be happy to help:

jingtao.wang at mail.mcgill.ca
gregory.fonseca at mcgill.ca
jun.ding at mcgill.ca

Name		Name	Last commit message	Last commit date
Latest commit History 88 Commits
.ipynb_checkpoints		.ipynb_checkpoints
docs		docs
example_data		example_data
genesets		genesets
scSemiProfiler		scSemiProfiler
.Rhistory		.Rhistory
.readthedocs.yaml		.readthedocs.yaml
LICENSE.txt		LICENSE.txt
README.md		README.md
cellchat_eg.ipynb		cellchat_eg.ipynb
deconvolution_example.ipynb		deconvolution_example.ipynb
environment.txt		environment.txt
example.ipynb		example.ipynb
inference_example.jpg		inference_example.jpg
overview.jpg		overview.jpg
pyproject.toml		pyproject.toml
recon_example.jpg		recon_example.jpg
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

scSemiProfiler: Advancing Large-scale Single-cell Studies through Semi-profiling with Deep Generative Models and Active Learning

Table of Contents

Application Scenarios

1. Single-Cell Level Deconvolution

2. Semi-Profile a Cohort

Methods Overview

Prerequisites

Installation

Credits and Acknowledgements

Contacts

About

Releases 1

Packages

Contributors 2

Languages

License

mcgilldinglab/scSemiProfiler

Folders and files

Latest commit

History

Repository files navigation

scSemiProfiler: Advancing Large-scale Single-cell Studies through Semi-profiling with Deep Generative Models and Active Learning

Table of Contents

Application Scenarios

1. Single-Cell Level Deconvolution

2. Semi-Profile a Cohort

Methods Overview

Prerequisites

Installation

Credits and Acknowledgements

Contacts

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Contributors 2

Languages

Packages