Releases: ruppinlab/CSI-Microbes-identification
bioRxiv Release April 2023
bioRxiv May 2021 paper
This release supports our bioRxiv May 2021 paper "CSI-Microbes: Identifying cell-type specific intracellular microbes from single-cell RNA-seq data".
Release v0.1.0
This release refers to the core CSI-Microbes identification code (which is generally contained in the git submodule pathogen-discovery-rules). Each subdirectory contains options specific to its dataset, which are not subject to the release. It should be possible to reproduce the results exactly by combining the dataset-specific options (contained in config/PathSeq-config.yaml) with the correct CSI-Microbes identification code tag.
Common CSI-Microbes Component
This version of CSI-Microbes uses PathSeq (v4.1.8.1) to identify microbial reads. It uses the standard options with the exception of --min-score-identity .7, --skip-quality-filters true and --filter-duplicates false. Unless otherwise specified, it uses the host BWA image file and host k-mer file distributed by PathSeq. Unless otherwise specified, the reads are initially mapped using STAR against the human genome GRCh38.p13 (including scaffolds and alternative loci) with the full annotation from Gencode v34 or mapped using CellRanger (v4.0.0) against the human reference genome distributed with CellRanger.
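The non-default PathSeq options above can be collected in one place so that every dataset is run with the same overrides. The following is a minimal sketch, not the repository's actual rule: it assumes PathSeq is invoked through GATK's PathSeqPipelineSpark tool, and the function name, the file names and the extra_args parameter are hypothetical. Reference files (host BWA image, host k-mer file, microbial references) are deliberately left to the caller.

```python
from typing import List, Optional

# The three non-default PathSeq options named in the release notes.
PATHSEQ_OVERRIDES = {
    "--min-score-identity": ".7",
    "--skip-quality-filters": "true",
    "--filter-duplicates": "false",
}


def build_pathseq_command(
    input_bam: str,
    output_bam: str,
    extra_args: Optional[List[str]] = None,
) -> List[str]:
    """Return an argument list for a PathSeq run with the release's overrides.

    Assumption: PathSeq is run as `gatk PathSeqPipelineSpark`. Reference
    files are not filled in here; pass them through `extra_args`.
    """
    cmd = ["gatk", "PathSeqPipelineSpark", "--input", input_bam, "--output", output_bam]
    for flag, value in PATHSEQ_OVERRIDES.items():
        cmd += [flag, value]
    cmd += list(extra_args or [])
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(build_pathseq_command("unmapped.bam", "pathseq.bam")))
```

Returning a list rather than a single string means the command can be handed to subprocess.run without shell quoting concerns.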
CSI-Microbes identification on 10x data
First, fastq files are aligned to the human reference genome using CellRanger (v4.0.0). Next, any reads that align are filtered out. Next, using annotations provided by CellRanger, template switch oligonucleotides and polyA tails are hard-clipped, and any reads shorter than 15 nucleotides or missing a valid cell barcode (CB) or unique molecular identifier (UMI) tag are removed. The reads are then hard-clipped (--cut_tail) and filtered for read length (--length_required 25), low complexity (--low_complexity_filter 30) and low quality (--unqualified_percent_limit 40) using fastp. The cleaned fastq file is converted to a BAM file and processed through PathSeq. The output BAM of PathSeq is then combined with the filtered CellRanger output BAM to add the necessary CB and UMI tags. This BAM is filtered by valid cell barcode, the best mapping UMI is selected, and the resulting cell-specific BAM is re-scored by PathSeq.
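The read-level filter applied before fastp (minimum length and required barcode tags) can be sketched as a small predicate. This is an illustrative sketch only, not the code in pathogen-discovery-rules: reads are modelled as a plain sequence string plus a dictionary of tags, the function name is hypothetical, and the UMI tag name "UB" is an assumption based on the usual CellRanger BAM convention.

```python
from typing import Dict

MIN_READ_LENGTH = 15  # reads shorter than this after hard-clipping are removed


def keep_read(sequence: str, tags: Dict[str, str], umi_tag: str = "UB") -> bool:
    """Return True if a hard-clipped read passes the pre-fastp filter.

    A read is kept only if it is at least MIN_READ_LENGTH nucleotides long
    and carries both a cell barcode (CB) tag and a UMI tag. The UMI tag
    name defaults to "UB", which is an assumption, not taken from the
    release notes.
    """
    if len(sequence) < MIN_READ_LENGTH:
        return False
    if not tags.get("CB"):
        return False
    if not tags.get(umi_tag):
        return False
    return True


if __name__ == "__main__":
    # Hypothetical reads, for illustration only.
    reads = [
        ("ACGTACGTACGTACGTACGT", {"CB": "AAACCTGAGAAACCAT-1", "UB": "ACGTACGTAC"}),
        ("ACGTACGT", {"CB": "AAACCTGAGAAACCAT-1", "UB": "ACGTACGTAC"}),  # too short
        ("ACGTACGTACGTACGTACGT", {"UB": "ACGTACGTAC"}),  # no cell barcode
    ]
    kept = [seq for seq, tags in reads if keep_read(seq, tags)]
    print(len(kept))
```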
CSI-Microbes on full-length scRNA-seq datasets
First, the paired fastq files are hard-clipped (--cut_tail), stripped of adapter sequences, and filtered for read length (--length_required 25), low complexity (--low_complexity_filter 30) and low quality (--unqualified_percent_limit 40) using fastp. Then, the paired fastq files are aligned to the human genome using STAR. Next, any aligned reads are removed using STAR's uT tag. Finally, the BAM is run through PathSeq.
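The fastp step described for paired-end data can be sketched as a command builder. This is a hedged sketch rather than the release's actual invocation: the function and file names are hypothetical, and the low-complexity threshold of 30 is passed through --complexity_threshold alongside the --low_complexity_filter switch, which is how fastp's documentation spells that setting (the text above writes it as --low_complexity_filter 30). No adapter flag is set because fastp trims adapters by default.

```python
import shlex
from typing import List


def build_fastp_command(in1: str, in2: str, out1: str, out2: str) -> List[str]:
    """Return a fastp argument list mirroring the options in the text."""
    return [
        "fastp",
        "--in1", in1,
        "--in2", in2,
        "--out1", out1,
        "--out2", out2,
        "--cut_tail",                         # quality-trim from the 3' end
        "--length_required", "25",            # drop reads shorter than 25 nt
        "--low_complexity_filter",            # enable the complexity filter
        "--complexity_threshold", "30",       # assumed spelling of the "30" setting
        "--unqualified_percent_limit", "40",  # max % of low-quality bases
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = build_fastp_command("R1.fastq.gz", "R2.fastq.gz", "R1.clean.fastq.gz", "R2.clean.fastq.gz")
    print(shlex.join(cmd))
```

For the single-end 10x reads, the same options would apply with the --in2 and --out2 arguments dropped.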