RNA-seq pipeline
The raw metatranscriptomic reads were processed with fastp to remove low-quality reads and adapter contamination, producing clean reads for further analysis. Human-derived reads were identified in three steps: 1) identification of human ribosomal RNA (rRNA) by aligning clean reads to human rRNA sequences using BWA-MEM; 2) identification of human transcripts by mapping reads to the hg19 reference genome using the RNA-seq aligner HISAT2; and 3) a second round of human read identification by classifying the remaining reads against hg38 using Kraken 2. All human RNA reads were then removed to generate the qualified non-human RNA-seq data.
The remaining non-human non-rRNA reads were processed by Kraken 2X v2.0.8-beta. Non-viral microbial taxon assignment of the non-human non-rRNA reads was performed using the clade-specific marker gene-based MetaPhlAn2 with default parameters, plus the option that excludes viruses from the profile (--ignore_viruses).
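The alignment-based steps above (BWA-MEM and HISAT2) both produce SAM records, and reads that did not align to the human reference can be recognised from the SAM flag field. The following is a minimal sketch of that idea, not the pipeline's actual implementation: the function name `non_human_read_ids` is hypothetical, and it assumes paired-end reads where a pair is kept only if neither mate aligned.

```python
UNMAPPED = 0x4        # SAM flag bit: this read is unmapped
MATE_UNMAPPED = 0x8   # SAM flag bit: the mate of this read is unmapped


def non_human_read_ids(sam_lines):
    """Return IDs of read pairs in which neither mate aligned to the reference.

    Hypothetical helper for illustration; `sam_lines` is any iterable of
    SAM-formatted text lines (header lines start with '@').
    """
    keep = set()
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        read_id, flag = fields[0], int(fields[1])
        # Keep the pair only when both this read and its mate are unmapped.
        if flag & UNMAPPED and flag & MATE_UNMAPPED:
            keep.add(read_id)
    return keep
```

A pair in which only one mate aligns to the human reference is treated as human here; whether that matches the pipeline's own rule is an assumption.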
Python: v3+
Software for this pipeline:
git clone https://github.com/rusher321/RNA-seq-2019nCov.git
Note: The software dependencies above must be installed separately, following their own instructions. After installation, edit the config.yaml file and change each software path to your own installation path.
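Before editing config.yaml, it can help to confirm which tools are already reachable from your shell. The sketch below only checks the current PATH; the pipeline itself reads tool locations from config.yaml. The executable names in `REQUIRED_TOOLS` are assumptions based on the tools named in this README, so adjust them to match your installation.

```python
import shutil

# Assumed executable names; adjust them to match your installation.
REQUIRED_TOOLS = ["fastp", "bwa", "hisat2", "kraken2", "metaphlan2.py", "snakemake"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    for tool in missing_tools():
        print(f"not found on PATH: {tool}")
```

For tools that are found, `shutil.which(tool)` returns the full path, which is the kind of value config.yaml expects.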
- build the human rRNA index for BWA
bwa index Human_rRNA_NCBI.fa
- build the human genome index for HISAT2
hisat2-build -p 6 hg19.fa hg19
- build the Kraken 2 database index
# add the human genome to the database
kraken2-build --add-to-library hg38.fa --db ./YourDBpath/
# add the HCoV-19 genome to the database
kraken2-build --add-to-library HCoV-19.fa --db ./YourDBpath/
# build the database once all libraries have been added
kraken2-build --build --threads 8 --db ./YourDBpath/
Here we used MiniKraken2_v2_8GB (5.5GB), an 8GB Kraken 2 database built from the RefSeq bacteria, archaea, and viral libraries and the GRCh38 human genome.
- build the Kraken 2X database index
kraken2-build --build --protein --db $DBNAME
- Edit the config.yaml file, and change the database path to your own path
Input requirements
Generate a sample information file (for example, samples.tsv) like the one below:
id | fq1 | fq2 |
---|---|---|
demo1 | demo1.1.fq.gz | demo1.2.fq.gz |
demo2 | demo2.1.fq.gz | demo2.2.fq.gz |
The header must be: id fq1 fq2.
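A quick check of the sample sheet before running the init step can catch header mistakes early. This is a minimal sketch, assuming the file is tab-separated (suggested by the .tsv extension used later in this README); the function name `read_samples` is hypothetical and not part of the pipeline.

```python
import csv
import io

EXPECTED_HEADER = ["id", "fq1", "fq2"]


def read_samples(handle):
    """Parse a tab-separated sample sheet and return a list of row dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != EXPECTED_HEADER:
        raise ValueError(f"header must be {EXPECTED_HEADER}, got {reader.fieldnames}")
    rows = list(reader)
    for row in rows:
        # Every sample needs an id and both FASTQ paths.
        if not all(row.get(key) for key in EXPECTED_HEADER):
            raise ValueError(f"incomplete row: {row}")
    return rows


if __name__ == "__main__":
    demo = "id\tfq1\tfq2\ndemo1\tdemo1.1.fq.gz\tdemo1.2.fq.gz\n"
    for sample in read_samples(io.StringIO(demo)):
        print(sample["id"], sample["fq1"], sample["fq2"])
```

To check a real file, pass an open handle, for example `read_samples(open("samples.tsv", newline=""))`.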
Init
cd to your working directory and run:
python /path/to/git/RNAseq init -d ./ -s samples.tsv
After that, the initial files will be generated in your working directory:
ls ./
assay
results
scripts
sources
study
config.yaml
cluster.yaml
Generate the command line and run it on a local computer:
python /path/to/your/git/RNAseq commandline -d ./ -u all
snakemake --snakefile /path/to/your/git/Snakefile --configfile config.yaml --until all
Or submit to a cluster using qsub:
snakemake --snakefile /path/to/git/Snakefile \
--configfile ./config.yaml \
--cluster-config ./cluster.yaml \
--jobs 80 \
--cluster "qsub -S /bin/bash -cwd \
-q {cluster.queue} \
-P {cluster.project} \
-l vf={cluster.mem},p={cluster.cores} \
-binding linear:{cluster.cores} \
-o {cluster.output} \
-e {cluster.error}" \
--latency-wait 360 \
-k \
--until all
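The `{cluster.queue}`, `{cluster.project}` and similar placeholders in the qsub string are filled from the file passed to `--cluster-config`. The sketch below mimics that substitution with plain Python string formatting so you can preview the resulting qsub command. All values in `default_cluster` are hypothetical stand-ins for entries in your own cluster.yaml, and `render_qsub` is an illustrative helper, not part of Snakemake or this pipeline.

```python
from types import SimpleNamespace

# Hypothetical values standing in for the entries of your cluster.yaml.
default_cluster = {
    "queue": "my.q",
    "project": "my_project",
    "mem": "4G",
    "cores": 2,
    "output": "logs/job.o",
    "error": "logs/job.e",
}

# Same placeholders as the qsub string passed to --cluster above.
QSUB_TEMPLATE = (
    "qsub -S /bin/bash -cwd "
    "-q {cluster.queue} -P {cluster.project} "
    "-l vf={cluster.mem},p={cluster.cores} "
    "-binding linear:{cluster.cores} "
    "-o {cluster.output} -e {cluster.error}"
)


def render_qsub(template, settings):
    """Fill {cluster.<key>} placeholders using attribute-style formatting."""
    return template.format(cluster=SimpleNamespace(**settings))


if __name__ == "__main__":
    print(render_qsub(QSUB_TEMPLATE, default_cluster))
```

A key that is referenced in the template but missing from the settings raises an `AttributeError` here, which is a convenient way to spot an incomplete cluster.yaml before submitting jobs.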
Please report problems by opening an issue on GitHub.
- Huahui Ren
- Zhun Shi
Thanks for the support from Jie Zhu (@alienzj), Jiahui Zhu, and Fangming Yang.
Released under the MIT license.