This repository contains the workflow for performing differential expression analysis on publicly available Amblyomma americanum data. The repo accompanies the pub, "An interactive visualization tool for Amblyomma americanum differential expression data."
This project performed differential expression analysis using A. americanum RNA-seq data. Differential gene expression analysis is used to identify changes in gene expression between one group of samples and another. Genes can either be induced or repressed relative to a control, or show no statistically significant change in expression.
The publicly available data analyzed here had metadata for the variables sex (male, female), tissue (whole tick, salivary gland, midgut), and time in blood meal. The final product of this analysis is a Shiny App that allows users to explore the results of differential expression analyses for these variables. The purpose of this dashboard is to allow users to explore the expression of genes of interest in various tick tissues and at different times in a blood meal. We hope this information can be used to limit potential genes of interest based on location or timing of expression.
The data analyzed in this repository are obtained from the following SRA studies:
- SRP091404: 6 samples, single-end short reads (Reference)
- SRP446981: 24 samples, paired-end short reads (Reference)
- SRP032795: 12 samples (bio-panned excluded), paired-end short reads (Reference)
- SRP051699: 4 samples, paired-end short reads (Reference)
- SRP052078; SRP052091; SRP052108; SRP052106; SRP052114; SRP052123; SRP052145; SRP052154: 8 samples, paired-end short reads (not published, no reference)
We downloaded reads with fasterq-dump from the SRA Toolkit, then quality- and adapter-trimmed them with fastp.
We quantified transcripts using salmon.
We used the reference transcriptome assembled in the repository 2023-amblyomma-americanum-txome-assembly to quantify read counts.
Salmon produces transcript counts; however, differential expression results are more accurate when gene counts, not transcript counts, are compared.
The most common way to summarize transcript counts to gene counts is to use a transcript-to-gene mapping file.
The R package tximport uses a tx2gene file to sum the counts for all transcripts that encode the same gene and to report the gene-level counts.
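As a rough illustration of the summarization idea described above, the sketch below sums transcript counts into gene counts using a transcript-to-gene lookup. It is written in Python for illustration only and is not how tximport is implemented. The function name, identifiers, and counts are all hypothetical.

```python
from collections import defaultdict


def summarize_to_genes(transcript_counts, tx2gene):
    """Sum transcript-level counts into gene-level counts.

    transcript_counts: dict of transcript ID -> estimated read count
    tx2gene: dict of transcript ID -> gene ID
    Transcripts missing from tx2gene are skipped and returned separately.
    """
    gene_counts = defaultdict(float)
    unmapped = []
    for transcript, count in transcript_counts.items():
        gene = tx2gene.get(transcript)
        if gene is None:
            unmapped.append(transcript)
            continue
        gene_counts[gene] += count
    return dict(gene_counts), unmapped


# Hypothetical identifiers and counts, for illustration only.
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
transcript_counts = {"tx1": 10.0, "tx2": 5.5, "tx3": 2.0, "tx4": 1.0}

gene_counts, unmapped = summarize_to_genes(transcript_counts, tx2gene)
print(gene_counts)  # {'geneA': 15.5, 'geneB': 2.0}
print(unmapped)     # ['tx4']
```

Returning the unmapped transcripts separately makes it easy to see how many transcripts have no gene assignment.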
To generate a transcript-to-gene map (tx2gene file) for gene-level quantification, we first mapped the reference transcripts back to the genome using uLTRA.
Then, we assigned a gene name to a transcript when it overlapped part of the gene's interval as annotated in the reference GFF3 annotation file produced in the repository 2023-amblyomma-annotation.
This step is performed by the script assign_mapped_transcripts_to_gene_by_gtf_overlap.py.
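The overlap rule can be sketched roughly as follows. This is a minimal illustration, not the logic of the script itself: it assumes closed 1-based intervals, ignores strand, and reports every overlapping gene without choosing between them. The function names, contig names, coordinates, and identifiers are hypothetical.

```python
def overlaps(start_a, end_a, start_b, end_b):
    """Return True when two closed intervals share at least one base."""
    return start_a <= end_b and start_b <= end_a


def assign_genes(transcript_alignments, gene_intervals):
    """Map each aligned transcript to the genes whose interval it overlaps.

    transcript_alignments: list of (transcript_id, contig, start, end)
    gene_intervals: list of (gene_id, contig, start, end)
    Returns dict of transcript_id -> list of overlapping gene IDs.
    """
    assignments = {}
    for tx_id, tx_contig, tx_start, tx_end in transcript_alignments:
        hits = [
            gene_id
            for gene_id, contig, start, end in gene_intervals
            if contig == tx_contig and overlaps(tx_start, tx_end, start, end)
        ]
        assignments[tx_id] = hits
    return assignments


# Hypothetical gene annotations and transcript alignments.
genes = [("geneA", "contig1", 100, 500), ("geneB", "contig1", 800, 1200)]
alignments = [
    ("tx1", "contig1", 450, 700),  # overlaps the end of geneA
    ("tx2", "contig1", 600, 750),  # falls between the two genes
    ("tx3", "contig2", 100, 200),  # on a contig with no annotated genes
]

assignments = assign_genes(alignments, genes)
print(assignments)  # {'tx1': ['geneA'], 'tx2': [], 'tx3': []}
```

Transcripts with an empty list, such as tx2 and tx3 here, would have no gene assignment under this rule.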
We imported transcript counts and summarized them to gene counts using the tximport package function tximport() with the parameter type = "salmon".
We then built differential expression models using the DESeq2 package commands DESeqDataSetFromTximport() and DESeq().
This step is performed by the script build_diffex_models.R.
The previous steps are orchestrated by a Snakefile. The main outputs of this workflow become the inputs to the Shiny App. The Shiny App hosts a reactive user interface to explore the results from the differential expression models. Please refer to the README in the shiny app folder for more information.
While we had 54 input samples, we were only able to analyze 20. An initial exploration of sample similarity is available in the notebook 20231013-differential-expression-by-groups.ipynb. Below we discuss exclusion criteria for each set of samples that we removed before performing differential expression analysis.
- SRP091404: From the six initial samples, we excluded four (Um, Uf, Im, and If) because these samples were all exposed to the pathogen Ehrlichia chaffeensis (Reference). Since no RNA-seq samples in other studies were exposed to this pathogen, we had no way to account for this variable. Further, these samples didn't have replicates, which are required to perform differential expression analysis.
- SRP446981: These 24 samples were excluded because of batch effects. All samples from the study clustered tightly together and away from similar biological replicates from other studies, indicating the batch effects were too strong to make cross-study comparisons. These samples were all unfed female whole ticks injected with either PBS (control) or Escherichia coli, so it is possible that the injections caused a biological impact that led to the batch effects, but this is not possible to evaluate with the available data (Reference).
- SRP052078; SRP052091; SRP052108; SRP052106; SRP052114; SRP052123; SRP052145; SRP052154: These eight samples originate from two A. americanum cell lines. The samples all cluster tightly together and away from other samples. However, since we have no other cell line data from other studies, there is no way for us to evaluate whether they cluster alone because they have different expression or strong batch effects.
The final set of samples we used to perform differential expression analysis is described in the table below.
library_name | sex_tissue | sex_tissue_blood_meal_hour | run_accession | bioproject | sex | tissue | blood_meal_hour | blood_meal_hour_range | total_spots | publication_doi |
---|---|---|---|---|---|---|---|---|---|---|
AmbamSG7-11d | female_x_salivary_gland | NA | SRR1740611 | PRJNA218793 | female | salivary_gland | 168;192;216;264 | 168_264 | 44443612 | https://doi.org/10.1371/journal.pone.0131292 |
AmbamSG72-144h | female_x_salivary_gland | female_x_salivary_gland_x_72_144 | SRR1740609 | PRJNA218793 | female | salivary_gland | 72;120;144 | 72_144 | 49791091 | https://doi.org/10.1371/journal.pone.0131292 |
AmbameSG12-18h | female_x_salivary_gland | female_x_salivary_gland_x_12_48 | SRR1740608 | PRJNA218793 | female | salivary_gland | 12;18;24;36;48 | 12_48 | 25095072 | https://doi.org/10.1371/journal.pone.0131292 |
AmbamSGunfed | female_x_salivary_gland | NA | SRR1740607 | PRJNA218793 | female | salivary_gland | 0 | 0 | 53124914 | https://doi.org/10.1371/journal.pone.0131292 |
PL17 | male_x_whole | male_x_whole_x_72_144 | SRR1027761 | PRJNA226980 | male | whole | 72 | 72_144 | 413856 | https://doi.org/10.1186/s13071-017-2080-1; https://doi.org/10.1186/1471-2164-15-518 |
PL3 | male_x_whole | male_x_whole_x_72_144 | SRR1027763 | PRJNA226980 | male | whole | 72 | 72_144 | 850276 | https://doi.org/10.1186/s13071-017-2080-1; https://doi.org/10.1186/1471-2164-15-518 |
PL2 | female_x_whole | female_x_whole_x_72_144 | SRR1027762 | PRJNA226980 | female | whole | 96 | 72_144 | 1031591 | https://doi.org/10.1186/s13071-017-2080-1; https://doi.org/10.1186/1471-2164-15-518 |
PL1 | female_x_whole | NA | SRR1027751 | PRJNA226980 | female | whole | 24 | 12_48 | 740878 | https://doi.org/10.1186/s13071-017-2080-1; https://doi.org/10.1186/1471-2164-15-518 |
AAFM | male_x_whole | male_x_whole_x_72_144 | SRR1027485 | PRJNA226980 | male | whole | 72 | 72_144 | 50887 | https://doi.org/10.1186/s13071-017-2080-1; https://doi.org/10.1186/1471-2164-15-518 |
AAUM | male_x_whole | NA | SRR1027483 | PRJNA226980 | male | whole | 0 | 0 | 78212 | https://doi.org/10.1186/s13071-017-2080-1; https://doi.org/10.1186/1471-2164-15-518 |
AAFF | female_x_whole | female_x_whole_x_72_144 | SRR1027481 | PRJNA226980 | female | whole | 96 | 72_144 | 95726 | https://doi.org/10.1186/s13071-017-2080-1; https://doi.org/10.1186/1471-2164-15-518 |
AAUF | female_x_whole | NA | SRR1027479 | PRJNA226980 | female | whole | 0 | 0 | 352313 | https://doi.org/10.1186/s13071-017-2080-1; https://doi.org/10.1186/1471-2164-15-518 |
MG120 | female_x_midgut | female_x_midgut_x_72_144 | SRR1027477 | PRJNA226980 | female | midgut | 120 | 72_144 | 81897 | https://doi.org/10.1186/s13071-017-2080-1; https://doi.org/10.1186/1471-2164-15-518 |
MG96 | female_x_midgut | female_x_midgut_x_72_144 | SRR1027476 | PRJNA226980 | female | midgut | 96 | 72_144 | 40713 | https://doi.org/10.1186/s13071-017-2080-1; https://doi.org/10.1186/1471-2164-15-518 |
MG48 | female_x_midgut | NA | SRR1027475 | PRJNA226980 | female | midgut | 48 | 12_48 | 61002 | https://doi.org/10.1186/s13071-017-2080-1; https://doi.org/10.1186/1471-2164-15-518 |
SG120 | female_x_salivary_gland | female_x_salivary_gland_x_72_144 | SRR1027474 | PRJNA226980 | female | salivary_gland | 120 | 72_144 | 67277 | https://doi.org/10.1186/s13071-017-2080-1; https://doi.org/10.1186/1471-2164-15-518 |
SG96 | female_x_salivary_gland | female_x_salivary_gland_x_72_144 | SRR1027473 | PRJNA226980 | female | salivary_gland | 96 | 72_144 | 86616 | https://doi.org/10.1186/s13071-017-2080-1; https://doi.org/10.1186/1471-2164-15-518 |
SG48 | female_x_salivary_gland | female_x_salivary_gland_x_12_48 | SRR1027471 | PRJNA226980 | female | salivary_gland | 48 | 12_48 | 179494 | https://doi.org/10.1186/s13071-017-2080-1; https://doi.org/10.1186/1471-2164-15-518 |
NEm | male_x_whole | NA | SRR4416251 | PRJNA327120 | male | whole | 168 | 168_264 | 39445131 | https://doi.org/10.1016/j.aspen.2018.05.009 |
NEf | female_x_whole | NA | SRR4416250 | PRJNA327120 | female | whole | 168 | 168_264 | 28403332 | https://doi.org/10.1016/j.aspen.2018.05.009 |
The majority of samples that we included in our differential expression analysis were not originally sequenced for differential expression analysis. This means that most samples did not have replicates. Differential expression analysis requires at least two samples per condition, and statistical power to detect true gene expression differences increases with the number of samples. Further, to build a differential expression model with DESeq2, the statistical design for multi-variable comparisons requires that every combination of the variables' levels is observed. For example, let's say we're studying expression in males and females when treated with a drug. DESeq2 would require the following:
sample | sex | treatment |
---|---|---|
s1 | male | control |
s2 | male | control |
s3 | male | treatment |
s4 | male | treatment |
s5 | female | control |
s6 | female | control |
s7 | female | treatment |
s8 | female | treatment |
All combinations of sex and treatment are observed. DESeq2 will fail with an error if we have something like this:
sample | sex | treatment |
---|---|---|
s1 | male | control |
s2 | male | control |
s3 | male | treatment |
s4 | male | treatment |
s5 | female | control |
s6 | female | control |
DESeq2 wouldn't know how to compare male treatment against female control without female treatment also being present. However, it is possible to trick DESeq2 into building a model by combining variables. So continuing with the above example, we could do:
sample | sex_treatment |
---|---|
s1 | male_control |
s2 | male_control |
s3 | male_treatment |
s4 | male_treatment |
s5 | female_control |
s6 | female_control |
DESeq2 can build this model. Then, we can make the comparisons that this experimental design supports:
- male_control versus male_treatment
- male_control versus female_control
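One way to see why combining variables helps is to compare the design matrices. The sketch below is plain Python for illustration and does not reproduce DESeq2's own check. It assumes the two-variable design includes a sex-by-treatment interaction term, in which case the missing female treatment group leaves an all-zero column and the matrix is not full rank. The combined-factor design has one column per observed group and is full rank. The `matrix_rank` helper is written here for the example.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small numeric matrix via Gaussian elimination."""
    matrix = [[Fraction(value) for value in row] for row in rows]
    rank = 0
    for col in range(len(matrix[0])):
        # Find a row at or below the current rank with a nonzero entry.
        pivot = next(
            (r for r in range(rank, len(matrix)) if matrix[r][col] != 0), None
        )
        if pivot is None:
            continue
        matrix[rank], matrix[pivot] = matrix[pivot], matrix[rank]
        # Clear this column in every other row.
        for r in range(len(matrix)):
            if r != rank and matrix[r][col] != 0:
                factor = matrix[r][col] / matrix[rank][col]
                matrix[r] = [a - factor * b for a, b in zip(matrix[r], matrix[rank])]
        rank += 1
    return rank


# Samples s1..s6 from the example without female treatment samples.
samples = [
    ("male", "control"), ("male", "control"),
    ("male", "treatment"), ("male", "treatment"),
    ("female", "control"), ("female", "control"),
]

# Columns: intercept, female, treatment, female:treatment (assumed interaction).
interaction_design = [
    [1, int(sex == "female"), int(trt == "treatment"),
     int(sex == "female" and trt == "treatment")]
    for sex, trt in samples
]

# Combined factor: intercept plus one indicator per non-reference group.
groups = [f"{sex}_{trt}" for sex, trt in samples]
levels = sorted(set(groups))  # the first level acts as the reference
combined_design = [
    [1] + [int(group == level) for level in levels[1:]]
    for group in groups
]

print(matrix_rank(interaction_design), len(interaction_design[0]))  # 3 4
print(matrix_rank(combined_design), len(combined_design[0]))        # 3 3
```

A rank lower than the number of columns means at least one coefficient cannot be estimated from the data.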
Since our data were sparse, we had to combine our variables of interest as shown above to be able to build a model. We were able to include the largest number of samples by combining the variables sex and tissue:
sex_tissue | n |
---|---|
female_x_midgut | 3 |
female_x_salivary_gland | 7 |
female_x_whole | 5 |
male_x_whole | 5 |
However, many people at Arcadia care about how gene expression varies based on time in the blood meal. Given this, we built a second model that included time in the blood meal, including only samples that had replicates within the same blood meal hour range.
sex_tissue_blood_meal_hour | n |
---|---|
female_x_midgut_x_72_144 | 2 |
female_x_salivary_gland_x_12_48 | 2 |
female_x_salivary_gland_x_72_144 | 3 |
female_x_whole_x_72_144 | 2 |
male_x_whole_x_72_144 | 3 |
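The replicate filter behind this second model can be sketched in a few lines of Python. The sex_tissue and blood_meal_hour_range values are copied from the sample table above. The threshold of two samples per group is our reading of the replicate requirement, and the code is an illustration, not the logic in build_diffex_models.R.

```python
from collections import Counter

# (sex_tissue, blood_meal_hour_range) for the 20 samples in the sample table.
samples = [
    ("female_x_salivary_gland", "168_264"),
    ("female_x_salivary_gland", "72_144"),
    ("female_x_salivary_gland", "12_48"),
    ("female_x_salivary_gland", "0"),
    ("male_x_whole", "72_144"),
    ("male_x_whole", "72_144"),
    ("female_x_whole", "72_144"),
    ("female_x_whole", "12_48"),
    ("male_x_whole", "72_144"),
    ("male_x_whole", "0"),
    ("female_x_whole", "72_144"),
    ("female_x_whole", "0"),
    ("female_x_midgut", "72_144"),
    ("female_x_midgut", "72_144"),
    ("female_x_midgut", "12_48"),
    ("female_x_salivary_gland", "72_144"),
    ("female_x_salivary_gland", "72_144"),
    ("female_x_salivary_gland", "12_48"),
    ("male_x_whole", "168_264"),
    ("female_x_whole", "168_264"),
]

# Combine the variables into one label, then keep groups with replicates.
labels = [f"{sex_tissue}_x_{hour_range}" for sex_tissue, hour_range in samples]
counts = Counter(labels)
kept = {label: n for label, n in sorted(counts.items()) if n >= 2}

for label, n in kept.items():
    print(label, n)
print(sum(kept.values()), "of", len(samples), "samples retained")
```

Run on the table's values, this keeps the five groups listed above, which together contain 12 of the 20 samples.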
In general, it is better to analyze all samples in a single model, but given the limitations of working with these data, we worked within the bounds of what was possible.
The goal of this project is to allow Arcadians to explore A. americanum differential expression for genes of interest. To that end, the primary output of this project is a Shiny App that allows users to explore the results of the differential expression models we described above. Please see the shiny folder for more information about the model.
We tried a lot of different approaches to get a gene count matrix, summarized below.
- Genome mapping: We mapped RNA-seq reads against the A. americanum genome using STAR and then counted reads per gene using HTSeq, with genes delineated by the EVM gene annotations. Some libraries had very low mapping rates, which we think is because the source ticks were less genetically similar to the reference. For libraries that had relatively high mapping rates, the percent of reads that were counted as genes was low. Given these low mapping and counting rates, we decided to map against the transcriptome and then explored methods for determining which transcripts encoded which genes. See the notebook 20231027-comparing-counting-methods.ipynb for an exploration of the genome mapping rates.
- Clustering transcripts by shared mapped reads: We experimented with the software libraries Grouper, Compacta, and Corset. All three libraries look at shared mapping and expression of multi-mapped reads to determine which transcripts are likely isoforms of the same genes. We found that Compacta was uninstallable, Grouper errored out when parsing the config file with a basic Python error, and Compacta ran for over 9 days without progressing past 10% finished. While we think the theory behind these approaches is appropriate for this problem, we didn't pursue them further because the software tools were too difficult to use.
- Experimented with new clustering techniques based on shared k-mers in transcripts: We briefly experimented with new approaches for clustering transcripts into genes based on shared k-mer content in this GitHub repository. We tried two tools, spacegraphcats and kSpider. spacegraphcats clusters transcripts together based on proximity in a compact de Bruijn graph, while kSpider clusters them based on a threshold containment of shared k-mers between transcripts. We found both methods very promising in initial evaluations and are interested in pursuing both further, but spacegraphcats suffers from a bug when using long k-mer sizes, and kSpider isn't well documented, so it's unclear how to execute it appropriately. We hope to continue experimenting with these methods when these issues are resolved.
Given these experiments, we decided to map against the transcriptome (since it had higher mapping rates, see 20231027-comparing-counting-methods.ipynb) and to assign genes to transcripts based on the gene each transcript overlapped when mapped against the genome. We aren't thrilled with this outcome, as it requires an annotated reference genome, but we think it was a good enough approach to use in this project.
- inputs/: Directory containing metadata and other necessary inputs for the workflow.
- scripts/: Directory containing scripts for data processing and analysis.
- envs/: Directory containing conda environment files for software dependencies.
- Snakefile: The main workflow script orchestrating the analysis.
- notebooks: Directory containing analysis notebooks. These notebooks are scratch analyses that were used to determine the best next steps for the Snakemake pipeline or for visualizations to include in the Shiny App.
- shiny: Directory containing the code and data required to run the A. americanum Differential Expression Explorer Shiny App. Also contains the instructions for running the app locally.
The main analysis is orchestrated by a Snakefile, which defines various rules for processing the data.
Here is a brief description of the workflow:
- Download Data: Downloads the raw sequencing data from SRA using fasterq-dump.
- Pre-processing: Combines multiple run accessions into one file per Illumina library name, performs quality control and trimming with fastp, and separates paired-end reads.
- Read Quantification: Indexes the transcriptome using Salmon, quantifies transcripts, and creates a tx2gene file by mapping transcripts back to the genome using uLTRA and assigning transcripts gene names by determining the gene each transcript overlaps.
- Differential Expression Analysis: Builds differential expression models to study the effects of sex, tissue, and blood meal hour on gene expression. These differential expression models are then used as input to the Shiny App.
The Snakefile contains additional documentation detailing the purpose of each rule.
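As an illustration of the pre-processing step that combines run accessions per library, the sketch below groups run accessions by library name from tab-separated metadata. It is not taken from the Snakefile. The column names library_name and run_accession match the sample table above, but whether the metadata file uses exactly these columns is an assumption, and the rows shown are placeholders.

```python
import csv
import io
from collections import defaultdict


def runs_by_library(tsv_text):
    """Group run accessions by library name from tab-separated metadata."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    grouped = defaultdict(list)
    for row in reader:
        grouped[row["library_name"]].append(row["run_accession"])
    return dict(grouped)


# Placeholder rows; the library names and accessions are made up.
example_tsv = (
    "library_name\trun_accession\n"
    "libA\tRUN0001\n"
    "libA\tRUN0002\n"
    "libB\tRUN0003\n"
)

grouped = runs_by_library(example_tsv)
print(grouped)  # {'libA': ['RUN0001', 'RUN0002'], 'libB': ['RUN0003']}
```

Each list could then be used to decide which downloaded read files to concatenate for a given library.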
We used the Canonical, Ubuntu, 22.04 LTS, amd64 jammy image built on 2023-05-16 with 64-bit architecture and AMI ID ami-0f8e81a3da6e2510a. We initially launched an m5a.large instance, and after configuration ran the pipeline on an m5a.2xlarge instance. To set up the instance to run the pipeline, we installed and configured miniconda with the commands below.
curl -JLO https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh # download the miniconda installation script
bash Miniconda3-latest-Linux-x86_64.sh # run the miniconda installation script. Accept the license and follow the defaults.
source ~/.bashrc # source the .bashrc for miniconda to be available in the environment
# configure miniconda channel order
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
conda install mamba # install mamba for faster software installation.
First, clone the repository to your computer.
git clone git@github.com:Arcadia-Science/2023-amblyomma-americanum-diffex.git # assumes SSH is configured
This repository uses snakemake to run the pipeline and conda to manage software environments and installations.
You can find operating system-specific instructions for installing miniconda here (see above for Linux).
We installed Miniconda3 version py311_23.5.2-0 and mamba version 1.4.9.
After installing conda and mamba, run the following commands to create and activate the pipeline run environment.
mamba env create -n amam --file environment.yml
conda activate amam
To start the pipeline, run:
snakemake --use-conda -j 2
The pipeline processes accessions listed in a TSV file, which defaults to inputs/metadata.tsv.
This can be changed on line 4 of the Snakefile.
See this guide for how we recognize feedback and contributions on our code.