diff --git a/docs/usage/DEanalysis/de_rstudio.md b/docs/usage/differential_expression_analysis/de_rstudio.md similarity index 97% rename from docs/usage/DEanalysis/de_rstudio.md rename to docs/usage/differential_expression_analysis/de_rstudio.md index f3adcc30b..6d9c4dd2e 100644 --- a/docs/usage/DEanalysis/de_rstudio.md +++ b/docs/usage/differential_expression_analysis/de_rstudio.md @@ -1,5 +1,6 @@ --- order: 4 +shortTitle: RStudio --- # Differential Analysis with DESeq2 @@ -33,9 +34,7 @@ As in all analysis, firstly we need to create a new project: 2. Select **New Directory**, **New Project**, name the project as shown below and click on **Create Project**; -
- ![r_project](./img/project_R.png){ width="400" } -
+![r_project](../differential_expression_analysis/img/project_R.png) 3. The new project will be automatically opened in RStudio. @@ -48,9 +47,7 @@ To store our results in an organized way, we will create a folder named **de_res and save the file as **de_script.R**. From now on, each command described in the tutorial can be added to your script. The resulting working directory should look like this: -
- ![work_dir](./img/workdir_RStudio.png){ width="600" } -
+![work_dir](../differential_expression_analysis/img/workdir_RStudio.png) The analysis requires several R packages. To utilise them, we need to load the following libraries: @@ -162,9 +159,7 @@ design(dds_new) # to check the design formula Comparing the structure of the newly created dds (`dds_new`) with the one automatically produced by the pipeline (`dds`), we can observe the differences: -
- ![comparison_dds](./img/dds_comparison.png){ width="400" } -
+![comparison_dds](../differential_expression_analysis/img/dds_comparison.png) Before running the different steps of the analysis, a good practice consists in pre-filtering the genes to remove those with very low counts. This is useful to improve computational efficiency and enhance interpretability. In general, it is reasonable to keep only genes with a sum counts of at least 10 for a minimal number of 3 samples: @@ -438,7 +433,7 @@ plotCounts(dds_final, gene = "ENSG00000142192") dev.off() ``` -**heatmap**: plot of the normalised counts for all the significant genes obtained with the `pheatmap()` function. The heatmap provides insights into genes and sample relationships that may not be apparent from individual gene plots alone. +- **heatmap**: plot of the normalised counts for all the significant genes obtained with the `pheatmap()` function. The heatmap provides insights into genes and sample relationships that may not be apparent from individual gene plots alone. ```r #### Heatmap #### diff --git a/docs/usage/DEanalysis/img/DESeq_function.png b/docs/usage/differential_expression_analysis/img/DESeq_function.png similarity index 100% rename from docs/usage/DEanalysis/img/DESeq_function.png rename to docs/usage/differential_expression_analysis/img/DESeq_function.png diff --git a/docs/usage/DEanalysis/img/Excalidraw_RNAseq.png b/docs/usage/differential_expression_analysis/img/Excalidraw_RNAseq.png similarity index 100% rename from docs/usage/DEanalysis/img/Excalidraw_RNAseq.png rename to docs/usage/differential_expression_analysis/img/Excalidraw_RNAseq.png diff --git a/docs/usage/DEanalysis/img/MA_plot.png b/docs/usage/differential_expression_analysis/img/MA_plot.png similarity index 100% rename from docs/usage/DEanalysis/img/MA_plot.png rename to docs/usage/differential_expression_analysis/img/MA_plot.png diff --git a/docs/usage/DEanalysis/img/RNA_seq_scheme_tutorial.png b/docs/usage/differential_expression_analysis/img/RNA_seq_scheme_tutorial.png similarity index 
100% rename from docs/usage/DEanalysis/img/RNA_seq_scheme_tutorial.png rename to docs/usage/differential_expression_analysis/img/RNA_seq_scheme_tutorial.png diff --git a/docs/usage/DEanalysis/img/count_distribution.png b/docs/usage/differential_expression_analysis/img/count_distribution.png similarity index 100% rename from docs/usage/DEanalysis/img/count_distribution.png rename to docs/usage/differential_expression_analysis/img/count_distribution.png diff --git a/docs/usage/DEanalysis/img/dds_comparison.png b/docs/usage/differential_expression_analysis/img/dds_comparison.png similarity index 100% rename from docs/usage/DEanalysis/img/dds_comparison.png rename to docs/usage/differential_expression_analysis/img/dds_comparison.png diff --git a/docs/usage/DEanalysis/img/dispersion_estimates.png b/docs/usage/differential_expression_analysis/img/dispersion_estimates.png similarity index 100% rename from docs/usage/DEanalysis/img/dispersion_estimates.png rename to docs/usage/differential_expression_analysis/img/dispersion_estimates.png diff --git a/docs/usage/DEanalysis/img/enrichment_plot.png b/docs/usage/differential_expression_analysis/img/enrichment_plot.png similarity index 100% rename from docs/usage/DEanalysis/img/enrichment_plot.png rename to docs/usage/differential_expression_analysis/img/enrichment_plot.png diff --git a/docs/usage/DEanalysis/img/heatmap_de_genes.png b/docs/usage/differential_expression_analysis/img/heatmap_de_genes.png similarity index 100% rename from docs/usage/DEanalysis/img/heatmap_de_genes.png rename to docs/usage/differential_expression_analysis/img/heatmap_de_genes.png diff --git a/docs/usage/DEanalysis/img/hierarchical_clustering.png b/docs/usage/differential_expression_analysis/img/hierarchical_clustering.png similarity index 100% rename from docs/usage/DEanalysis/img/hierarchical_clustering.png rename to docs/usage/differential_expression_analysis/img/hierarchical_clustering.png diff --git 
a/docs/usage/DEanalysis/img/nf-core-rnaseq_metro_map_grey.png b/docs/usage/differential_expression_analysis/img/nf-core-rnaseq_metro_map_grey.png similarity index 100% rename from docs/usage/DEanalysis/img/nf-core-rnaseq_metro_map_grey.png rename to docs/usage/differential_expression_analysis/img/nf-core-rnaseq_metro_map_grey.png diff --git a/docs/usage/DEanalysis/img/overdispersion.png b/docs/usage/differential_expression_analysis/img/overdispersion.png similarity index 100% rename from docs/usage/DEanalysis/img/overdispersion.png rename to docs/usage/differential_expression_analysis/img/overdispersion.png diff --git a/docs/usage/DEanalysis/img/pca_plot.png b/docs/usage/differential_expression_analysis/img/pca_plot.png similarity index 100% rename from docs/usage/DEanalysis/img/pca_plot.png rename to docs/usage/differential_expression_analysis/img/pca_plot.png diff --git a/docs/usage/DEanalysis/img/plotCounts.png b/docs/usage/differential_expression_analysis/img/plotCounts.png similarity index 100% rename from docs/usage/DEanalysis/img/plotCounts.png rename to docs/usage/differential_expression_analysis/img/plotCounts.png diff --git a/docs/usage/DEanalysis/img/project_R.png b/docs/usage/differential_expression_analysis/img/project_R.png similarity index 100% rename from docs/usage/DEanalysis/img/project_R.png rename to docs/usage/differential_expression_analysis/img/project_R.png diff --git a/docs/usage/DEanalysis/img/volcanoplot.png b/docs/usage/differential_expression_analysis/img/volcanoplot.png similarity index 100% rename from docs/usage/DEanalysis/img/volcanoplot.png rename to docs/usage/differential_expression_analysis/img/volcanoplot.png diff --git a/docs/usage/DEanalysis/img/workdir_RStudio.png b/docs/usage/differential_expression_analysis/img/workdir_RStudio.png similarity index 100% rename from docs/usage/DEanalysis/img/workdir_RStudio.png rename to docs/usage/differential_expression_analysis/img/workdir_RStudio.png diff --git 
a/docs/usage/DEanalysis/interpretation.md b/docs/usage/differential_expression_analysis/interpretation.md similarity index 91% rename from docs/usage/DEanalysis/interpretation.md rename to docs/usage/differential_expression_analysis/interpretation.md index 476f627cd..4d2d1eb85 100644 --- a/docs/usage/DEanalysis/interpretation.md +++ b/docs/usage/differential_expression_analysis/interpretation.md @@ -14,17 +14,13 @@ The results illustrated in this section might show slight variations compared to The first plot we will examine is the Principal Component Analysis (PCA) plot. Since we're working with simulated data, our metadata is relatively simple, consisting of just three variables: `sample`, `condition`, and `replica`. In a typical RNA-seq experiment, however, metadata can be complex and encompass a wide range of variables that could contribute to sample variation, such as sex, age, and developmental stage. -
- ![pca](./img/pca_plot.png){ width="400" } -
+![pca](../differential_expression_analysis/img/pca_plot.png) By plotting the PCA on the PC1 and PC2 axes, using `condition` as the main variable of interest, we can quickly identify the primary source of variation in our data. By accounting for this variation in our design model, we should be able to detect more differentially expressed genes related to `condition`. When working with real data, it's often useful to plot the data using different variables to explore how much variation is explained by the first two PCs. Depending on the results, it may be informative to examine variation on additional PC axes, such as PC3 and PC4, to gain a more comprehensive understanding of the data. Next, we will examine the hierarchical clustering plot to explore the relationships between samples based on their gene expression profiles. The heatmap is organized such that samples with similar expression profiles are close to each other, allowing us to identify patterns in the data. -
- ![cluster](./img/hierarchical_clustering.png){ width="400" } -
+![cluster](../differential_expression_analysis/img/hierarchical_clustering.png) Remember that to create this plot, we utilized the `dist()` function, so in the legend on the right, a value of 0 corresponds to high correlation, while a value of 5 corresponds to very low correlation. Similar to PCA, we can see that samples tend to cluster together according to `condition`, indeed we can observe a high degree of correlation between the three control samples and between the three treated samples. @@ -35,11 +31,9 @@ Overall, the integration of these plots suggests that we are working with high-q In this part of the tutorial, we will examine plots that are generated after the differential expression analysis. These plots are not quality control plots, but rather plots that help us to interpret the results. After running the `results()` function, a good way to start to have an idea about the results is to look at the MA plot. -
- ![ma_plot](./img/MA_plot.png){ width="500" } -
+![ma_plot](../differential_expression_analysis/img/MA_plot.png) -By default, genes are coloured in blue if the padj is less than 0.1 and the log2 fold change greater than or less than 0. Genes that fall outside the plotting region are represented as open triangles. At this stage, we have not yet applied a filter to select only significant DE genes, which we define as those with a padj value less than 0.5 and a log2 fold change of at least 1 or -1. +By default, genes are coloured in blue if the padj is less than 0.1 and the log2 fold change greater than or less than 0. Genes that fall outside the plotting region are represented as open triangles. At this stage, we have not yet applied a filter to select only significant DE genes, which we define as those with a padj value less than 0.05 and a log2 fold change of at least 1 or -1. After filtering our genes of interest according to our threshold, let's have a look at our significant genes: @@ -54,25 +48,19 @@ ENSG00000156282 481.7624 1.095272 0.2969594 3.688289 To gain a comprehensive overview of the transcriptional profile, the volcano plot represents a highly informative tool. -
- ![volcano_plot](./img/volcanoplot.png){ width="400"} -
+![volcano_plot](../differential_expression_analysis/img/volcanoplot.png) The treatment induced differential expression in five genes: one downregulated and four upregulated. This plot visually represents the numerical results reported in the table above. After the identification of DE genes, it's informative to visualise the expression of specific genes of interest. The `plotCounts()` function applied directly on the `dds` object allows us to examine individual gene expression profiles without accessing the full `res` object. -
- ![counts](./img/plotCounts.png){ width="400" } -
+![counts](../differential_expression_analysis/img/plotCounts.png) In our example, post-treatment, we observe a significant increase in the expression of the _ENSG00000142192_ gene, highlighting its responsiveness to the experimental conditions. Finally, we can create a heatmap using the normalised expression counts of DE genes. The resulting heatmap visualises how the expression of significant genes varies across samples. Each row represents a gene, and each column represents a sample. The color intensity in the heatmap reflects the normalised expression levels: red colors indicate higher expression, while blue colors indicate lower expression. -
- ![heatmap](./img/heatmap_de_genes.png){ width="400" } -
+![heatmap](../differential_expression_analysis/img/heatmap_de_genes.png) By examining the heatmap, we can visually identify the expression patterns of our five significant differentially expressed genes. This visualisation allows us to identify how these genes respond to the treatment. The heatmap provides a clear and intuitive way to explore gene expression dynamics. @@ -80,9 +68,7 @@ By examining the heatmap, we can visually identify the expression patterns of ou Finally, we can attempt to assign biological significance to our differentially expressed genes through **Over Representation Analysis (ORA)**. The ORA analysis identifies specific biological pathways, molecular functions and cellular processes, according to the **Gene Ontology (GO)** database, that are enriched within our differentially expressed genes. -
- ![enrichment](./img/enrichment_plot.png){ width="400" } -
+![enrichment](../differential_expression_analysis/img/enrichment_plot.png) The enrichment analysis reveals a possible involvement of cellular structures and processes, including "clathrin-coated pit", "dendritic spine", "neuron spine" and "endoplasmic reticulum lumen". These terms suggest a focus on cellular transport, structural integrity and protein processing, especially in neural contexts. This pattern points to pathways related to cellular organization and maintenance, possibly playing an important role in the biological condition under study. diff --git a/docs/usage/DEanalysis/index.md b/docs/usage/differential_expression_analysis/introduction.md similarity index 80% rename from docs/usage/DEanalysis/index.md rename to docs/usage/differential_expression_analysis/introduction.md index 56d9d5e24..49437b1a9 100644 --- a/docs/usage/DEanalysis/index.md +++ b/docs/usage/differential_expression_analysis/introduction.md @@ -6,7 +6,9 @@ order: 1 These pages are a tutorial workshop for the [Nextflow](https://www.nextflow.io) pipeline [nf-core/rnaseq](https://nf-co.re/rnaseq). -In this workshop, we will recap the application of next generation sequencing to identify differentially expressed genes. You will learn how to use the rnaseq pipeline to carry out this data-intensive workflow efficiently. We will cover topics such as configuration of the pipeline, code execution and data interpretation. +In this workshop, we will recap the application of next generation sequencing to identify differentially expressed genes. +You will learn how to use the rnaseq pipeline to carry out this data-intensive workflow efficiently. +We will cover topics such as configuration of the pipeline, code execution and data interpretation. Please note that this is not an introductory workshop, and we will assume some basic familiarity with Nextflow. 
@@ -37,7 +39,9 @@ Now you're all set and can use the following button to launch the service: ## Credits & Copyright -This training material has been written and completed by [Lorenzo Sola](https://github.com/LorenzoS96), [Francesco Lescai](https://github.com/lescai), and [Mariangela Santorsola](https://github.com/msantorsola) during the [nf-core](https://nf-co.re) Hackathon in Barcellona, 2024. Thank you to [Victoria Cepeda](https://github.com/vcepeda) for her contributions to the tutorial's revision. The tutorial is aimed at anyone who is interested in using nf-core pipelines for their studies or research activities. +This training material has been written and completed by [Lorenzo Sola](https://github.com/LorenzoS96), [Francesco Lescai](https://github.com/lescai), and [Mariangela Santorsola](https://github.com/msantorsola) during the [nf-core](https://nf-co.re) Hackathon in Barcelona, 2024. +Thank you to [Victoria Cepeda](https://github.com/vcepeda) for her contributions to the tutorial's revision. +The tutorial is aimed at anyone who is interested in using nf-core pipelines for their studies or research activities. The Docker image and Gitpod environment used in this repository have been created by [Seqera](https://seqera.io) but have been made open-source ([CC BY-NC-ND](https://creativecommons.org/licenses/by-nc-nd/4.0/)) for the community. 
diff --git a/docs/usage/DEanalysis/rnaseq.md b/docs/usage/differential_expression_analysis/rnaseq.md similarity index 98% rename from docs/usage/DEanalysis/rnaseq.md rename to docs/usage/differential_expression_analysis/rnaseq.md index 9fd0e123e..a19c7128a 100644 --- a/docs/usage/DEanalysis/rnaseq.md +++ b/docs/usage/differential_expression_analysis/rnaseq.md @@ -1,5 +1,6 @@ --- order: 3 +shortTitle: rnaseq pipeline --- # The nf-core/rnaseq pipeline @@ -10,9 +11,7 @@ In order to carry out a RNA-Seq analysis we will use the nf-core pipeline [rnase The pipeline is organised following the different blocks shown below: pre-processing, traditional alignment (or lightweight alignment) and quantification, post-processing and final QC. -
- ![metromap](./img/nf-core-rnaseq_metro_map_grey.png){ width="1000"} -
+![metromap](../differential_expression_analysis/img/nf-core-rnaseq_metro_map_grey.png) In each process, the users can choose among a range of different options. Importantly, the users can decide to follow one of the two different routes in the alignment and quantification step: @@ -25,7 +24,7 @@ In each process, the users can choose among a range of different options. Import The number of reads and the number of biological replicates are two critical factors that researchers need to carefully consider during the design of a RNA-seq experiment. While it may seem intuitive that having a large number of reads is always desirable, an excessive number can lead to unnecessary costs and computational burdens, without providing significant improvements. Instead, it is often more beneficial to prioritise the number of biological replicates, as it allows to capture the natural biological variation of the data. Biological replicates involve collecting and sequencing RNA from distinct biological samples (e.g., different individuals, tissues, or time points), helping to detect genuine changes in gene expression. :::warning -This concept must not be confused with technical replicates that asses the technical variability of the sequencing platform by sequencing the same RNA sample multiple time. +This concept must not be confused with technical replicates that assess the technical variability of the sequencing platform by sequencing the same RNA sample multiple times. ::: To obtain optimal results, it is crucial to balance the number of biological replicates and the sequencing depth. While increasing the depth of sequencing enhances the ability to detect genes with low expression levels, there is a plateau beyond which no further benefits are gained. Statistical power calculations can inform experimental design by estimating the optimal number of reads and replicates required. 
For instance, this approach helps to establish a suitable log2 fold change threshold for the DE analysis. By incorporating multiple biological replicates into the design and optimizing sequencing depth, researchers can enhance the statistical power of the analysis, reducing the number of false positive results, and increasing the reliability of the findings. diff --git a/docs/usage/DEanalysis/theory.md b/docs/usage/differential_expression_analysis/theory.md similarity index 97% rename from docs/usage/DEanalysis/theory.md rename to docs/usage/differential_expression_analysis/theory.md index dd932fbb6..acc90c10d 100644 --- a/docs/usage/DEanalysis/theory.md +++ b/docs/usage/differential_expression_analysis/theory.md @@ -12,9 +12,7 @@ Given the central role of RNA in a wide range of molecular functions, RNA-seq ha After RNA extraction and reverse transcription into complementary DNA (cDNA), the biological material is sequenced, generating NGS "reads" that correspond to the RNA captured in a specific cell, tissue, or organ at a given time. The sequencing data is then bioinformatically processed through a typical workflow summarised in the diagram below: -
- ![excalidraw](./img/Excalidraw_RNAseq.png){ width="1000" } -
+![excalidraw](../differential_expression_analysis/img/Excalidraw_RNAseq.png) In the scheme, we can identify three key phases in the workflow: @@ -94,15 +92,11 @@ The results will not be affected by the order of variables but the common practi RNA-seq data typically contain a large number of genes with low expression counts, indicating that many genes are expressed at very low levels across samples. At the same time, RNA-seq data exhibit a skewed distribution with a long right tail due to the absence of an upper limit for gene expression levels. This means that while most genes have low to moderate expression levels, a small number are expressed at high levels. Accurate statistical modelling must therefore account for this distribution to avoid misleading conclusions. -
- ![count_distribution](./img/count_distribution.png){ width="400"} -
+![count_distribution](../differential_expression_analysis/img/count_distribution.png) The core of the differential expression analysis is the `DESeq()` function, a wrapper that streamlines several key steps into a single command. The different functions are listed below: -
- ![deseq2_function](./img/DESeq_function.png){ width="400"} -
+![deseq2_function](../differential_expression_analysis/img/DESeq_function.png) :::note While `DESeq()` combines these steps, a user could choose to perform each function separately to have more control over the whole process. @@ -128,9 +122,7 @@ While normalised counts are useful for downstream visualisation of results, they 2. **Estimate dispersion and gene-wise dispersion**: the dispersion is a measure of how much the variance deviates from the mean. The dispersion estimates indicate the variance in gene expression at a specific mean expression level. Importantly, RNA-seq data are characterised by **overdispersion**, where the variance in gene expression levels often exceeds the mean (variance > mean). -
- ![overdispersion](./img/overdispersion.png){ width="400"} -
+![overdispersion](../differential_expression_analysis/img/overdispersion.png) DESeq2 addresses this issue by employing the **negative binomial distribution**, which generalises the Poisson distribution by introducing an additional dispersion parameter. This parameter quantifies the extra variability present in RNA-seq data, providing a more realistic representation than the Poisson distribution, which assumes mean = variance. DESeq2 starts by estimating the **common dispersion**, a single estimate of dispersion applicable to all genes in the dataset. This estimate provides a baseline for variability across all genes in the dataset. Next, DESeq2 estimates **gene-wise dispersion**, a separate estimate of dispersion for each individual gene, taking into account that different genes may exhibit varying levels of expression variability due to biological differences. The dispersion parameter (α) is related to the mean (μ), and variance of the data, as described by the equation: @@ -145,9 +137,7 @@ A key feature of DESeq2's dispersion estimates is their negative correlation wit 4. **Final dispersion estimates**: DESeq2 refines the gene-wise dispersion by shrinking it towards the fitted curve. The "shrinkage" helps control for overfitting, and makes the dispersion estimates more reliable. The strength of the shrinkage depends on the sample size (more samples = less shrinkage), and how close the initial estimates are to the fitted curve. -
- ![dispersion](./img/dispersion_estimates.png){ width="400"} -
+![dispersion](../differential_expression_analysis/img/dispersion_estimates.png) The initial estimates (black dots) are shrunk toward the fitted curve (red line) to obtain the final estimates (blue dots). However, genes with exceptionally high dispersion values are not shrunk, as they likely deviate from the model assumptions exhibiting elevated variability due to biological or technical factors. Shrinking these values could lead to false positives. @@ -166,4 +156,4 @@ To account for this, DESeq2 employs multiple test correction methods (the Benjam By setting the FDR cutoff to < 0.05, 5% of genes identified as differentially expressed are expected to be false positives. For instance, if 400 genes are identified as differentially expressed with an FDR cutoff of 0.05, you would expect 20 of them to be false positives. ::: -After identifying DE genes using DESeq2, it is essential to interpret the biological significance of these genes through functional analysis. This involves examining the roles of the differentially expressed genes in various biological processes, molecular functions, and pathways, providing insights into the underlying mechanisms driving the observed changes in gene expression. This interpretation can help in discovering pathways involved in disease or identifying potential therapeutic targets. Different tools are available to carry out these functional analyses, such as [Gene Ontology](https://geneontology.org), [Reactome](https://reactome.org/), [KEGG](https://www.genome.jp/kegg), [clusterProfiler](https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html), [g:Profiler](https://biit.cs.ut.ee/gprofiler), and [WikiPathways](https://www.wikipathways.org). +After identifying DE genes using DESeq2, it is essential to interpret the biological significance of these genes through functional analysis. 
This involves examining the roles of the differentially expressed genes in various biological processes, molecular functions, and pathways, providing insights into the underlying mechanisms driving the observed changes in gene expression. This interpretation can help in discovering pathways involved in disease or identifying potential therapeutic targets. Different tools are available to carry out these functional analyses, such as [Gene Ontology](https://geneontology.org), [Reactome](https://reactome.org/), [KEGG](https://www.genome.jp/kegg), [clusterProfiler](https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html), [g\:Profiler](https://biit.cs.ut.ee/gprofiler), and [WikiPathways](https://www.wikipathways.org).