fix rendering of DE tutorial #1493

Merged
merged 6 commits into from
Jan 24, 2025
@@ -1,5 +1,6 @@
---
order: 4
shortTitle: RStudio
---

# Differential Analysis with DESeq2
@@ -33,9 +34,7 @@ As in all analysis, firstly we need to create a new project:

2. Select **New Directory**, **New Project**, name the project as shown below and click on **Create Project**;

<figure markdown="span">
![r_project](./img/project_R.png){ width="400" }
</figure>
![r_project](../differential_expression_analysis/img/project_R.png)

3. The new project will be automatically opened in RStudio.

@@ -48,9 +47,7 @@ To store our results in an organized way, we will create a folder named **de_res

and save the file as **de_script.R**. From now on, each command described in the tutorial can be added to your script. The resulting working directory should look like this:

<figure markdown="span">
![work_dir](./img/workdir_RStudio.png){ width="600" }
</figure>
![work_dir](../differential_expression_analysis/img/workdir_RStudio.png)

The analysis requires several R packages. To utilise them, we need to load the following libraries:

@@ -162,9 +159,7 @@ design(dds_new) # to check the design formula

Comparing the structure of the newly created dds (`dds_new`) with the one automatically produced by the pipeline (`dds`), we can observe the differences:

<figure markdown="span">
![comparison_dds](./img/dds_comparison.png){ width="400" }
</figure>
![comparison_dds](../differential_expression_analysis/img/dds_comparison.png)

Before running the different steps of the analysis, it is good practice to pre-filter the genes to remove those with very low counts. This improves computational efficiency and enhances interpretability. In general, it is reasonable to keep only genes with a count of at least 10 in a minimum of 3 samples:
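
As a minimal sketch of the pre-filtering rule described above, the following base R example applies it to a toy count matrix. The matrix values and the names `counts_toy`, `min_count` and `min_samples` are assumptions made for illustration; they are not taken from the tutorial data.

```r
# Toy count matrix: 4 genes (rows) x 6 samples (columns); values are invented
counts_toy <- matrix(
  c(  0,   2,   1,   0,   3,   1,   # gene_a: low counts everywhere
     15,  22,  30,   4,   6,   8,   # gene_b: >= 10 in three samples
     12,  11,   0,   0,   1,   2,   # gene_c: >= 10 in only two samples
    250, 310, 280, 500, 620, 580),  # gene_d: high counts everywhere
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene_", letters[1:4]), paste0("sample_", 1:6))
)

min_count   <- 10  # minimum count a gene must reach in a sample
min_samples <- 3   # minimum number of samples that must reach min_count

# For each gene, count the samples reaching min_count, then apply the sample threshold
keep <- rowSums(counts_toy >= min_count) >= min_samples
counts_filtered <- counts_toy[keep, , drop = FALSE]

rownames(counts_filtered)
```

On a `DESeqDataSet`, the same logical vector would typically be computed from `counts(dds_new)` and used to subset the object by rows.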

@@ -438,7 +433,7 @@ plotCounts(dds_final, gene = "ENSG00000142192")
dev.off()
```

**heatmap**: plot of the normalised counts for all the significant genes obtained with the `pheatmap()` function. The heatmap provides insights into genes and sample relationships that may not be apparent from individual gene plots alone.
- **heatmap**: plot of the normalised counts for all the significant genes obtained with the `pheatmap()` function. The heatmap provides insights into genes and sample relationships that may not be apparent from individual gene plots alone.

```r
#### Heatmap ####
@@ -14,17 +14,13 @@ The results illustrated in this section might show slight variations compared to

The first plot we will examine is the Principal Component Analysis (PCA) plot. Since we're working with simulated data, our metadata is relatively simple, consisting of just three variables: `sample`, `condition`, and `replica`. In a typical RNA-seq experiment, however, metadata can be complex and encompass a wide range of variables that could contribute to sample variation, such as sex, age, and developmental stage.

<figure markdown="span">
![pca](./img/pca_plot.png){ width="400" }
</figure>
![pca](../differential_expression_analysis/img/pca_plot.png)

By plotting the PCA on the PC1 and PC2 axes, using `condition` as the main variable of interest, we can quickly identify the primary source of variation in our data. By accounting for this variation in our design model, we should be able to detect more differentially expressed genes related to `condition`. When working with real data, it's often useful to plot the data using different variables to explore how much variation is explained by the first two PCs. Depending on the results, it may be informative to examine variation on additional PC axes, such as PC3 and PC4, to gain a more comprehensive understanding of the data.

Next, we will examine the hierarchical clustering plot to explore the relationships between samples based on their gene expression profiles. The heatmap is organized such that samples with similar expression profiles are close to each other, allowing us to identify patterns in the data.

<figure markdown="span">
![cluster](./img/hierarchical_clustering.png){ width="400" }
</figure>
![cluster](../differential_expression_analysis/img/hierarchical_clustering.png)

Remember that to create this plot, we utilized the `dist()` function, so the legend on the right shows distances between samples: a value of 0 corresponds to identical expression profiles, while a value of 5 corresponds to very dissimilar profiles. Similar to PCA, we can see that samples cluster together according to `condition`: we observe a high degree of similarity among the three control samples and among the three treated samples.
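
To make the meaning of the distance scale concrete, here is a small base R sketch on invented values. The matrix `expr_toy` and its sample names are hypothetical and only stand in for transformed expression data.

```r
# Toy matrix of transformed expression values: 3 genes x 4 samples (invented)
expr_toy <- matrix(
  c(5.0, 5.0, 8.0, 8.5,
    2.0, 2.0, 6.0, 6.5,
    7.0, 7.0, 3.0, 2.5),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("gene_", 1:3),
                  c("control_1", "control_2", "treated_1", "treated_2"))
)

# dist() compares rows, so transpose to obtain sample-to-sample Euclidean distances
sample_dist <- dist(t(expr_toy))
sample_dist_matrix <- as.matrix(sample_dist)

# Hierarchical clustering on the distances, cut into two groups
hc <- hclust(sample_dist)
clusters <- cutree(hc, k = 2)

sample_dist_matrix
clusters
```

Identical profiles give a distance of 0, and the distance grows as profiles diverge, which is why samples from the same condition end up in the same cluster in this toy example.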

@@ -35,11 +31,9 @@ Overall, the integration of these plots suggests that we are working with high-q
In this part of the tutorial, we will examine plots that are generated after the differential expression analysis. These plots are not quality control plots, but rather plots that help us to interpret the results.
After running the `results()` function, a good way to get a first impression of the results is to look at the MA plot.

<figure markdown="span">
![ma_plot](./img/MA_plot.png){ width="500" }
</figure>
![ma_plot](../differential_expression_analysis/img/MA_plot.png)

By default, genes are coloured in blue if the padj is less than 0.1 and the log2 fold change greater than or less than 0. Genes that fall outside the plotting region are represented as open triangles. At this stage, we have not yet applied a filter to select only significant DE genes, which we define as those with a padj value less than 0.5 and a log2 fold change of at least 1 or -1.
By default, genes are coloured in blue if the padj is less than 0.1 and the log2 fold change greater than or less than 0. Genes that fall outside the plotting region are represented as open triangles. At this stage, we have not yet applied a filter to select only significant DE genes, which we define as those with a padj value less than 0.05 and a log2 fold change of at least 1 or -1.
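
The significance definition above (padj below 0.05 and a log2 fold change of at least 1 in either direction) can be sketched in base R as follows. The data frame `res_toy` and its values are invented for illustration; only the column names `log2FoldChange` and `padj` mirror those of a DESeq2 results table.

```r
# Toy results table with invented values
res_toy <- data.frame(
  gene           = paste0("gene_", 1:6),
  log2FoldChange = c(1.8, -1.3, 0.4, 2.1, -0.2, 1.5),
  padj           = c(0.001, 0.02, 0.03, 0.20, 0.80, NA)
)

padj_cutoff <- 0.05  # adjusted p-value threshold
lfc_cutoff  <- 1     # absolute log2 fold change threshold

# Genes with a missing padj are treated as not significant
is_sig <- !is.na(res_toy$padj) &
  res_toy$padj < padj_cutoff &
  abs(res_toy$log2FoldChange) >= lfc_cutoff

res_sig <- res_toy[is_sig, ]
res_sig$direction <- ifelse(res_sig$log2FoldChange > 0, "up", "down")

res_sig
```

Handling `NA` explicitly matters because adjusted p-values can be missing for some genes, and a bare comparison would otherwise propagate `NA` into the row selection.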

After filtering our genes of interest according to our threshold, let's have a look at our significant genes:

@@ -54,35 +48,27 @@ ENSG00000156282 481.7624 1.095272 0.2969594 3.688289

To gain a comprehensive overview of the transcriptional profile, the volcano plot represents a highly informative tool.

<figure markdown="span">
![volcano_plot](./img/volcanoplot.png){ width="400"}
</figure>
![volcano_plot](../differential_expression_analysis/img/volcanoplot.png)

The treatment induced differential expression in five genes: one downregulated and four upregulated. This plot visually represents the numerical results reported in the table above.

After the identification of DE genes, it's informative to visualise the expression of specific genes of interest. The `plotCounts()` function applied directly on the `dds` object allows us to examine individual gene expression profiles without accessing the full `res` object.

<figure markdown="span">
![counts](./img/plotCounts.png){ width="400" }
</figure>
![counts](../differential_expression_analysis/img/plotCounts.png)

In our example, post-treatment, we observe a significant increase in the expression of the _ENSG00000142192_ gene, highlighting its responsiveness to the experimental conditions.

Finally, we can create a heatmap using the normalised expression counts of DE genes. The resulting heatmap visualises how the expression of significant genes varies across samples. Each row represents a gene, and each column represents a sample. The color intensity in the heatmap reflects the normalised expression levels: red colors indicate higher expression, while blue colors indicate lower expression.

<figure markdown="span">
![heatmap](./img/heatmap_de_genes.png){ width="400" }
</figure>
![heatmap](../differential_expression_analysis/img/heatmap_de_genes.png)

By examining the heatmap, we can visually identify the expression patterns of our five significant differentially expressed genes. This visualisation allows us to identify how these genes respond to the treatment. The heatmap provides a clear and intuitive way to explore gene expression dynamics.

## Over Representation Analysis (ORA)

Finally, we can attempt to assign biological significance to our differentially expressed genes through **Over Representation Analysis (ORA)**. The ORA analysis identifies specific biological pathways, molecular functions and cellular processes, according to the **Gene Ontology (GO)** database, that are enriched within our differentially expressed genes.

<figure markdown="span">
![enrichment](./img/enrichment_plot.png){ width="400" }
</figure>
![enrichment](../differential_expression_analysis/img/enrichment_plot.png)

The enrichment analysis reveals a possible involvement of cellular structures and processes, including "clathrin-coated pit", "dendritic spine", "neuron spine" and "endoplasmic reticulum lumen". These terms suggest a focus on cellular transport, structural integrity and protein processing, especially in neural contexts. This pattern points to pathways related to cellular organization and maintenance, possibly playing an important role in the biological condition under study.

@@ -6,7 +6,9 @@ order: 1

These pages are a tutorial workshop for the [Nextflow](https://www.nextflow.io) pipeline [nf-core/rnaseq](https://nf-co.re/rnaseq).

In this workshop, we will recap the application of next generation sequencing to identify differentially expressed genes. You will learn how to use the rnaseq pipeline to carry out this data-intensive workflow efficiently. We will cover topics such as configuration of the pipeline, code execution and data interpretation.
In this workshop, we will recap the application of next generation sequencing to identify differentially expressed genes.
You will learn how to use the rnaseq pipeline to carry out this data-intensive workflow efficiently.
We will cover topics such as configuration of the pipeline, code execution and data interpretation.

Please note that this is not an introductory workshop, and we will assume some basic familiarity with Nextflow.

@@ -37,7 +39,9 @@ Now you're all set and can use the following button to launch the service:

## Credits & Copyright

This training material has been written and completed by [Lorenzo Sola](https://github.com/LorenzoS96), [Francesco Lescai](https://github.com/lescai), and [Mariangela Santorsola](https://github.com/msantorsola) during the [nf-core](https://nf-co.re) Hackathon in Barcelona, 2024. Thank you to [Victoria Cepeda](https://github.com/vcepeda) for her contributions to the tutorial's revision. The tutorial is aimed at anyone who is interested in using nf-core pipelines for their studies or research activities.
This training material has been written and completed by [Lorenzo Sola](https://github.com/LorenzoS96), [Francesco Lescai](https://github.com/lescai), and [Mariangela Santorsola](https://github.com/msantorsola) during the [nf-core](https://nf-co.re) Hackathon in Barcelona, 2024.
Thank you to [Victoria Cepeda](https://github.com/vcepeda) for her contributions to the tutorial's revision.
The tutorial is aimed at anyone who is interested in using nf-core pipelines for their studies or research activities.

The Docker image and Gitpod environment used in this repository have been created by [Seqera](https://seqera.io) but have been made open-source ([CC BY-NC-ND](https://creativecommons.org/licenses/by-nc-nd/4.0/)) for the community.

@@ -1,5 +1,6 @@
---
order: 3
shortTitle: rnaseq pipeline
---

# The nf-core/rnaseq pipeline
@@ -10,9 +11,7 @@ In order to carry out a RNA-Seq analysis we will use the nf-core pipeline [rnase

The pipeline is organised following the different blocks shown below: pre-processing, traditional alignment (or lightweight alignment) and quantification, post-processing and final QC.

<figure markdown="span">
![metromap](./img/nf-core-rnaseq_metro_map_grey.png){ width="1000"}
</figure>
![metromap](../differential_expression_analysis/img/nf-core-rnaseq_metro_map_grey.png)

In each process, the users can choose among a range of different options. Importantly, the users can decide to follow one of the two different routes in the alignment and quantification step:

@@ -25,7 +24,7 @@ In each process, the users can choose among a range of different options. Import
The number of reads and the number of biological replicates are two critical factors that researchers need to carefully consider during the design of an RNA-seq experiment. While it may seem intuitive that having a large number of reads is always desirable, an excessive number can lead to unnecessary costs and computational burdens without providing significant improvements. Instead, it is often more beneficial to prioritise the number of biological replicates, as this captures the natural biological variation of the data. Biological replicates involve collecting and sequencing RNA from distinct biological samples (e.g., different individuals, tissues, or time points), helping to detect genuine changes in gene expression.

:::warning
This concept must not be confused with technical replicates that assess the technical variability of the sequencing platform by sequencing the same RNA sample multiple time.
This concept must not be confused with technical replicates that assess the technical variability of the sequencing platform by sequencing the same RNA sample multiple times.
:::

To obtain optimal results, it is crucial to balance the number of biological replicates and the sequencing depth. While increasing the depth of sequencing enhances the ability to detect genes with low expression levels, there is a plateau beyond which no further benefits are gained. Statistical power calculations can inform experimental design by estimating the optimal number of reads and replicates required. For instance, this approach helps to establish a suitable log2 fold change threshold for the DE analysis. By incorporating multiple biological replicates into the design and optimizing sequencing depth, researchers can enhance the statistical power of the analysis, reducing the number of false positive results, and increasing the reliability of the findings.
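
As a rough illustration of how a power calculation relates effect size to the number of replicates, the sketch below uses `power.t.test()` from base R. This is a simplifying assumption: it treats log2 expression values as approximately normal, whereas dedicated RNA-seq power tools model counts directly. The standard deviation and effect sizes are invented values, not estimates from the tutorial data.

```r
# Assumed (hypothetical) variability of log2 expression between biological replicates
sd_log2 <- 0.5

# Replicates per group needed to detect a log2 fold change of 1
# with 80% power at a 5% significance level (two-sample t-test approximation)
n_large_effect <- power.t.test(delta = 1, sd = sd_log2,
                               sig.level = 0.05, power = 0.8)$n

# The same calculation for a smaller log2 fold change of 0.5
n_small_effect <- power.t.test(delta = 0.5, sd = sd_log2,
                               sig.level = 0.05, power = 0.8)$n

# Round up, since the number of replicates must be a whole number
ceiling(c(lfc_1 = n_large_effect, lfc_0.5 = n_small_effect))
```

Under these assumptions, detecting a smaller fold change requires more replicates per group, which is one way such a calculation can inform the choice of a log2 fold change threshold.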