diff --git a/docs/usage/variantcalling/img/bqsr.excalidraw.svg b/docs/usage/variantcalling/img/bqsr.excalidraw.svg
new file mode 100644
index 0000000000..4dc41bd095
--- /dev/null
+++ b/docs/usage/variantcalling/img/bqsr.excalidraw.svg
@@ -0,0 +1,17 @@
+
\ No newline at end of file
diff --git a/docs/usage/variantcalling/img/clinvar_results.png b/docs/usage/variantcalling/img/clinvar_results.png
new file mode 100644
index 0000000000..5e88f3d961
Binary files /dev/null and b/docs/usage/variantcalling/img/clinvar_results.png differ
diff --git a/docs/usage/variantcalling/img/clinvar_search.png b/docs/usage/variantcalling/img/clinvar_search.png
new file mode 100644
index 0000000000..df8c58d84d
Binary files /dev/null and b/docs/usage/variantcalling/img/clinvar_search.png differ
diff --git a/docs/usage/variantcalling/img/gnomAD_COL6A1_v2.1.png b/docs/usage/variantcalling/img/gnomAD_COL6A1_v2.1.png
new file mode 100644
index 0000000000..2fbc497fe0
Binary files /dev/null and b/docs/usage/variantcalling/img/gnomAD_COL6A1_v2.1.png differ
diff --git a/docs/usage/variantcalling/img/gnomAD_COL6A1_v4.0.png b/docs/usage/variantcalling/img/gnomAD_COL6A1_v4.0.png
new file mode 100644
index 0000000000..ee148dba85
Binary files /dev/null and b/docs/usage/variantcalling/img/gnomAD_COL6A1_v4.0.png differ
diff --git a/docs/usage/variantcalling/img/gnomAD_constraint.png b/docs/usage/variantcalling/img/gnomAD_constraint.png
new file mode 100644
index 0000000000..9183272073
Binary files /dev/null and b/docs/usage/variantcalling/img/gnomAD_constraint.png differ
diff --git a/docs/usage/variantcalling/img/gnomad_search.png b/docs/usage/variantcalling/img/gnomad_search.png
new file mode 100644
index 0000000000..c5577353cb
Binary files /dev/null and b/docs/usage/variantcalling/img/gnomad_search.png differ
diff --git a/docs/usage/variantcalling/img/gnomad_var_present.png b/docs/usage/variantcalling/img/gnomad_var_present.png
new file mode 100644
index 0000000000..5e42034bc2
Binary files /dev/null and b/docs/usage/variantcalling/img/gnomad_var_present.png differ
diff --git a/docs/usage/variantcalling/img/interpretation.excalidraw.svg b/docs/usage/variantcalling/img/interpretation.excalidraw.svg
new file mode 100644
index 0000000000..dd9c7f6a3a
--- /dev/null
+++ b/docs/usage/variantcalling/img/interpretation.excalidraw.svg
@@ -0,0 +1,17 @@
+
\ No newline at end of file
diff --git a/docs/usage/variantcalling/img/overview.excalidraw.svg b/docs/usage/variantcalling/img/overview.excalidraw.svg
new file mode 100644
index 0000000000..6b17b2f6e5
--- /dev/null
+++ b/docs/usage/variantcalling/img/overview.excalidraw.svg
@@ -0,0 +1,17 @@
+
\ No newline at end of file
diff --git a/docs/usage/variantcalling/img/sarek_subway.png b/docs/usage/variantcalling/img/sarek_subway.png
new file mode 100644
index 0000000000..e2a689b1ca
Binary files /dev/null and b/docs/usage/variantcalling/img/sarek_subway.png differ
diff --git a/docs/usage/variantcalling/interpretation.md b/docs/usage/variantcalling/interpretation.md
new file mode 100644
index 0000000000..0e3b1e3954
--- /dev/null
+++ b/docs/usage/variantcalling/interpretation.md
@@ -0,0 +1,126 @@
+---
+order: 4
+---
+
+# Interpretation
+
+Once variants have been called, the following steps depend on the type of study and the experimental design.
+For large population studies, like case-control association analyses, an appropriate large-scale statistical approach will be chosen and different statistical or analytical tools will be used to carry out the tertiary analyses.
+
+When only a few individuals are involved, and in particular in clinical contexts, the goal will be to interpret the findings in light of different sources of information and pinpoint a causative variant for the investigated phenotype.
+
+## Overview
+
+When variants have been called, and a diagnosis is necessary, investigators will need to combine:
+
+- the predictions resulting from annotations like the one we carried out
+- biological and clinical information
+
+with the goal of narrowing the search space and reducing the number of variants to be inspected.
+This approach is summarised in the diagram below:
+
+![interpretation](./img/interpretation.excalidraw.svg)
+
+Once the list of variants has been reduced, more in-depth analyses of the reported cases and the genomic region in existing databases might be useful to reach a conclusion.
+
+## Finding Causative Variants
+
+Some of these steps might be carried out by dedicated software. For this tutorial, however, we chose to perform them one by one in order to get a better view of the rationale behind this approach.
+
+We will start by looking at the annotated VCF, which is found at this location in our GitPod environment:
+
+```bash
+cd /workspace/gitpod/training/annotation/haplotypecaller/joint_variant_calling
+```
+
+Here, we should first verify the order in which the two samples used for this analysis have been written in the VCF file. We can do that by grepping the header row with the column names, and printing the fields from the 10th onwards, i.e. the sample columns:
+
+```bash
+zcat joint_germline_recalibrated_snpEff.ann.vcf.gz | grep "#CHROM" | cut -f 10-
+```
+
+This returns:
+
+```bash
+case_case control_control
+```
+
+showing that the case genotypes have been written in the 10th field and the control genotypes in the 11th.
+
+Next, in this educational scenario we might assume that the affected individual (case) will carry at least one alternative allele for the causative variant, while the control individual will be homozygous for the reference.
+With this assumption in mind, and a bit of one-liner code, we could first filter for variants where our case is homozygous for the alternative allele, and then for those where it is heterozygous.
+
+For the first filter, we can use the following code:
+
+```bash
+zcat joint_germline_recalibrated_snpEff.ann.vcf.gz | grep PASS | grep HIGH | perl -nae 'if($F[10]=~/0\/0/ && $F[9]=~/1\/1/){print $_;}'
+```
+
+which results in the following variant:
+
+```bash
+chr21 32576780 rs541034925 A AC 332.43 PASS AC=2;AF=0.5;AN=4;DB;DP=94;ExcessHet=0;FS=0;MLEAC=2;MLEAF=0.5;MQ=60;POSITIVE_TRAIN_SITE;QD=33.24;SOR=3.258;VQSLOD=953355.11;culprit=FS;ANN=AC|frameshift_variant|HIGH|TCP10L|ENSG00000242220|transcript|ENST00000300258.8|protein_coding|5/5|c.641dupG|p.Val215fs|745/3805|641/648|214/215||,AC|frameshift_variant|HIGH|CFAP298-TCP10L|ENSG00000265590|transcript|ENST00000673807.1|protein_coding|8/8|c.1163dupG|p.Val389fs|1785/4781|1163/1170|388/389|| GT:AD:DP:GQ:PL 1/1:0,10:10:30:348,30,0 0/0:81,0:81:99:0,119,1600
+```
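+
+As a side note, the same filter could also be expressed with `bcftools`, assuming the tool is available in your environment (the one-liners in this tutorial only rely on standard Unix tools). A minimal sketch: `GT[0]` refers to the first sample column (our case) and `GT[1]` to the second (our control), while `"AA"` and `"RR"` indicate homozygous alternative and homozygous reference genotypes, respectively.
+
+```bash
+# select PASS variants with a HIGH impact annotation, where the case is
+# homozygous for the alternative allele and the control is homozygous reference
+bcftools view -H \
+  -i 'FILTER="PASS" && INFO/ANN ~ "HIGH" && GT[0]="AA" && GT[1]="RR"' \
+  joint_germline_recalibrated_snpEff.ann.vcf.gz
+```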
+
+Now we can search for this variant in the [gnomAD database](https://gnomad.broadinstitute.org), which hosts variants and genomic information from sequencing data of almost one million individuals (see [v4 release](https://gnomad.broadinstitute.org/news/2023-11-gnomad-v4-0/)).
+
+In order to search for the variant we can type its coordinates in the search field and choose the proposed variant corresponding to the exact position we need. See the figure below:
+
+![gnomad search](./img/gnomad_search.png)
+
+The resulting [variant data](https://gnomad.broadinstitute.org/region/21-32576780-32576780?dataset=gnomad_r4) show that our variant is present, and that it has already been described in [ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/), where the provided interpretation (Clinical Significance) is "Benign".
+
+We can see the resulting table in the following image:
+
+![gnomad results](./img/gnomad_var_present.png)
+
+Quite importantly, the gnomAD database also allows us to gather more information on the gene this variant occurs in. We can inspect the so-called "constraint data" by clicking on the gene name and examining the "constraint" table at the top right of the page.
+
+![constraint](./img/gnomAD_constraint.png)
+
+This information gives us a better view of the selective pressure that variation on this gene might be subject to, and therefore informs our understanding of the potential impact of a loss-of-function variant at this location.
+
+In this specific case, however, the gene is not under purifying selection for either loss-of-function variants (LOEUF 0.89) or missense ones.
+
+We can continue our analysis by looking at the variants for which our case is heterozygous while the control is homozygous for the reference, using the code:
+
+```bash
+zcat joint_germline_recalibrated_snpEff.ann.vcf.gz | grep PASS | grep HIGH | perl -nae 'if($F[10]=~/0\/0/ && $F[9]=~/0\/1/){print $_;}'
+```
+
+This will result in the following list of variants:
+
+```bash
+chr21 44339194 rs769070783 T C 57.91 PASS AC=1;AF=0.25;AN=4;BaseQRankSum=-2.373;DB;DP=84;ExcessHet=0;FS=0;MLEAC=1;MLEAF=0.25;MQ=60;MQRankSum=0;POSITIVE_TRAIN_SITE;QD=3.41;ReadPosRankSum=-0.283;SOR=0.859;VQSLOD=198.85;culprit=FS;ANN=C|start_lost|HIGH|CFAP410|ENSG00000160226|transcript|ENST00000397956.7|protein_coding|1/7|c.1A>G|p.Met1?|200/1634|1/1128|1/375||,C|upstream_gene_variant|MODIFIER|ENSG00000232969|ENSG00000232969|transcript|ENST00000426029.1|pseudogene||n.-182T>C|||||182|,C|downstream_gene_variant|MODIFIER|ENSG00000184441|ENSG00000184441|transcript|ENST00000448927.1|pseudogene||n.*3343T>C|||||3343|;LOF=(CFAP410|ENSG00000160226|1|1.00) GT:AD:DP:GQ:PL 0/1:8,9:17:66:66,0,71 0/0:67,0:67:99:0,118,999
+chr21 44406660 rs139273180 C T 35.91 PASS AC=1;AF=0.25;AN=4;BaseQRankSum=-4.294;DB;DP=127;ExcessHet=0;FS=5.057;MLEAC=1;MLEAF=0.25;MQ=60;MQRankSum=0;POSITIVE_TRAIN_SITE;QD=0.51;ReadPosRankSum=0.526;SOR=1.09;VQSLOD=269.00;culprit=FS;ANN=T|stop_gained|HIGH|TRPM2|ENSG00000142185|transcript|ENST00000397932.6|protein_coding|19/33|c.2857C>T|p.Gln953*|2870/5216|2857/4662|953/1553||;LOF=(TRPM2|ENSG00000142185|1|1.00);NMD=(TRPM2|ENSG00000142185|1|1.00) GT:AD:DP:GQ:PL 0/1:48,22:71:44:44,0,950 0/0:51,0:51:99:0,100,899
+chr21 45989090 . C T 43.91 PASS AC=1;AF=0.25;AN=4;BaseQRankSum=2.65;DP=89;ExcessHet=0;FS=4.359;MLEAC=1;MLEAF=0.25;MQ=60;MQRankSum=0;QD=2.58;ReadPosRankSum=-1.071;SOR=1.863;VQSLOD=240.19;culprit=FS;ANN=T|stop_gained|HIGH|COL6A1|ENSG00000142156|transcript|ENST00000361866.8|protein_coding|9/35|c.811C>T|p.Arg271*|892/4203|811/3087|271/1028||;LOF=(COL6A1|ENSG00000142156|1|1.00);NMD=(COL6A1|ENSG00000142156|1|1.00) GT:AD:DP:GQ:PL 0/1:10,7:18:51:52,0,51 0/0:70,0:70:99:0,120,1800
+```
+
+If we search them one by one, we will see that one in particular occurs in a gene (COL6A1) which was reported as constrained for loss-of-function variants in version 2.1 of the database:
+
+![col6a1v2](./img/gnomAD_COL6A1_v2.1.png)
+
+while version 4.0 of the database, based on almost one million samples, reports the gene as _not_ constrained:
+
+![col6a1v4](./img/gnomAD_COL6A1_v4.0.png)
+
+We can search for this variant in ClinVar by using an advanced search limited to both chromosome and base position, as indicated in the figure below:
+
+![clinvar search](./img/clinvar_search.png)
+
+This will return two results: a deletion and a single nucleotide variant C>T, the latter corresponding to the variant we called in the case individual:
+
+![clinvar results](./img/clinvar_results.png)
+
+If we click on the nomenclature of the variant we found, we will be able to access the data provided with the submission. On [this page](https://www.ncbi.nlm.nih.gov/clinvar/variation/497373/) we can see that multiple submitters have provided an interpretation for this nonsense mutation (2 stars).
+Under the section "Submitted interpretations and evidence" we can gather additional data on the clinical information that led the submitters to classify the variant as "pathogenic".
+
+## Conclusions
+
+After narrowing down our search and inspecting genomic context and clinical information, we can conclude that the variant
+
+```bash
+chr21 45989090 C T AC=1;AF=0.25;AN=4;BaseQRankSum=2.37;DP=86;ExcessHet=0;FS=0;MLEAC=1;MLEAF=0.25;MQ=60;MQRankSum=0;QD=2.99;ReadPosRankSum=-0.737;SOR=1.022;VQSLOD=9.09;culprit=QD;ANN=T|stop_gained|HIGH|COL6A1|ENSG00000142156|transcript|ENST00000361866.8|protein_coding|9/35|c.811C>T|p.Arg271*|892/4203|811/3087|271/1028||;LOF=(COL6A1|ENSG00000142156|1|1.00);NMD=(COL6A1|ENSG00000142156|1|1.00) GT:AD:DP:GQ:PL 0/1:8,6:15:40:50,0,40 0/0:70,0:70:99:0,112,1494
+```
+
+is most likely the causative one: it creates a premature stop codon in the COL6A1 gene, and loss-of-function variants in this gene are known to be pathogenic.
diff --git a/docs/usage/variantcalling/introduction.md b/docs/usage/variantcalling/introduction.md
new file mode 100644
index 0000000000..78e9c79075
--- /dev/null
+++ b/docs/usage/variantcalling/introduction.md
@@ -0,0 +1,45 @@
+---
+order: 1
+---
+
+# Variant Calling Tutorial
+
+These pages are a tutorial workshop for the [Nextflow](https://www.nextflow.io) pipeline [nf-core/sarek](https://nf-co.re/sarek).
+
+In this workshop, we will recap the application of next-generation sequencing to identify genetic variants in a genome. You will learn how to use the sarek pipeline to carry out this data-intensive workflow efficiently. We will cover topics such as experimental design, configuration of the pipeline and code execution.
+
+Please note that this is not an introductory workshop, and we will assume some basic familiarity with Nextflow.
+
+By the end of this workshop, you will be able to:
+
+- understand the key concepts behind variant calling, as adopted in this pipeline
+- analyse simple NGS datasets with the sarek workflow
+- customise some of its features for your own variant calling analyses
+- integrate different sources of information to identify candidate variants
+- make a hypothesis about variant interpretation using the output of sarek
+
+Let's get started!
+
+## Running with Gitpod
+
+In order to run this tutorial using Gitpod, please make sure that:
+
+1. You have a GitHub account: if not, create one [here](https://github.com/signup)
+2. Once you have a GitHub account, sign up for Gitpod with your GitHub credentials [here](https://gitpod.io/login/), choosing "Continue with GitHub".
+
+Now you're all set and can use the following button to launch the service:
+
+[![Open in GitPod](https://img.shields.io/badge/Gitpod-%20Open%20in%20Gitpod-908a85?logo=gitpod)](https://gitpod.io/#https://github.com/lescai-teaching/sarek-tutorial)
+
+## Additional documentation
+
+- You can find detailed documentation on **Nextflow** [here](https://www.nextflow.io/docs/latest/)
+- You can find additional training on [these pages](https://training.nextflow.io)
+
+## Credits & Copyright
+
+This training material has been written by [Francesco Lescai](https://github.com/lescai) during the [nf-core](https://nf-co.re) Hackathon in Barcelona, 2023. It was originally meant as a contribution for the nf-core community, and aimed at anyone who is interested in using nf-core pipelines for their studies or research activities.
+
+The Docker image and Gitpod environment used in this repository have been created by [Seqera](https://seqera.io) but have been made open-source ([CC BY-NC-ND](https://creativecommons.org/licenses/by-nc-nd/4.0/)) for the community.
+
+All examples and descriptions are licensed under the [Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License](http://creativecommons.org/licenses/by-nc-nd/4.0/).
diff --git a/docs/usage/variantcalling/sarek.md b/docs/usage/variantcalling/sarek.md
new file mode 100644
index 0000000000..150f01d72a
--- /dev/null
+++ b/docs/usage/variantcalling/sarek.md
@@ -0,0 +1,152 @@
+---
+order: 3
+---
+
+# Using Sarek for Variant Calling
+
+In order to carry out a germline variant calling analysis we will use the nf-core pipeline [sarek](https://nf-co.re/sarek/3.3.2).
+
+## Overview
+
+The pipeline is organised following the three main analysis blocks we previously described: pre-processing, variant calling and annotation.
+
+![sarek_overview](./img/sarek_subway.png)
+
+In each analysis block, the user can choose among a range of different options in terms of aligners, callers and software to carry out the annotation.
+The analysis can also start from different steps, depending on the available input and whether it has already been partially processed.
+
+## Experimental Design
+
+In order to choose among the different options Sarek offers, the user should collect a few key elements of the experimental design before beginning the analysis.
+
+### Library design
+
+If the experiment used a capture (or targeted) strategy, the user will need to make sure the `bed` file with the target regions is available.
+This file will be useful if the user wants to limit variant calling and annotation to those regions.
+In this case the file can be passed on the Sarek command line using the `--intervals target.bed` parameter.
+If the sequencing strategy is a _whole exome_ or _panel_, the pipeline also allows enabling settings specific to this library type, via the parameter `--wes`.
+
+### Reference genome
+
+nf-core pipelines make use of the Illumina iGenomes collection as [reference genomes](https://nf-co.re/docs/usage/reference_genomes).
+Before starting the analysis, the user might want to check whether the genome they need is part of this collection.
+They might also consider downloading the reference locally when running on premises: this is useful for multiple runs and speeds up the analysis. In this case the parameter `--igenomes_base` can be used to point to the root directory of the downloaded references.
+
+One might also need to use custom files: in this case the user can either provide specific parameters on the command line, or create a config file adding a new section to the `genomes` object. See [here](https://nf-co.re/docs/usage/reference_genomes#custom-genomes) for more details.
+
+We will follow this specific approach in this tutorial, since the data we will be using have been simulated on chromosome 21 of the Human GRCh38 reference, and we have prepared fasta, indexes and annotation files containing only this chromosome locally.
+
+### Input files
+
+The input data should be provided in a CSV file, according to a format that is largely common for nf-core pipelines.
+The format is described in the [sarek usage page](https://nf-co.re/sarek/3.3.2/docs/usage#input-sample-sheet-configurations).
+
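+As a reference, a minimal samplesheet for the two paired-end samples of this tutorial might look like the sketch below (the file names and paths are hypothetical; the tutorial data come with a ready-made CSV):
+
+```csv
+patient,sample,lane,fastq_1,fastq_2
+patient1,case,1,/path/to/case_R1.fastq.gz,/path/to/case_R2.fastq.gz
+patient2,control,1,/path/to/control_R1.fastq.gz,/path/to/control_R2.fastq.gz
+```
+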
+## GATK Best Practices
+
+During this tutorial we will use the options Sarek offers to follow the [GATK best practices workflow](https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-).
+
+This is solely for educational purposes, since the tutorial dataset includes only 2 samples: while joint-genotyping is a valid choice, the use of soft filtering on such a limited dataset will not offer significant improvements. Additionally, running VQSR on a small dataset will run into issues with some annotations, and will require limiting this step to fewer annotations than usual.
+
+## Running Sarek Germline
+
+In the following sections we will first prepare our references, then configure our computational resources so that the pipeline can run on a Gitpod VM, edit the filtering settings and finally run the pipeline.
+
+### Reference Genome
+
+Following the considerations above, we will first of all edit the `nextflow.config` file in our working directory to add a new genome.
+It is sufficient to add the following code to the `params` scope of the config.
+
+```groovy
+igenomes_base = '/workspace/gitpod/training/data/refs/'
+genomes {
+ 'GRCh38chr21' {
+ bwa = "${params.igenomes_base}/sequence/Homo_sapiens_assembly38_chr21.fasta.{amb,ann,bwt,pac,sa}"
+ dbsnp = "${params.igenomes_base}/annotations/dbsnp_146.hg38_chr21.vcf.gz"
+ dbsnp_tbi = "${params.igenomes_base}/annotations/dbsnp_146.hg38_chr21.vcf.gz.tbi"
+ dbsnp_vqsr = '--resource:dbsnp,known=false,training=true,truth=false,prior=2.0 dbsnp_146.hg38_chr21.vcf.gz'
+ dict = "${params.igenomes_base}/sequence/Homo_sapiens_assembly38_chr21.dict"
+ fasta = "${params.igenomes_base}/sequence/Homo_sapiens_assembly38_chr21.fasta"
+ fasta_fai = "${params.igenomes_base}/sequence/Homo_sapiens_assembly38_chr21.fasta.fai"
+ germline_resource = "${params.igenomes_base}/annotations/gnomAD.r2.1.1.GRCh38.PASS.AC.AF.only_chr21.vcf.gz"
+ germline_resource_tbi = "${params.igenomes_base}/annotations/gnomAD.r2.1.1.GRCh38.PASS.AC.AF.only_chr21.vcf.gz.tbi"
+ known_snps = "${params.igenomes_base}/annotations/1000G_phase1.snps.high_confidence.hg38_chr21.vcf.gz"
+ known_snps_tbi = "${params.igenomes_base}/annotations/1000G_phase1.snps.high_confidence.hg38_chr21.vcf.gz.tbi"
+ known_snps_vqsr = '--resource:1000G,known=false,training=true,truth=true,prior=10.0 1000G_phase1.snps.high_confidence.hg38_chr21.vcf.gz'
+ known_indels = "${params.igenomes_base}/annotations/Mills_and_1000G_gold_standard.indels.hg38_chr21.vcf.gz"
+ known_indels_tbi = "${params.igenomes_base}/annotations/Mills_and_1000G_gold_standard.indels.hg38_chr21.vcf.gz.tbi"
+ known_indels_vqsr = '--resource:mills,known=false,training=true,truth=true,prior=10.0 Mills_and_1000G_gold_standard.indels.hg38_chr21.vcf.gz'
+ snpeff_db = '105'
+ snpeff_genome = 'GRCh38'
+ }
+}
+```
+
+### Computing resources
+
+Based on the choices we made when starting up the Gitpod environment, we recommend using the following additional parameters.
+They can also be added to the `params` scope of the config file we just edited.
+
+```groovy
+params {
+ max_cpus = 2
+ max_memory = '6.5GB'
+ max_time = '2.h'
+ use_annotation_cache_keys = true
+}
+```
+
+The parameter `use_annotation_cache_keys` allows the annotation software to resolve local paths correctly when the annotation cache is downloaded into the environment.
+
+### Filtering parameters
+
+As we mentioned earlier, we will be using the VQSR filtering tool once the variants have been called.
+However, this tool is designed to take advantage of large amounts of variant annotations to improve filtering: when a small tutorial dataset is used, some of the annotations will not have sufficient data or might even have no variance.
+To account for this, we have to change the filtering options and limit this approach to a subset of the variant annotations.
+
+We can do this by editing the process descriptors for the Sarek modules running VQSR for both single nucleotide variants and insertion/deletions.
+
+```groovy
+process {
+ withName: 'VARIANTRECALIBRATOR_INDEL' {
+ ext.prefix = { "${meta.id}_INDEL" }
+ ext.args = "-an QD -an FS -an SOR -an DP -mode INDEL"
+ publishDir = [
+ enabled: false
+ ]
+ }
+
+ withName: 'VARIANTRECALIBRATOR_SNP' {
+ ext.prefix = { "${meta.id}_SNP" }
+ ext.args = "-an QD -an MQ -an FS -an SOR -mode SNP"
+ publishDir = [
+ enabled: false
+ ]
+ }
+}
+```
+
+### Launching the pipeline
+
+Now we are ready to launch the pipeline, and we can use the following command line:
+
+```bash
+nextflow run nf-core/sarek -r 3.4.0 \
+--input /workspace/gitpod/training/data/reads/sarek-input.csv \
+--outdir . \
+--tools haplotypecaller,snpeff \
+--genome GRCh38chr21 \
+--joint_germline \
+--intervals /workspace/gitpod/training/exome_target_hg38_chr21.bed \
+--wes
+```
+
+Notice that we have selected `--joint_germline` to enable the joint-genotyping workflow, specified that our library uses a capture strategy with `--wes`, and provided a bed file with the targets via `--intervals`.
+The target file in this case refers to the capture intervals on chromosome 21 only, where the data have been simulated.
+
+The whole pipeline from FASTQ input to annotated VCF should run in about 25 minutes.
+
+Our final VCF file will be located in:
+
+```bash
+./annotation/haplotypecaller/joint_variant_calling
+```
diff --git a/docs/usage/variantcalling/theory.md b/docs/usage/variantcalling/theory.md
new file mode 100644
index 0000000000..e8e8a14b80
--- /dev/null
+++ b/docs/usage/variantcalling/theory.md
@@ -0,0 +1,140 @@
+---
+order: 2
+---
+
+# Calling Variants on Sequencing Data
+
+Before we dive into one of the nf-core pipelines used for variant calling, it's worth looking at some theoretical aspects of variant calling.
+
+## Overview
+
+The term "variant calling" is rooted in the history of DNA sequencing, and it indicates an approach where we identify (i.e. call) positions in a genome (loci) which are variable in a population (genetic variants). The specific genotype of an individual at that variant locus is then assigned.
+
+There are many different approaches for calling variants from sequencing data: here, we will look more specifically at a reference-based variant calling approach, i.e. where a reference genome is needed and variant sites are identified by comparing the reads to this reference.
+
+Over the years, also thanks to the work carried out by the [GATK team](https://gatk.broadinstitute.org/hc/en-us) at the Broad Institute, there has been a convergence on a "best practices" workflow, which is summarised in the diagram below:
+
+![overview](./img/overview.excalidraw.svg)
+
+In this scheme we can identify a few key phases in the workflow. Pre-processing comes first: raw data are handled and mapped to a reference genome, and then transformed in order to increase the accuracy of the following analyses. Variant calling is then carried out, followed by filtering and annotation.
+Here we will briefly discuss these key steps, which might vary depending on the specific type of data one is performing variant calling on.
+
+## Alignment
+
+The alignment step is where reads obtained from genome fragments of a sample are identified as originating from a specific location in the genome.
+This step is essential in a reference-based workflow, because it is the comparison of the raw data with the reference that informs us whether a position in the genome might be variable or not.
+
+Mismatches, insertions and deletions (INDELs), as well as duplicated regions, can make this step challenging: this is the reason why an appropriate aligner has to be chosen, depending on the sequencing application and data type.
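+
+As an illustrative sketch, a typical short-read alignment step might look like the following (the aligner choice and file names are assumptions for illustration only; Sarek will run the appropriate aligner for us):
+
+```bash
+# align paired-end reads with bwa-mem, then coordinate-sort and index the BAM
+bwa mem -t 2 reference.fasta sample_R1.fastq.gz sample_R2.fastq.gz \
+  | samtools sort -o sample.sorted.bam -
+samtools index sample.sorted.bam
+```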
+
+Once each raw read has been aligned to the region of the genome it is most likely originating from, the sequence of all reads overlapping each locus can be used to identify potentially variable sites. Each read will support the presence of an allele identical to the reference, or a different one (alternative allele), and the variant calling algorithm will measure the weighted support for each allele.
+
+However, the support given by the raw data to alternative variants might be biased. For this reason, one can apply certain corrections to the data to ensure the support for the alleles is assessed correctly. This is done by performing the two steps described below: marking duplicates, and recalibrating base quality scores.
+
+## Marking Duplicates
+
+Duplicates are non-independent measurements of a sequence fragment.
+
+Since DNA fragmentation is theoretically random, reads originating from different fragments provide independent information. An algorithm can use this information to assess the support for different alleles.
+When these measurements are not independent, however, the raw data might provide biased support towards a specific allele.
+
+Duplicates can be caused by PCR during library preparation (library duplicates) or might occur during sequencing, when the instrument reads the signal from different clusters (as in Illumina short-read sequencing). The latter are called "optical duplicates".
+
+A specific step called "marking duplicates" identifies read pairs with the same orientation and 5' position (before any clipping), which are assumed to come from the same input DNA template: one representative pair is then chosen based on quality scores and other criteria, while the others are marked as duplicates.
+Marked reads are then ignored in the following steps.
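+
+As a sketch, this step can be performed with GATK's MarkDuplicates, one of the tools commonly used for this purpose (file names are hypothetical; Sarek runs this step for us):
+
+```bash
+# flag duplicate read pairs in the aligned BAM; reads are marked, not removed
+gatk MarkDuplicates \
+  -I sample.sorted.bam \
+  -O sample.md.bam \
+  -M sample.md.metrics.txt
+```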
+
+## Base Quality Score Recalibration
+
+Among the parameters used by a variant calling algorithm to weigh the support for different alleles, the quality score of the base in the read at the variant locus is quite important.
+Sequencing instruments, however, can make systematic errors when reading the signal at each cycle, and cannot account for errors introduced during PCR.
+
+Once a read has been aligned to the reference, however, an appropriate algorithm can compare the error rate estimated from the reported base quality scores with the differences actually observed against the reference sequence (empirical quality), and perform appropriate corrections.
+This process is called "base quality score recalibration" (BQSR).
+
+To calculate empirical qualities, the algorithm simply counts the number of mismatches in the observed bases. Any mismatch which does not overlap a known variant is considered an error. The empirical error rate is the ratio between the counted errors and the total number of observed bases.
+A Yates correction is applied to avoid dividing by zero or dealing with small counts:
+
+$$
+e_{empirical} = \frac{n_{mismatches} + 1}{n_{bases} +2}
+$$
+
+The empirical error is expressed as a Quality in Phred-scale:
+
+$$
+Q_{empirical} = -10 \times log_{10}(e_{empirical})
+$$
+
+Let's use a simple example like the one in the diagram below, where for illustrative purposes we only consider the bases belonging to the same read.
+
+![bqsr](./img/bqsr.excalidraw.svg)
+
+In this example we have 3 mismatches, but one is a reported variant site: we therefore only count 2 errors, over 10 observed bases. According to the approach we just explained,
+
+$$
+Q_{empirical} = -10 \times log_{10}(\frac{2 + 1}{10 + 2}) = 6.02
+$$
+
+To calculate the average reported Q score, we should average the error probabilities and then convert the result back into Phred scale:
+
+$$
+Q_{average} = -10 \times log_{10}(\frac {0.1 + 0.1 + 0.01 + 0.1 + 0.01 + 0.01 + 0.01 + 0.1 + 0.1 + 0.1}{10}) = 11.94
+$$
+
+Our empirical Q score is therefore 6.02, the average reported Q score is 11.94, and the difference is $\Delta = 11.94 - 6.02 = 5.92$.
+
+The recalibrated Q score of each base would correspond to the reported Q score minus this $\Delta$.
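+
+For readers who want to double-check this arithmetic, the small throwaway snippet below reproduces the numbers above (purely illustrative; `awk` is used simply because it is available in any Unix environment):
+
+```bash
+awk 'BEGIN {
+  # empirical quality: 2 errors over 10 observed bases, with Yates correction
+  q_emp = -10 * log((2 + 1) / (10 + 2)) / log(10)
+  # average reported quality: mean of the per-base error probabilities
+  sum = 0.1 + 0.1 + 0.01 + 0.1 + 0.01 + 0.01 + 0.01 + 0.1 + 0.1 + 0.1
+  q_avg = -10 * log(sum / 10) / log(10)
+  printf "Q_empirical = %.2f, Q_average = %.2f, delta = %.2f\n", q_emp, q_avg, q_avg - q_emp
+}'
+```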
+
+In a real sequencing dataset, this calculation is performed for different groups (bins) of bases: those in the same lane, those with the same original quality score, those at the same machine cycle, and those in the same sequencing context.
+In each bin, the difference ($\Delta$) between the average reported quality and the empirical quality is calculated.
+The recalibrated score is then the reported score minus the sum of the deltas calculated for each bin the base belongs to.
+
+A detailed summary of this approach can be found on the [GATK BQSR page](https://gatk.broadinstitute.org/hc/en-us/articles/360035890531-Base-Quality-Score-Recalibration-BQSR-). We also found this [step-by-step guide](https://rstudio-pubs-static.s3.amazonaws.com/64456_4778547202f24f32b0edc325e96b061a.html) through the mathematical approach quite useful. Full details are explained in the [publication](https://www.nature.com/articles/ng.806) that first proposed this method.
+
+## Calling Variants
+
+Once we have prepared the data for an accurate identification of the variants, we are ready to perform the next steps.
+The most important innovation introduced some years ago in this part of the workflow has been to separate the identification of a variant site (i.e. variant calling itself) from the assignment of a genotype to each individual.
+This approach makes the computation more tractable, especially for large sample cohorts: BAM files are only accessed per sample in the first step, while multi-sample cohort data are used together in the second step in order to increase the accuracy of genotype assignment.
+
+### Identifying Variants
+
+In this phase, which is performed on each sample independently, a first step uses a sliding window to count differences from the reference (i.e. mismatches, INDELs) and identify potentially variable regions, which GATK calls "active regions".
+Then, a local graph assembly of the reads is created to identify plausible haplotypes, which are aligned to the reference with a traditional alignment algorithm ("Smith-Waterman"): this alignment is used to identify variants.
+For each read in an active region, the support for each of the haplotypes is counted, and a likelihood score for each read/haplotype combination is calculated.
+These likelihoods make it possible to calculate the support for each of the alleles at a variant site, and the read-haplotype likelihoods are a key input for the Bayesian statistics used to determine the most likely genotype.
+This first genotype assignment could be sufficient if one analysed a single sample only.
+
+### Assigning Genotypes
+
+When multiple samples are analysed, information from each of them could collectively improve the genotype assignment.
+This is because the magnitude of potential biases (for example, strand bias) can be better estimated, and because the distributions of the annotations used to inform the genotype assignment become more stable when data from multiple samples are combined.
+The use of a larger cohort also increases sensitivity.
+
+This is possible if the variant calling step produces a variation of the VCF file format called GVCF: in addition to variant sites, this format also includes the non-variant intervals in the genome of each sample. Moreover, it reports the likelihood of a non-reference symbolic allele at these non-variant intervals.
+This information makes it possible to re-genotype each sample using data from the whole cohort.
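+
+A minimal sketch of what this two-step approach looks like with GATK tools follows (file names are hypothetical; Sarek orchestrates these steps for us when joint genotyping is enabled):
+
+```bash
+# step 1: call each sample independently in GVCF mode
+gatk HaplotypeCaller -R reference.fasta -I case.bam -O case.g.vcf.gz -ERC GVCF
+gatk HaplotypeCaller -R reference.fasta -I control.bam -O control.g.vcf.gz -ERC GVCF
+
+# step 2: combine the GVCFs and jointly genotype the whole cohort
+gatk CombineGVCFs -R reference.fasta -V case.g.vcf.gz -V control.g.vcf.gz -O cohort.g.vcf.gz
+gatk GenotypeGVCFs -R reference.fasta -V cohort.g.vcf.gz -O cohort.vcf.gz
+```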
+
+You can read more on the GATK website about the [logic of joint calling](https://gatk.broadinstitute.org/hc/en-us/articles/360035890431-The-logic-of-joint-calling-for-germline-short-variants).
+
+### Filtering Variants
+
+There are several ways to spot potential false positives through filtering.
+
+_Hard filtering_ uses pre-defined thresholds on different variant annotations (allele depth, mapping quality and many others) in order to flag variants passing all of these criteria, as well as those failing to meet any of them. This approach is mostly useful when calling a few samples, where not enough data are available for more sophisticated solutions.
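+
+As an illustration, hard filtering SNPs with GATK's VariantFiltration might look like the sketch below (the thresholds shown are the commonly cited GATK starting values, not values tuned for this tutorial, and file names are hypothetical):
+
+```bash
+# flag variants failing any threshold; they are labelled, not removed
+gatk VariantFiltration \
+  -V cohort.vcf.gz \
+  --filter-expression "QD < 2.0" --filter-name "QD2" \
+  --filter-expression "FS > 60.0" --filter-name "FS60" \
+  --filter-expression "MQ < 40.0" --filter-name "MQ40" \
+  -O cohort.hardfiltered.vcf.gz
+```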
+
+_Soft filtering_ infers the thresholds to be applied from the data themselves. This approach uses the distributions of the annotations and their overlap with known and validated variants: it identifies the combinations of annotations which are more likely to describe true positives (i.e. the variants they describe in the analysis cohort overlap with those validated in other databases). This approach is implemented by a GATK tool called Variant Quality Score Recalibration (VQSR).
+
+More details can be found on the [GATK VQSR page](https://gatk.broadinstitute.org/hc/en-us/articles/360035531612-Variant-Quality-Score-Recalibration-VQSR-).
+
+More recently, pre-trained deep learning models have also become available to filter variants, based on neural network architectures trained on large numbers of variants from population databases.
+
+## Annotation
+
+Once the analysis has produced a final VCF file, a last step called "annotation" is necessary to interpret the results.
+This step uses different databases to describe (annotate) each variant from a genomic, biological, or population point of view.
+The software used to carry out this task will add information to the VCF file such as:
+
+- the gene each variant overlaps with
+- the transcript the variant overlaps with
+- the potential biological consequence on each of those transcripts
+- population frequency (minor allele frequency, described in different databases such as gnomAD)
+
+as well as several other items we can use to interpret our findings from a biological or clinical point of view.
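+
+For example, after annotating with snpEff (the annotator we will use later in this tutorial), each VCF record gains an `ANN` INFO field, which can be inspected with a one-liner like the following (assuming the annotated VCF we will produce later in this tutorial):
+
+```bash
+# print the ANN field of the first variant record: a comma-separated list of
+# annotations, each one a pipe-separated set of fields (allele, consequence,
+# impact, gene name, ...)
+zcat joint_germline_recalibrated_snpEff.ann.vcf.gz \
+  | grep -v "^#" | head -1 | grep -o 'ANN=[^;]*' | tr ',' '\n'
+```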