6. Test datasets
A test dataset composed of Sumatran rhinoceros (Dicerorhinus sumatrensis) whole-genome re-sequencing data from historical and modern samples, several reference FASTA files, a GTF file with annotations, outgroup genomes and a dated phylogenetic tree for GERP is available from the Scilifelab Data Repository (DOI: 10.17044/scilifelab.19248172.v2). The included data corresponds to scaffold ‘Sc9M7eS_2_HRSCAF_41’ from the Sumatran rhinoceros genome (GenBank accession number GCA_014189135.1) and can be used to get familiar with running GenErode on a high-performance compute cluster.
Workflow descriptions and all scripts used to generate the test dataset are provided in this GitHub repository (docs/extras/test_dataset_generation).
Note that the phylogenetic tree provided with the test dataset that was used in Kutschera et al. 2022 is scaled to billions of years whereas the recommended scale is millions of years.
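Converting branch lengths from billions to millions of years amounts to multiplying each of them by 1000. The following is a minimal sketch of such a rescaling for a Newick tree, not a script from the GenErode repository. The function name rescale_newick and the file names in the usage line are hypothetical, and the sketch assumes plain numeric branch lengths that directly follow a colon, with no colons inside quoted labels.

```shell
# Sketch: multiply every branch length in a Newick tree by a factor.
# Usage: rescale_newick FACTOR < input_tree > output_tree
rescale_newick() {
  awk -v f="$1" '{
    out = ""
    rest = $0
    # A branch length is a number that directly follows a colon.
    while (match(rest, /:[0-9]+(\.[0-9]+)?([eE][-+]?[0-9]+)?/)) {
      len = substr(rest, RSTART + 1, RLENGTH - 1)
      # Copy everything up to and including the colon, then the scaled value
      # (printed with at most 10 significant digits).
      out = out substr(rest, 1, RSTART) sprintf("%.10g", len * f)
      rest = substr(rest, RSTART + RLENGTH)
    }
    print out rest
  }'
}
```

For a tree scaled to billions of years, the call could look like this:
rescale_newick 1000 < tree_billions.nwk > tree_millions.nwk
It is worth comparing the result against the original tree before using it as GERP input.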
Detailed description of how the pipeline was run with the test dataset in Kutschera et al. 2022 (GenErode version 0.4.1)
The following steps were performed to analyze the test dataset with GenErode on an HPC cluster running slurm, with both Conda and Singularity installed on the system. The pipeline was run three times with different settings, which are described below.
- Cloned the GitHub repository to a directory on the HPC cluster with
git clone https://github.com/NBISweden/GenErode.git
and moved into the directory GenErode/
- Generated a conda environment with Snakemake and other required libraries from the environment file with
conda env create -n generode -f environment.yml
(only if the conda environment had not been created yet)
- Created a slurm profile for the pipeline run:
a) Installed cookiecutter with
conda install -c conda-forge cookiecutter
(only if cookiecutter was not installed yet)
b) Downloaded the Snakemake slurm profile with cookiecutter
cookiecutter https://github.com/Snakemake-Profiles/slurm.git
with profile_name: slurm, no cluster_sidecar_help (2) and cluster_config: ../config/cluster.yaml
c) Edited the Snakemake slurm profile as follows: moved into the folder slurm, opened the file config.yaml and deleted the line cluster-cancel: "scancel"
- Placed the test dataset into a dedicated directory on the HPC cluster (available for download from the Scilifelab Data Repository; DOI: 10.17044/scilifelab.19248172)
- Metadata files for three historical and three modern Sumatran rhinoceros samples (available for download from the Scilifelab Data Repository along with the test dataset) were edited as follows:
a) Placed the metadata files for the test dataset into the subdirectory of the pipeline
b) Updated the metadata files. The paths to the FASTQ files (in columns path_to_R1_fastq_file) were updated with the correct paths on the HPC cluster to the corresponding FASTQ files from the test dataset
- The configuration files were updated with the correct paths on the HPC cluster to the test dataset (three configuration files are available for download from the Scilifelab Data Repository)
: Updated the path to the Sumatran rhinoceros mitochondrial genome FASTA file in line 97 (species_mt_path) with the correct path to the corresponding FASTA file from the test dataset
: Updated the path to the Sumatran rhinoceros reference FASTA file in line 21 (ref_path) with the correct full path to the corresponding FASTA file from the test dataset
: Updated the path to the White rhinoceros reference FASTA file in line 21 (ref_path), the path to the White rhinoceros annotation in GTF format in line 455 (gtf_path), as well as the paths to the GERP outgroup FASTA files in line 492 (gerp_ref_path) and to the phylogenetic tree in line 501 (tree) with the correct paths to the corresponding files from the test dataset
- Chose one of the following configuration files for the pipeline run and created a copy of the configuration file
: Maps reads from three historical and three modern Sumatran rhinoceros samples to the Sumatran rhinoceros mitochondrial genome and the mitochondrial genomes from several other species to identify samples with elevated mapping success to another species. Run
cp config/config_mitogenomes.yaml config/config.yaml
: Maps reads from three historical and three modern Sumatran rhinoceros samples to a Sumatran rhinoceros reference. After data processing, runs mlRho, PCA, and ROH. Run
cp config/config_sum_rhino.yaml config/config.yaml
: Maps reads from three historical and three modern Sumatran rhinoceros samples to a White rhinoceros reference. After data processing, runs PCA, snpEff, and GERP. Run
cp config/config_white_rhino.yaml config/config.yaml
- Opened a tmux session to be able to run GenErode in the background with
tmux new-session -s generode
- Activated the conda environment with
conda activate generode
- Started a dry run to check if the pipeline run would work as expected with
snakemake -j 100 --use-singularity --profile slurm -npr &> YYMMDD_dry_run.out
(replace YYMMDD with the current date) and checked the file YYMMDD_dry_run.out for any errors
- Started the main run of the pipeline with
snakemake -j 100 --use-singularity --profile slurm &> YYMMDD_main_run.out
and checked the file YYMMDD_main_run.out regularly during the run
- After the pipeline run finished successfully, repeated steps 7 to 11 (from choosing a configuration file through the main run) for the remaining configuration files
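The per-configuration steps above (copying a configuration file to config/config.yaml, starting a dry run, then starting the main run) can be wrapped in a small helper. This is only a sketch under stated assumptions: the generode conda environment is already active, the current directory is GenErode/, and the slurm profile has been set up as described. The function name, the SNAKEMAKE override variable and the label added to the log file names are hypothetical and not part of GenErode.

```shell
# Sketch: dry run and main run for one configuration file.
# SNAKEMAKE may be overridden (e.g. SNAKEMAKE=echo) to preview the calls
# without starting the pipeline; this variable is a hypothetical convenience.
run_generode_config() {
  cfg="$1"                # configuration file to copy to config/config.yaml
  label="$2"              # short label appended to the log file names
  stamp=$(date +%y%m%d)   # current date as YYMMDD

  cp "$cfg" config/config.yaml || return 1

  # Dry run first; stop here if Snakemake exits with an error.
  "${SNAKEMAKE:-snakemake}" -j 100 --use-singularity --profile slurm -npr \
    > "${stamp}_dry_run_${label}.out" 2>&1 || return 1

  # Main run, with the same options minus the dry-run flags.
  "${SNAKEMAKE:-snakemake}" -j 100 --use-singularity --profile slurm \
    > "${stamp}_main_run_${label}.out" 2>&1
}
```

A call for one of the three configurations might look like this:
run_generode_config config/config_sum_rhino.yaml sum_rhino
Unlike the manual procedure, this helper only stops before the main run when the dry run exits with an error, so the dry-run log should still be inspected by hand.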
A minimal test dataset has been compiled to set up automatic testing of the pipeline on GitHub and is located in .test/data. This data can be used to try out the pipeline with limited computational resources, e.g. on a local computer. Note that the results are not biologically meaningful because of the small size of the minimal test dataset.
- is a reference fasta file with a short scaffold from the Sumatran rhinoceros reference genome.
- is a GTF file with gene predictions for the short scaffold from the Sumatran rhinoceros reference genome.
- is a fasta file containing the mitochondrial genome from Sumatran rhinoceros to test mapping of reads from historical samples to different mitochondrial genomes.
- contains fastq files with Illumina paired-end reads of historical Sumatran rhinoceros samples.
- contains fastq files with Illumina paired-end reads of modern Sumatran rhinoceros samples.
- is the metadata file for the historical samples and .test/config/modern_paths.txt is the metadata file for the modern samples.
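Because the metadata files point to the fastq files, it can help to confirm that every listed path exists before starting a run. The sketch below assumes that a metadata file is whitespace-delimited, has a header row, and stores paths in columns whose names start with path_to_ (path_to_R1_fastq_file is the column named earlier; treating all path_to_ columns alike is an assumption). The function name check_metadata_paths is hypothetical.

```shell
# Sketch: report fastq paths listed in a metadata file that do not exist.
# Usage: check_metadata_paths METADATA_FILE
check_metadata_paths() {
  awk '
    NR == 1 {
      # Remember which header columns hold paths.
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^path_to_/) is_path[i] = 1
      }
      next
    }
    {
      # Print the value of every path column, one per line.
      for (i = 1; i <= NF; i++) {
        if (i in is_path) print $i
      }
    }
  ' "$1" | while IFS= read -r fq; do
    [ -e "$fq" ] || echo "missing: $fq"
  done
}
```

For example:
check_metadata_paths .test/config/modern_paths.txt
The function prints nothing when all listed files are found. Relative paths are resolved against the current directory, so the check should be run from the directory in which the pipeline is started.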
Fastq files were generated as follows: reads from whole genome re-sequencing of the historical and modern samples were mapped to the full Sumatran rhinoceros reference genome with bwa mem and default settings. The test scaffold was extracted from the BAM files and converted to fastq format (containing only mapped paired-end reads). The same approach was applied to identify mitochondrial reads for historical samples, which were then sampled down to 500 paired-end reads per sample. The mitochondrial reads were finally merged with the reads that had mapped to the test scaffold to create two fastq files per sample (forward and reverse reads).
contains gzipped fasta files for outgroup species and a time-calibrated tree (generated on www.timetree.org) that can be used as input data for a GERP++ test run.
Fasta files for outgroup species were generated as follows: CDS were extracted from the Sumatran rhinoceros test scaffold and blasted to each of the outgroup species' full genomes with tblastx. The top blast hits per CDS were blasted back to the Sumatran rhinoceros CDS with blastn. All scaffolds with reciprocal blast hits were extracted per outgroup species.
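The reciprocal filtering step can be illustrated on BLAST tabular output. The sketch below is not the authors' script (those are in docs/extras/test_dataset_generation) and rests on several assumptions. Both searches are assumed to have been written in the default 12-column tabular format (-outfmt 6), where column 1 is the query, column 2 the subject and column 12 the bit score. The query names in the reverse search are assumed to equal the outgroup scaffold names. A hit is assumed to count as reciprocal when the best reverse hit of a scaffold is the CDS whose best forward hit was that scaffold. The function and file names are hypothetical.

```shell
# Sketch: list outgroup scaffolds with reciprocal best hits.
# $1: forward hits (CDS as query, outgroup scaffold as subject)
# $2: reverse hits (outgroup scaffold as query, CDS as subject)
reciprocal_scaffolds() {
  awk '
    # First file: keep the highest-scoring scaffold for each CDS.
    FNR == NR {
      if (!($1 in fbest) || ($12 + 0) > fscore[$1]) {
        fbest[$1] = $2
        fscore[$1] = $12 + 0
      }
      next
    }
    # Second file: keep the highest-scoring CDS for each scaffold.
    {
      if (!($1 in rbest) || ($12 + 0) > rscore[$1]) {
        rbest[$1] = $2
        rscore[$1] = $12 + 0
      }
    }
    END {
      # A scaffold is kept when both best hits point at each other.
      for (cds in fbest) {
        scaffold = fbest[cds]
        if ((scaffold in rbest) && rbest[scaffold] == cds) keep[scaffold] = 1
      }
      for (scaffold in keep) print scaffold
    }
  ' "$1" "$2" | sort
}
```

The resulting list of scaffold names could then be passed to a tool that extracts sequences by name from the outgroup genome. Note that the sketch expects a non-empty forward hit file.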
The following should also work on a Windows 10 PC with the WSL2 subsystem but hasn't been tested.
brew install virtualbox
brew install vagrant
brew install vagrant-manager
mkdir vm-singularity
cd vm-singularity
git clone https://github.com/NBISweden/GenErode.git
The first time you create and bring up the virtual machine, follow these instructions:
export VM=sylabs/singularity-3.7-ubuntu-bionic64 && vagrant init $VM && vagrant up && vagrant ssh
The GenErode folder was not present in /home/vagrant the first time I entered the virtual machine. If this is the case for you, follow these steps:
- Exit the virtual machine with
CTRL + d
and edit the Vagrantfile, uncommenting the respective line. Make sure you replace /fullpath/to/ with the full path on your Mac:
config.vm.synced_folder "/fullpath/to/vm-singularity/GenErode/", "/home/vagrant/vagrant_GenErode", disabled: false
- Reload the virtual machine (on the host, from the directory containing the Vagrantfile) with
vagrant reload
- Bring up the virtual machine and check that the folder is shared (the guest folder configured in the Vagrantfile should be present) with
vagrant up && vagrant ssh
If this was successful, continue with these steps:
The first time the virtual machine is run, you need to install Miniconda and mamba.
wget https://repo.continuum.io/miniconda/Miniconda3- .
bash Miniconda3-
rm Miniconda3-
. .bashrc
conda install -c conda-forge mamba
The pipeline's conda environment needs to be created the first time you want to run the pipeline:
cd vagrant_GenErode
mamba env create -f environment.yml -n generode
conda activate generode
You can test the pipeline with one of the existing configuration files from the .test/config folder, for example to run the mapping of historical samples to mitochondrial genomes. Edit the config/config.yaml file accordingly (or replace it with the file .test/config/config_mitogenomes.yaml by running cp .test/config/config_mitogenomes.yaml config/config.yaml). Then start a dry run and, if it reports no errors, the main run:
snakemake --cores 1 --use-singularity -npr &> dry_run_test_mitos.out
snakemake --cores 1 --use-singularity &> main_run_test_mitos.out
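Before starting the main run, the dry-run output can be scanned for problems. The sketch below is a heuristic only: it assumes that problems show up on lines containing "error" or "exception" (case-insensitive), which may miss some failures and may flag harmless lines. The function name scan_log_for_errors is hypothetical.

```shell
# Sketch: print suspicious lines of a log file with their line numbers.
# Returns 1 if any such line is found, 0 otherwise.
scan_log_for_errors() {
  if grep -n -i -E 'error|exception' "$1"; then
    return 1
  fi
  return 0
}
```

For example, the following prints a confirmation only when no suspicious line was found:
scan_log_for_errors dry_run_test_mitos.out && echo "no errors found"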
- All results should be stored on your host OS, in the synced folder.
- After a successful main run, test other configurations, e.g. using the other existing test configuration files.
- You can exit the virtual machine by typing
CTRL + d
Whenever you want to bring up the virtual machine again to run the pipeline, move into the folder vm-singularity and type:
vagrant up && vagrant ssh
In the virtual machine, you need to activate the conda environment:
conda activate generode
Now you can run the pipeline as described above.