Commit

Merge pull request #2187 from merenlab/update-sra-download
Update sra_download
mschecht authored Dec 12, 2023
2 parents b1f3d66 + b1aa760 commit 1cb6665
Showing 2 changed files with 45 additions and 10 deletions.
24 changes: 19 additions & 5 deletions anvio/docs/workflows/sra-download.md
@@ -1,4 +1,4 @@
The `sra_download` workflow is a Snakemake workflow that downloads FASTQ files from SRA-accessions using [NCBI sra-tools wiki](https://github.com/ncbi/sra-tools/wiki/08.-prefetch-and-fasterq-dump), gzips them using [pigz](https://zlib.net/pigz/), and provides a %(samples-txt)s. You will need to have these tools installed before you start.
The `sra_download` workflow is a Snakemake workflow that downloads FASTQ files for SRA accessions (e.g. SRR000001 and ERR000001) from [NCBI](https://www.ncbi.nlm.nih.gov/sra) using [NCBI sra-tools](https://github.com/ncbi/sra-tools/wiki/08.-prefetch-and-fasterq-dump), gzips them using [pigz](https://zlib.net/pigz/), and provides a %(samples-txt)s. You will need to have these tools installed before you start.
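
Before launching the workflow, it can help to confirm that the required tools are on your `PATH`. The following is a minimal sketch, not part of anvi'o: the helper name `find_missing_programs` is hypothetical, and the program names are the ones the workflow itself checks for (`prefetch`, `fasterq-dump`, and `pigz`).

```python
import shutil

# Program names taken from the workflow's own installation checks.
# The helper below is a hypothetical convenience, not an anvi'o function.
REQUIRED_PROGRAMS = ["prefetch", "fasterq-dump", "pigz"]


def find_missing_programs(programs=REQUIRED_PROGRAMS, which=shutil.which):
    """Return the subset of `programs` that cannot be located on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return [program for program in programs if which(program) is None]


if __name__ == "__main__":
    missing = find_missing_programs()
    if missing:
        print("Missing programs:", ", ".join(missing))
    else:
        print("All required programs were found.")
```

An empty list means all three programs were found; anything else names what still needs to be installed.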

Let's get started.

@@ -42,15 +42,18 @@ $ cat sra_download_config.json
#### Modify any of the bells and whistles in the config file
{:.notice}
If this is the first time using an anvi'o Snakemake workflow, I would check out [Alon's blog post first](https://merenlab.org/2018/07/09/anvio-snakemake-workflows/#configjson).
If this is the first time using an anvi'o Snakemake workflow, check out [Alon's blog post first](https://merenlab.org/2018/07/09/anvio-snakemake-workflows/#configjson).
Feel free to adjust anything in the config file! Here are some to consider:
- `threads`: this can be optimized for any of the steps depending on the size and number of SRA accessions you are downloading.
- `prefetch` `--max-size`: I already upped the amount from the default 40g but maybe you need more! For reference, I can download TARA Ocean metagenomes with the current parameter. You can use `vdb-dump --info` to learn how much the the `prefetch` step will download e.g. `vdb-dump SRR000001 --info`. Read more about that [here](https://github.com/ncbi/sra-tools/wiki/08.-prefetch-and-fasterq-dump#check-the-maximum-size-limit-of-the-prefetch-tool).
- `prefetch` `--max-size`: The default is 40g but maybe you need more! For reference, this `--max-size` is enough to download TARA Ocean metagenomes. You can use `vdb-dump --info` to learn how much the `prefetch` step will download, e.g. `vdb-dump SRR000001 --info`. Read more about that [here](https://github.com/ncbi/sra-tools/wiki/08.-prefetch-and-fasterq-dump#check-the-maximum-size-limit-of-the-prefetch-tool).
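
If you want to check several accessions before choosing a `--max-size`, the `vdb-dump <accession> --info` call mentioned above can be wrapped in a small script. This is only a sketch: the function names are hypothetical, and it assumes `vdb-dump` from the SRA toolkit is installed and that the accession can be reached over the network.

```python
import shutil
import subprocess


def build_info_command(accession):
    """Build the `vdb-dump <accession> --info` command as an argument list."""
    return ["vdb-dump", accession, "--info"]


def show_accession_info(accession):
    """Run vdb-dump for one accession and return its text output.

    Returns None when vdb-dump is not installed. This function is a
    hypothetical helper, not part of anvi'o or the SRA toolkit.
    """
    if shutil.which("vdb-dump") is None:
        return None
    result = subprocess.run(
        build_info_command(accession),
        capture_output=True,
        text=True,
        check=False,  # leave error handling to the caller
    )
    return result.stdout
```

Calling `show_accession_info("SRR000001")` returns whatever `vdb-dump` reports for that run, which you can then compare against your `--max-size` setting.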
### List of SRA accessions
The input for the `sra_download` workflow is `SRA_accession_list.txt`. This contains a list of your SRA accession you would like to download and it looks like this:
The input for the `sra_download` workflow is `SRA_accession_list.txt`. This file contains the SRA accessions you would like to download, and it looks like this:
{:.warning}
SRA run accessions begin with the prefix `SRR`, `ERR`, or `DRR`, denoting upload to [NCBI](https://www.ncbi.nlm.nih.gov/sra), [EBI](https://www.ebi.ac.uk/ena/browser/home), or DDBJ respectively.
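
To catch a badly formed list before starting the workflow, you could screen the accessions by prefix yourself. The sketch below is an illustration rather than the workflow's own code: `split_accessions` is a hypothetical helper, and the accepted prefixes (`SRR`, `ERR`, `DRR`) mirror the check this commit adds to `anvio/workflows/sra_download/__init__.py`.

```python
# Prefixes accepted by the workflow's accession check in this commit.
RUN_PREFIXES = ("SRR", "ERR", "DRR")


def split_accessions(accessions):
    """Split accessions into (valid, invalid) lists based on their prefix.

    Surrounding whitespace is stripped and blank entries are skipped.
    This is a hypothetical helper, not an anvi'o function.
    """
    valid, invalid = [], []
    for accession in accessions:
        accession = accession.strip()
        if not accession:
            continue
        if accession.startswith(RUN_PREFIXES):
            valid.append(accession)
        else:
            invalid.append(accession)
    return valid, invalid
```

For example, passing the lines of `SRA_accession_list.txt` to `split_accessions` gives you the run accessions in the first list and anything else, such as a BioSample accession, in the second.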
```bash
$ cat SRA_accession_list.txt
@@ -74,4 +77,15 @@ anvi-run-workflow -w sra_download -c sra_download_config.json
### Go big and use an HPC!
The power of Snakemake shines when you can leverage a High Performance Computing system to parallize jobs. Check out the [Snakemake cluster documentation](https://snakemake.readthedocs.io/en/stable/executing/cluster.html#) on how to launch this workflow on your own HPC.
The power of Snakemake shines when you can leverage a High Performance Computing system to parallelize jobs. Check out the [Snakemake cluster documentation](https://snakemake.readthedocs.io/en/stable/executing/cluster.html#) on how to launch this workflow on your own HPC.
## Common use cases
### Download sequencing files associated with an NCBI BioSample
Here is how to use the `sra_download` workflow to download all of the sequencing files from an NCBI BioSample:
1. Search for the [NCBI BioSample](https://www.ncbi.nlm.nih.gov/biosample/) under `All Databases` on the [NCBI website](https://www.ncbi.nlm.nih.gov/).
2. Under `Genomes`, click `SRA`.
3. Send the results to the Run Selector by clicking `Send to:` and then `Run Selector`.
4. Here you can filter for specific sequencing runs in the project, OR you can download the `Metadata` or `Accession list` to get a text file with ALL of the SRA accessions associated with the BioSample. Put the SRA accessions into `SRA_accession_list.txt` and start the workflow!
31 changes: 26 additions & 5 deletions anvio/workflows/sra_download/__init__.py
@@ -29,11 +29,20 @@ class SRADownloadWorkflow(WorkflowSuperClass):
    def __init__(self, args=None, run=terminal.Run(), progress=terminal.Progress()):
        self.init_workflow_super_class(args, workflow_name='sra_download')

        # check that NCBI SRA Toolkit is installed
        if not utils.is_program_exists("prefetch", dont_raise=True) or not utils.is_program_exists("fasterq-dump", dont_raise=True):
            raise ConfigError("'prefetch' and 'fasterq-dump' from the NCBI SRA toolkit must be installed for the "
                              "sra_download workflow to work. Please check out the installation instructions here: "
                              "https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit")
        # check that NCBI SRA Toolkit and other programs are installed
        NCBI_sra_tool_programs = ['prefetch', 'fasterq-dump']
        other_programs = ['pigz']

        for program in NCBI_sra_tool_programs:
            if not utils.is_program_exists(program, dont_raise=True):
                raise ConfigError(f"The program {program} is not installed in your anvi'o conda environment. "
                                  f"'prefetch' and 'fasterq-dump' are from the NCBI SRA toolkit and must be installed for the "
                                  f"sra_download workflow to work. Please check out the installation instructions here: "
                                  f"https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit")
        for program in other_programs:
            if not utils.is_program_exists(program, dont_raise=True):
                raise ConfigError(f"The program {program} is not installed in your anvi'o conda environment. Please "
                                  f"double check you installed all of the programs listed in the anvi'o installation tutorial: https://anvio.org/install/")

        # Snakemake rules
        self.rules.extend(['prefetch',
@@ -82,6 +91,18 @@ def init(self):
            raise ConfigError(f"Looks like your SRA accession list file, {self.SRA_accession_list}, is not properly formatted. "
                              f"This is what we know: {e}")

        for accession in self.accessions_list:
            if not accession.startswith(('SRR', 'ERR', 'DRR')):
                if accession.startswith('SAMEA'):
                    raise ConfigError(f"anvi'o found an NCBI BioSample in your {self.SRA_accession_list}: {accession}. "
                                      f"The anvi'o sra-download workflow only processes sequencing accessions that start with the prefix: ERR, SRR, or DRR. "
                                      f"Search for the BioSample accession '{accession}' on the [NCBI SRA website](https://www.ncbi.nlm.nih.gov/sra) "
                                      f"and find the sequencing accessions.")
                else:
                    raise ConfigError(f"Looks like one of your \"SRA accessions\", {accession}, is not an SRA accession :( "
                                      f"anvi'o asks that you kindly double check your SRA_accession_list.txt ({self.SRA_accession_list}) to confirm you "
                                      f"are using the correct accessions. Hint: SRA accessions start with the prefix: ERR, SRR, or DRR")

        self.target_files = self.get_target_files()

