Merge pull request #2187 from merenlab/update-sra-download

Update sra_download
merenlab · Dec 12, 2023 · 1cb6665
2 parents b1f3d66 + b1aa760

Showing 2 changed files with 45 additions and 10 deletions.
diff --git a/anvio/docs/workflows/sra-download.md b/anvio/docs/workflows/sra-download.md
@@ -1,4 +1,4 @@
-The `sra_download` workflow is a Snakemake workflow that downloads FASTQ files from SRA-accessions using [NCBI sra-tools wiki](https://github.com/ncbi/sra-tools/wiki/08.-prefetch-and-fasterq-dump), gzips them using [pigz](https://zlib.net/pigz/), and provides a %(samples-txt)s. You will need to have these tools installed before you start.
+The `sra_download` workflow is a Snakemake workflow that downloads FASTQ files for SRA accessions (e.g., SRR000001 and ERR000001) from [NCBI](https://www.ncbi.nlm.nih.gov/sra) using [NCBI sra-tools](https://github.com/ncbi/sra-tools/wiki/08.-prefetch-and-fasterq-dump), gzips them using [pigz](https://zlib.net/pigz/), and provides a %(samples-txt)s. You will need to have these tools installed before you start.
 
 Let's get started.
 
@@ -42,15 +42,18 @@ $ cat sra_download_config.json
 #### Modify any of the bells and whistles in the config file
 
 {:.notice}
-If this is the first time using an anvi'o Snakemake workflow, I would check out [Alon's blog post first](https://merenlab.org/2018/07/09/anvio-snakemake-workflows/#configjson).
+If this is the first time using an anvi'o Snakemake workflow, check out [Alon's blog post first](https://merenlab.org/2018/07/09/anvio-snakemake-workflows/#configjson).
 
 Feel free to adjust anything in the config file! Here are some to consider:
 - `threads`: this can be optimized for any of the steps depending on the size and number of SRA accessions you are downloading.
-- `prefetch` `--max-size`: I already upped the amount from the default 40g but maybe you need more! For reference, I can download TARA Ocean metagenomes with the current parameter. You can use `vdb-dump --info` to learn how much the the `prefetch` step will download e.g. `vdb-dump SRR000001 --info`. Read more about that [here](https://github.com/ncbi/sra-tools/wiki/08.-prefetch-and-fasterq-dump#check-the-maximum-size-limit-of-the-prefetch-tool).
+- `prefetch` `--max-size`: The default is 40g but maybe you need more! For reference, this `--max-size` can download TARA Ocean metagenomes. You can use `vdb-dump --info` to learn how much the `prefetch` step will download, e.g. `vdb-dump SRR000001 --info`. Read more about that [here](https://github.com/ncbi/sra-tools/wiki/08.-prefetch-and-fasterq-dump#check-the-maximum-size-limit-of-the-prefetch-tool).
 
 ### List of SRA accessions
 
-The input for the `sra_download` workflow is `SRA_accession_list.txt`. This contains a list of your SRA accession you would like to download and it looks like this:
+The input for the `sra_download` workflow is `SRA_accession_list.txt`. This file contains the list of SRA accessions you would like to download, and it looks like this:
+
+{:.warning}
+SRA run accessions begin with the prefix `SRR`, `ERR`, or `DRR`, denoting submission to [NCBI](https://www.ncbi.nlm.nih.gov/sra), [EBI](https://www.ebi.ac.uk/ena/browser/home), or DDBJ, respectively.
 
 ```bash
 $ cat SRA_accession_list.txt
@@ -74,4 +77,15 @@ anvi-run-workflow -w sra_download -c sra_download_config.json
 
 ### Go big and use an HPC!
 
-The power of Snakemake shines when you can leverage a High Performance Computing system to parallize jobs. Check out the [Snakemake cluster documentation](https://snakemake.readthedocs.io/en/stable/executing/cluster.html#) on how to launch this workflow on your own HPC.
+The power of Snakemake shines when you can leverage a High Performance Computing system to parallelize jobs. Check out the [Snakemake cluster documentation](https://snakemake.readthedocs.io/en/stable/executing/cluster.html#) on how to launch this workflow on your own HPC.
+
+## Common use cases
+
+### Download sequencing files associated with an NCBI BioSample
+
+Here is how to use the `sra_download` workflow to download all of the sequencing files from an NCBI BioSample:
+
+1. Search for the [NCBI BioSample](https://www.ncbi.nlm.nih.gov/biosample/) under `All Databases` on the [NCBI website](https://www.ncbi.nlm.nih.gov/).
+2. Under `Genomes`, click `SRA`.
+3. Send the results to the Run Selector by clicking `Send to:` and then `Run Selector`.
+4. Here you can filter for specific sequencing runs in the project, or click `Metadata` or `Accession list` to download a text file with all of the SRA accessions associated with the BioSample. Put the SRA accessions into `SRA_accession_list.txt` and start the workflow!
diff --git a/anvio/workflows/sra_download/__init__.py b/anvio/workflows/sra_download/__init__.py
@@ -29,11 +29,20 @@ class SRADownloadWorkflow(WorkflowSuperClass):
     def __init__(self, args=None, run=terminal.Run(), progress=terminal.Progress()):
         self.init_workflow_super_class(args, workflow_name='sra_download')
 
-        # check that NCBI SRA Toolkit is installed
-        if not utils.is_program_exists("prefetch", dont_raise=True) or not utils.is_program_exists("fasterq-dump", dont_raise=True):
-            raise ConfigError("'prefetch' and 'fasterq-dump' from the NCBI SRA toolkit must be installed for the "
-                              "sra_download workflow to work. Please check out the installation instructions here: "
-                              "https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit")
+        # check that NCBI SRA Toolkit and other programs are installed
+        NCBI_sra_tool_programs = ['prefetch', 'fasterq-dump']
+        other_programs = ['pigz']
+
+        for program in NCBI_sra_tool_programs:
+            if not utils.is_program_exists(program, dont_raise=True):
+                raise ConfigError(f"The program {program} is not installed in your anvi'o conda environment. "
+                                  f"'prefetch' and 'fasterq-dump' are from the NCBI SRA toolkit and must be installed for the "
+                                  f"sra_download workflow to work. Please check out the installation instructions here: "
+                                  f"https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit")
+        for program in other_programs:
+            if not utils.is_program_exists(program, dont_raise=True):
+                raise ConfigError(f"The program {program} is not installed in your anvi'o conda environment. Please "
+                                  f"double check you installed all of the programs listed in the anvi'o installation tutorial: https://anvio.org/install/")
 
         # Snakemake rules
         self.rules.extend(['prefetch',
@@ -82,6 +91,18 @@ def init(self):
                 raise ConfigError(f"Looks like your SRA accession list file, {self.SRA_accession_list}, is not properly formatted. "
                                   f"This is what we know: {e}")
 
+        for accession in self.accessions_list:
+            if not accession.startswith(('SRR', 'ERR', 'DRR')):
+                if accession.startswith('SAMEA'):
+                    raise ConfigError(f"anvi'o found an NCBI BioSample in your {self.SRA_accession_list}: {accession}. "
+                                      f"The anvi'o sra-download workflow only processes sequencing accessions that start with the prefix: ERR, SRR, or DRR. "
+                                      f"Search for the BioSample accession '{accession}' on the [NCBI SRA website](https://www.ncbi.nlm.nih.gov/sra) "
+                                      f"and find the sequencing accessions.")
+                else:
+                    raise ConfigError(f"Looks like one of your \"SRA accessions\", {accession}, is not an SRA accession :( "
+                                      f"anvi'o asks that you kindly double check your SRA_accession_list.txt ({self.SRA_accession_list}) to confirm you "
+                                      f"are using the correct accessions. Hint: SRA accessions start with the prefix: ERR, SRR, or DRR")
+
         self.target_files = self.get_target_files()
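
The accession check added in the last hunk can be tried outside of anvi'o with a small standalone sketch. It mirrors the prefix logic in the diff (`SRR`, `ERR`, `DRR` for runs, `SAMEA` for a BioSample), but the names `RUN_PREFIXES`, `classify_accession`, and `find_problem_accessions` are hypothetical and are not part of the anvi'o codebase. It also returns labels instead of raising `ConfigError`, so it only illustrates the decision and not the workflow's actual behavior.

```python
# Standalone sketch of the prefix check from the diff above.
# All names here are illustrative and do not exist in anvi'o.

# Prefixes the workflow accepts as sequencing run accessions.
RUN_PREFIXES = ("SRR", "ERR", "DRR")


def classify_accession(accession):
    """Return 'run', 'biosample', or 'unknown' for one accession string."""
    # str.startswith accepts a tuple and matches any of its members.
    if accession.startswith(RUN_PREFIXES):
        return "run"
    # The diff singles out the SAMEA prefix to give a more helpful error.
    if accession.startswith("SAMEA"):
        return "biosample"
    return "unknown"


def find_problem_accessions(accessions):
    """Collect (accession, kind) pairs for every entry that is not a run."""
    problems = []
    for accession in accessions:
        kind = classify_accession(accession)
        if kind != "run":
            problems.append((accession, kind))
    return problems


if __name__ == "__main__":
    # Made-up example entries, not real data from the pull request.
    example = ["SRR000001", "ERR000001", "SAMEA0000001", "not-an-accession"]
    for accession, kind in find_problem_accessions(example):
        print(f"{accession}: {kind}")
```

Running a check like this on `SRA_accession_list.txt` before launching the workflow would surface every problematic entry at once, whereas the workflow code in the diff stops at the first one it encounters.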