Make genomic FASTA input optional #1490

Merged on Jan 22, 2025 (26 commits)

Commits
a69d1d2 Update salmon indexing module (pinin4fjords, Jan 21, 2025)
c4a416d Make fasta optional for gtf filtering (pinin4fjords, Jan 21, 2025)
0723238 Allow no fasta during param checks (pinin4fjords, Jan 21, 2025)
7c73f77 Rework prepare_genome for optional fasta (pinin4fjords, Jan 21, 2025)
53a1638 Fix bbsplit param usage for optional fasta (pinin4fjords, Jan 21, 2025)
4bb0af1 Add test for no fasta (pinin4fjords, Jan 21, 2025)
64a4547 lint fix (pinin4fjords, Jan 21, 2025)
d6ef689 Add snap for new test (pinin4fjords, Jan 21, 2025)
cbd5201 Restore output comments (pinin4fjords, Jan 21, 2025)
47b292c Restore input comments (pinin4fjords, Jan 21, 2025)
b5e676b Restore file comment (pinin4fjords, Jan 21, 2025)
b622f53 Restore existence checks (pinin4fjords, Jan 21, 2025)
0fdf742 Remove some unecessary changes (pinin4fjords, Jan 21, 2025)
f139bbe Update changelog (pinin4fjords, Jan 21, 2025)
a9684ea Remove duplicate section (pinin4fjords, Jan 21, 2025)
35ec56c Fix for tweaked filtered GTF name (pinin4fjords, Jan 22, 2025)
0d4ef8f Fix for tweaked filtered GTF name (pinin4fjords, Jan 22, 2025)
ae062b9 Update docs (pinin4fjords, Jan 22, 2025)
efb8e07 Temporarily disable 'latest-everything' testing due to incompatibilit… (pinin4fjords, Jan 22, 2025)
4e2dce2 Merge branch 'optional_fasta' of https://github.com/nf-core/rnaseq in… (pinin4fjords, Jan 22, 2025)
445ca7d Apply suggestions from code review (pinin4fjords, Jan 22, 2025)
c45fbe5 Apply suggestions from code review (pinin4fjords, Jan 22, 2025)
bd585b0 Fix file names in snap (pinin4fjords, Jan 22, 2025)
871644d Merge branch 'optional_fasta' of https://github.com/nf-core/rnaseq in… (pinin4fjords, Jan 22, 2025)
f07b1b1 Update usage.md (pinin4fjords, Jan 22, 2025)
21eb5ad prettier (pinin4fjords, Jan 22, 2025)
1 change: 0 additions & 1 deletion .github/workflows/ci.yml
@@ -51,7 +51,6 @@ jobs:
       matrix:
         NXF_VER:
           - "24.04.2"
-          - "latest-everything"
         nf_test_files: ["${{ fromJson(needs.nf-test-changes.outputs.nf_test_files) }}"]
         profile:
           - "docker"
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -11,7 +11,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 - [PR #1480](https://github.com/nf-core/rnaseq/pull/1480) - Bump version after release 3.18.0
 - [PR #1482](https://github.com/nf-core/rnaseq/pull/1482) - Update trimgalore module for save_unpaired fix
-- [pR #1486](https://github.com/nf-core/rnaseq/pull/1486) - Bump STAR build for multiprocessing fix
+- [PR #1486](https://github.com/nf-core/rnaseq/pull/1486) - Bump STAR build for multiprocessing fix
+- [PR #1490](https://github.com/nf-core/rnaseq/pull/1490) - Make genomic FASTA input optional

 # 3.18.0 - 2024-12-19

15 changes: 8 additions & 7 deletions bin/filter_gtf.py
@@ -6,7 +6,7 @@
 import argparse
 import re
 import statistics
-from typing import Set
+from typing import Optional, Set

 # Create a logger
 logging.basicConfig(format="%(name)s - %(asctime)s %(levelname)s: %(message)s")
@@ -27,14 +27,15 @@ def tab_delimited(file: str) -> float:
     return statistics.median(line.count("\t") for line in data.split("\n"))


-def filter_gtf(fasta: str, gtf_in: str, filtered_gtf_out: str, skip_transcript_id_check: bool) -> None:
+def filter_gtf(fasta: Optional[str], gtf_in: str, filtered_gtf_out: str, skip_transcript_id_check: bool) -> None:
     """Filter GTF file based on FASTA sequence names."""
     if tab_delimited(gtf_in) != 8:
         raise ValueError("Invalid GTF file: Expected 9 tab-separated columns.")

-    seq_names_in_genome = extract_fasta_seq_names(fasta)
-    logger.info(f"Extracted chromosome sequence names from {fasta}")
-    logger.debug("All sequence IDs from FASTA: " + ", ".join(sorted(seq_names_in_genome)))
+    if (fasta is not None):
+        seq_names_in_genome = extract_fasta_seq_names(fasta)
+        logger.info(f"Extracted chromosome sequence names from {fasta}")
+        logger.debug("All sequence IDs from FASTA: " + ", ".join(sorted(seq_names_in_genome)))

     seq_names_in_gtf = set()
     try:
@@ -44,7 +45,7 @@ def filter_gtf(fasta: str, gtf_in: str, filtered_gtf_out: str, skip_transcript_i
                 seq_name = line.split("\t")[0]
                 seq_names_in_gtf.add(seq_name) # Add sequence name to the set

-                if seq_name in seq_names_in_genome:
+                if fasta is None or seq_name in seq_names_in_genome:
                     if skip_transcript_id_check or re.search(r'transcript_id "([^"]+)"', line):
                         out.write(line)
                         line_count += 1
@@ -63,7 +64,7 @@ def filter_gtf(fasta: str, gtf_in: str, filtered_gtf_out: str, skip_transcript_i
 if __name__ == "__main__":
     parser = argparse.ArgumentParser(description="Filters a GTF file based on sequence names in a FASTA file.")
     parser.add_argument("--gtf", type=str, required=True, help="GTF file")
-    parser.add_argument("--fasta", type=str, required=True, help="Genome fasta file")
+    parser.add_argument("--fasta", type=str, required=False, help="Genome fasta file")
     parser.add_argument("--prefix", dest="prefix", default="genes", type=str, help="Prefix for output GTF files")
     parser.add_argument(
         "--skip_transcript_id_check", action="store_true", help="Skip checking for transcript IDs in the GTF file"
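The behaviour this change to `bin/filter_gtf.py` aims for can be illustrated in isolation. The following is a minimal sketch, not the script itself: `filter_gtf_lines`, `genome_seq_names` and the example records are hypothetical names invented for illustration, and the function works on in-memory lines rather than files. Passing `None` for the sequence names stands in for running without `--fasta`, in which case only the `transcript_id` check remains.

```python
import re
from typing import Iterable, List, Optional, Set

# Same pattern as used in the diff above for the transcript_id check.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def filter_gtf_lines(
    gtf_lines: Iterable[str],
    genome_seq_names: Optional[Set[str]] = None,
    skip_transcript_id_check: bool = False,
) -> List[str]:
    """Keep GTF lines that pass the sequence-name and transcript_id checks.

    genome_seq_names=None is a stand-in (assumption) for "no FASTA supplied":
    the sequence-name check is skipped and only the transcript_id check applies.
    """
    kept = []
    for line in gtf_lines:
        seq_name = line.split("\t")[0]
        if genome_seq_names is not None and seq_name not in genome_seq_names:
            continue  # sequence absent from the genome FASTA
        if skip_transcript_id_check or TRANSCRIPT_ID.search(line):
            kept.append(line)
    return kept


# Hypothetical records: one valid, one on an unknown sequence, one without transcript_id.
EXAMPLE_GTF = [
    'chr1\tsrc\texon\t1\t100\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chrUn\tsrc\texon\t1\t100\t.\t+\t.\tgene_id "g2"; transcript_id "t2";',
    'chr1\tsrc\tgene\t1\t100\t.\t+\t.\tgene_id "g1";',
]

with_fasta = filter_gtf_lines(EXAMPLE_GTF, genome_seq_names={"chr1"})
without_fasta = filter_gtf_lines(EXAMPLE_GTF)
print(len(with_fasta), len(without_fasta))
```

Using `None` rather than an empty set keeps "no FASTA supplied" distinct from "FASTA with no sequences", which would otherwise filter out every record.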
22 changes: 19 additions & 3 deletions docs/usage.md
@@ -132,7 +132,7 @@ You also have the option to pseudoalign and quantify your data directly with [Sa

 The library preparation protocol (library type) used by Salmon quantification is inferred by the pipeline based on the information provided in the samplesheet, however, you can override it using the `--salmon_quant_libtype` parameter. You can find the available options in the [Salmon documentation](https://salmon.readthedocs.io/en/latest/library_type.html). Similarly, strandedness is taken from the sample sheet or calculated automatically, and passed to Kallisto on a per-library basis, but you can apply a global override by setting the Kallisto strandedness parameters in `--extra_kallisto_quant_args` like `--extra_kallisto_quant_args '--fr-stranded'` see the [Kallisto documentation](https://pachterlab.github.io/kallisto/manual).

-When running Salmon in mapping-based mode via `--pseudo_aligner salmon` the entire genome of the organism is used by default for the decoy-aware transcriptome when creating the indices (see second bulleted option in [Salmon documentation](https://salmon.readthedocs.io/en/latest/salmon.html#preparing-transcriptome-indices-mapping-based-mode)).
+When running Salmon in mapping-based mode via `--pseudo_aligner salmon`, supplying a genome fasta via `--fasta` and not supplying a Salmon index, the entire genome of the organism is used by default for the decoy-aware transcriptome when creating the indices, as is recommended (see second bulleted option in [Salmon documentation](https://salmon.readthedocs.io/en/latest/salmon.html#preparing-transcriptome-indices-mapping-based-mode)). If you do not supply a FASTA file or an index, Salmon will index without those decoys, using only transcript sequences in the index. This second option is not usually recommended, but may be useful in limited circumstances. Note that Kallisto does not index with genomic sequences.

 Two additional parameters `--extra_star_align_args` and `--extra_salmon_quant_args` were added in v3.10 of the pipeline that allow you to append any custom parameters to the STAR align and Salmon quant commands, respectively. Note, the `--seqBias` and `--gcBias` are not provided to Salmon quant by default so you can provide these via `--extra_salmon_quant_args '--seqBias --gcBias'` if required. You can now also supply additional arguments to Kallisto via `--extra_kallisto_quant_args`.

@@ -209,7 +209,7 @@ When supplying reference files as discussed below, it is important to be consist

 ### Explicit reference file specification (recommended)

-The minimum reference genome requirements for this pipeline are a FASTA and GTF file, all other files required to run the pipeline can be generated from these files. For example, the latest reference files for human can be derived from Ensembl like:
+The minimum reference genome requirements for this pipeline are a FASTA file (genome and/ or transcriptome) and GTF file, all other files required to run the pipeline can be generated from these files. For example, the latest reference files for human can be derived from Ensembl like:

 ```
 latest_release=$(curl -s 'http://rest.ensembl.org/info/software?content-type=application/json' | grep -o '"release":[0-9]*' | cut -d: -f2)
@@ -227,6 +227,7 @@ Notes:
 - If `--gene_bed` is not provided then it will be generated from the GTF file.
 - If `--additional_fasta` is provided then the features in this file (e.g. ERCC spike-ins) will be automatically concatenated onto both the reference FASTA file as well as the GTF annotation before building the appropriate indices.
 - When using `--aligner star_rsem`, both the STAR and RSEM indices should be present in the path specified by `--rsem_index` (see [#568](https://github.com/nf-core/rnaseq/issues/568)).
+- If the `--skip_alignment` option is used along with `--transcript_fasta`, the pipeline can technically run without providing the genomic FASTA (`--fasta`). However, this approach is **not recommended** with `--pseudo_aligner salmon`, as any dynamically generated Salmon index will lack decoys. To ensure optimal indexing with decoys, it is **highly recommended** to include the genomic FASTA (`--fasta`) with Salmon, unless a pre-existing decoy-aware Salmon index is supplied. For more details on the benefits of decoy-aware indexing, refer to the [Salmon documentation](https://salmon.readthedocs.io/en/latest/salmon.html#preparing-transcriptome-indices-mapping-based-mode).

 #### Reference genome

@@ -304,7 +305,7 @@ Notes:

 ### GTF filtering

-By default, the input GTF file will be filtered to ensure that sequence names correspond to those in the genome fasta file, and to remove rows with empty transcript identifiers. Filtering can be bypassed completely where you are confident it is not necessary, using the `--skip_gtf_filter` parameter. If you just want to skip the 'transcript_id' checking component of the GTF filtering script used in the pipeline this can be disabled specifically using the `--skip_gtf_transcript_filter` parameter.
+By default, the input GTF file will be filtered to ensure that sequence names correspond to those in the genome fasta file (where supplied), and to remove rows with empty transcript identifiers. Filtering can be bypassed completely where you are confident it is not necessary, using the `--skip_gtf_filter` parameter. If you just want to skip the 'transcript_id' checking component of the GTF filtering script used in the pipeline this can be disabled specifically using the `--skip_gtf_transcript_filter` parameter.

 ## Contamination screening options

@@ -332,6 +333,21 @@ nextflow run \
     -profile docker
 ```

+You can also run without a genomic FASTA file, provided you skip the alignment step and provide a transcriptome FASTA directly:
+
+```bash
+nextflow run \
+    nf-core/rnaseq \
+    --input <SAMPLESHEET> \
+    --outdir <OUTDIR> \
+    --gtf <GTF> \
+    --transcript_fasta <TRANSCRIPTOME FASTA> \
+    --skip_alignment \
+    -profile docker
+```
+
+This is not usually recommended with Salmon unless you also supply a previously generated decoy-aware Salmon transcriptome index.
+
 > **NB:** Loading iGenomes configuration remains the default for reasons of consistency with other workflows, but should be disabled when not using iGenomes, applying the recommended usage above.

 This will launch the pipeline with the `docker` configuration profile. See below for more information about profiles.
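The usage notes above imply a small set of rules about when the genomic FASTA may be omitted. The sketch below encodes one reading of those rules in Python. It is an assumption-laden illustration and not the pipeline's own parameter checks, which are written in Nextflow and are not shown in this diff: `check_reference_params` and its argument names are hypothetical, and the error and warning texts are invented.

```python
from typing import List, Optional


def check_reference_params(
    fasta: Optional[str],
    transcript_fasta: Optional[str],
    skip_alignment: bool,
    pseudo_aligner: Optional[str] = None,
    salmon_index: Optional[str] = None,
) -> List[str]:
    """Return warnings for a parameter set; raise ValueError if it cannot run.

    Hypothetical helper: the rules follow the usage notes, not the real checks.
    """
    warnings: List[str] = []
    if fasta is None:
        # Without a genome, alignment is impossible and transcripts cannot be derived.
        if not (skip_alignment and transcript_fasta):
            raise ValueError(
                "Without --fasta, both --skip_alignment and --transcript_fasta are needed."
            )
        # A Salmon index built on the fly would contain transcripts only.
        if pseudo_aligner == "salmon" and salmon_index is None:
            warnings.append(
                "Salmon index will be built without decoys; "
                "supply --fasta or a decoy-aware index."
            )
    return warnings


for message in check_reference_params(None, "transcripts.fa", True, pseudo_aligner="salmon"):
    print("WARNING:", message)
```

Returning warnings instead of raising for the decoy case mirrors the docs, which describe that configuration as not recommended rather than invalid.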
2 changes: 1 addition & 1 deletion modules.json
@@ -181,7 +181,7 @@
           },
           "salmon/index": {
             "branch": "master",
-            "git_sha": "49f4e50534fe4b64101e62ea41d5dc43b1324358",
+            "git_sha": "25ddc0bb25292280923eed07e6351789a671e86a",
             "installed_by": ["fastq_subsample_fq_salmon"]
           },
           "salmon/quant": {
8 changes: 6 additions & 2 deletions modules/local/gtf_filter/main.nf
@@ -18,11 +18,15 @@ process GTF_FILTER {
     task.ext.when == null || task.ext.when

     script: // filter_gtf.py is bundled with the pipeline, in nf-core/rnaseq/bin/
+    fasta_text=''
+    if (fasta){
+        fasta_text="--fasta $fasta"
+    }
     """
     filter_gtf.py \\
         --gtf $gtf \\
-        --fasta $fasta \\
-        --prefix ${fasta.baseName}
+        $fasta_text \\
+        --prefix ${gtf.baseName}

     cat <<-END_VERSIONS > versions.yml
     "${task.process}":
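The `GTF_FILTER` change above does two things: it emits `--fasta` only when a FASTA is present, and it derives the output prefix from the GTF rather than the FASTA. A rough Python analogue of that command assembly is sketched here. `build_filter_gtf_command` is a hypothetical helper, and `Path.stem` is used as an approximation of Nextflow's `baseName` (file name without its final extension); the sketch assumes the missing FASTA arrives as a falsy value.

```python
from pathlib import Path
from typing import List, Optional


def build_filter_gtf_command(gtf: str, fasta: Optional[str] = None) -> List[str]:
    """Assemble an argv list for filter_gtf.py, adding --fasta only when one is given."""
    command = ["filter_gtf.py", "--gtf", gtf]
    if fasta:
        command += ["--fasta", fasta]
    # The prefix follows the GTF name, since a FASTA may not exist to name the output after.
    command += ["--prefix", Path(gtf).stem]
    return command


print(build_filter_gtf_command("genes.gtf"))
print(build_filter_gtf_command("genes.gtf", "genome.fa"))
```

Building the optional flag as a separate fragment avoids passing an empty `--fasta` value to the script, which `argparse` would reject as a missing argument.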
27 changes: 17 additions & 10 deletions modules/nf-core/salmon/index/main.nf

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
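Since the diff for this module is not rendered, the following is not the module's code. It is only a sketch of the distinction the usage notes draw between a decoy-aware index (transcripts plus genome, with genome sequence names listed as decoys, as described in the Salmon documentation) and a transcript-only index. `prepare_salmon_index_inputs` and the toy sequences are hypothetical, and sequence names are assumed not to collide between transcriptome and genome.

```python
from typing import Dict, List, Optional, Tuple


def prepare_salmon_index_inputs(
    transcripts: Dict[str, str],
    genome: Optional[Dict[str, str]] = None,
) -> Tuple[Dict[str, str], List[str]]:
    """Return (sequences to index, decoy names).

    With a genome, its sequences are appended after the transcripts and their
    names are listed as decoys. Without one, only transcripts are indexed.
    """
    if genome is None:
        return dict(transcripts), []
    combined = dict(transcripts)
    combined.update(genome)  # transcripts first, genome (decoy) sequences last
    return combined, list(genome)


# Toy data for illustration only.
tx = {"t1": "ACGT", "t2": "GGCC"}
gen = {"chr1": "ACGTGGCCAA"}

seqs, decoys = prepare_salmon_index_inputs(tx, gen)
print(list(seqs), decoys)

seqs_no_genome, decoys_no_genome = prepare_salmon_index_inputs(tx)
print(list(seqs_no_genome), decoys_no_genome)
```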

43 changes: 40 additions & 3 deletions modules/nf-core/salmon/index/tests/main.nf.test

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

31 changes: 23 additions & 8 deletions modules/nf-core/salmon/index/tests/main.nf.test.snap

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
