Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand GTF filtering to remove rows with empty transcript ID when required, fix STAR GTF usage #1107

Merged
merged 48 commits into from
Nov 15, 2023
Merged
Show file tree
Hide file tree
Changes from 27 commits
Commits
Show all changes
48 commits
Select commit Hold shift + click to select a range
688d19b
Add special GTF handling for stringtie
pinin4fjords Nov 7, 2023
c6d354d
Update CHANGELOG
pinin4fjords Nov 7, 2023
82b67e1
appease eclint
pinin4fjords Nov 7, 2023
5dbbe0c
Comment dollar
pinin4fjords Nov 7, 2023
124c570
Fix multiqc to avoid imp module error
pinin4fjords Nov 3, 2023
57d4bde
Actually, unify GTF filtering
pinin4fjords Nov 9, 2023
7908ca3
Reverse test profile change
pinin4fjords Nov 9, 2023
f791e8a
misc fixes
pinin4fjords Nov 9, 2023
19a72f6
Blackify
pinin4fjords Nov 9, 2023
2cfea9d
move module to right path
pinin4fjords Nov 9, 2023
9b4e140
update changelog
pinin4fjords Nov 9, 2023
ec6f80e
Correct selector
pinin4fjords Nov 9, 2023
25324ba
Make gtf filtering dependent on gtf
pinin4fjords Nov 9, 2023
26e9c1c
Fix multiqc to avoid imp module error
pinin4fjords Nov 3, 2023
251e368
Fix transcript_id pattern
pinin4fjords Nov 9, 2023
6a3ea7e
blackify
pinin4fjords Nov 9, 2023
bb47808
emit chromosome-filtered gtf from genome preparation
pinin4fjords Nov 9, 2023
047f749
[skip ci] fix changelog
pinin4fjords Nov 9, 2023
4153d4a
Merge branch 'stringtie_gtf' of github.com:nf-core/rnaseq into string…
pinin4fjords Nov 9, 2023
a801013
Thank myself to poke CI
pinin4fjords Nov 9, 2023
910ad75
Merge branch 'stringtie_gtf' of github.com:nf-core/rnaseq into string…
pinin4fjords Nov 10, 2023
16430c2
filter_gtf improvements
pinin4fjords Nov 10, 2023
65ac5ef
Blackify
pinin4fjords Nov 10, 2023
c4823ea
[skip ci] Add resolved issues to changelog
pinin4fjords Nov 10, 2023
c2842dc
Merge branch 'dev' into stringtie_gtf
pinin4fjords Nov 10, 2023
7a977bb
Merge remote-tracking branch 'origin/dev' into stringtie_gtf
pinin4fjords Nov 10, 2023
ecb3a14
reinstate publish modes
pinin4fjords Nov 10, 2023
1c23a4a
[skip ci] Reduce logging in GTF filtering
pinin4fjords Nov 14, 2023
0a70d58
[skip ci] Apply suggestions from code review
pinin4fjords Nov 14, 2023
996abcc
[skip ci] post-suggestion integration work
pinin4fjords Nov 14, 2023
01e05f7
[skip ci] more post-review tidy-up to GTF filtering script
pinin4fjords Nov 14, 2023
556af91
[skip ci] reinstate logging
pinin4fjords Nov 14, 2023
ad998c4
[skip ci] reinstate logging
pinin4fjords Nov 14, 2023
80a9543
[skip ci] Fix salmon subworkflow call for gtf
pinin4fjords Nov 14, 2023
519e42a
[skip ci] also raise value for no transcript ids
pinin4fjords Nov 14, 2023
5e1378e
[skip ci] @drpatelh feedback: Try to avoid filtering process where no…
pinin4fjords Nov 14, 2023
fb70fdf
Simplify GTF filtering
pinin4fjords Nov 15, 2023
5c213c5
Blackify
pinin4fjords Nov 15, 2023
6d55142
Document GTF filtering
pinin4fjords Nov 15, 2023
6e1454b
[automated] Fix linting with Prettier
nf-core-bot Nov 15, 2023
d1ce4f9
Merge branch 'dev' into stringtie_gtf
pinin4fjords Nov 15, 2023
aff1d2e
Bump modules config for merged module
pinin4fjords Nov 15, 2023
f7d078c
Bump pipeline version to 3.13.0
drpatelh Nov 15, 2023
1c6012e
[skip ci] Changes from local review
drpatelh Nov 15, 2023
e7126c8
Move --skip_* parameters to the same section
drpatelh Nov 15, 2023
13cc0e9
[skip ci] Tidy up code block
drpatelh Nov 15, 2023
3ac4a3a
Move filter_gtf logic into main workflow
drpatelh Nov 15, 2023
b67558d
[automated] Fix linting with Prettier
nf-core-bot Nov 15, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@ Special thanks to the following for their contributions to the release:
- [Júlia Mir Pedrol](https://github.com/mirpedrol)
- [Matthias Zepper](https://github.com/MatthiasZepper)
- [Maxime Garcia](https://github.com/maxulysse)
- [Jonathan Manning](https://github.com/pinin4fjords)

Thank you to everyone else that has contributed by reporting bugs, enhancements or in any other way, shape or form.

Expand All @@ -28,6 +29,10 @@ Thank you to everyone else that has contributed by reporting bugs, enhancements
- [PR #1083](https://github.com/nf-core/rnaseq/pull/1083) - Move local modules and subworkflows to subfolders
- [PR #1088](https://github.com/nf-core/rnaseq/pull/1088) - Updates contributing and code of conduct documents with nf-core template 2.10
- [PR #1091](https://github.com/nf-core/rnaseq/pull/1091) - Reorganise parameters in schema for better usability
- [PR #1107](https://github.com/nf-core/rnaseq/pull/1107) - Expand GTF filtering to remove rows with empty transcript ID when required, fix STAR GTF usage
- [#1082](https://github.com/nf-core/rnaseq/issues/1082) - More informative error message for filter_gtf_for_genes_in_genome.py
- [#1102](https://github.com/nf-core/rnaseq/issues/1102) - gene entries with empty transcript_id fields
- [#1074](https://github.com/nf-core/rnaseq/issues/1074) - Enable quantification using StringTie AND a custom Ensembl genome

### Software dependencies

Expand Down
119 changes: 119 additions & 0 deletions bin/filter_gtf.py
pinin4fjords marked this conversation as resolved.
Show resolved Hide resolved
Original file line number Diff line number Diff line change
@@ -0,0 +1,119 @@
#!/usr/bin/env python
from __future__ import print_function
import logging
from itertools import groupby
import argparse
import re

# Create a logger
logging.basicConfig(format="%(name)s - %(asctime)s %(levelname)s: %(message)s")
logger = logging.getLogger(__file__)
logger.setLevel(logging.INFO)


def is_header(line: str) -> bool:
"""Returns True if the given line is a header line in a FASTA file."""
return line[0] == ">"


def extract_fasta_seq_names(fasta_name: str) -> set:
"""Extracts the sequence names from a FASTA file.

modified from Brent Pedersen
Correct Way To Parse A Fasta File In Python
given a fasta file. yield tuples of header, sequence
from https://www.biostars.org/p/710/

Args:
fasta_name: The path to the FASTA file.

Returns:
A set of the sequence names in the FASTA file.
"""

# first open the file outside
fh = open(fasta_name)

# ditch the boolean (x[0]) and just keep the header or sequence since
# we know they alternate.
faiter = (x[1] for x in groupby(fh, is_header))

for i, header in enumerate(faiter):
line = next(header)
if is_header(line):
# drop the ">"
headerStr = line[1:].strip().split()[0]
yield headerStr
pinin4fjords marked this conversation as resolved.
Show resolved Hide resolved


def extract_genes_in_genome(fasta: str, gtf_in: str, gtf_out: str) -> None:
"""Extracts the genes in the genome from a GTF file.

Args:
fasta: The path to the FASTA file.
gtf_in: The path to the input GTF file.
gtf_out: The path to the output GTF file.

Raises:
ValueError: If no overlap is found or if the GTF file is not tab delimited.
"""

def is_tab_delimited(file):
with open(file, "r") as f:
return "\t" in f.readline()
pinin4fjords marked this conversation as resolved.
Show resolved Hide resolved

if not is_tab_delimited(gtf_in):
raise ValueError("The GTF file is not tab delimited.")
pinin4fjords marked this conversation as resolved.
Show resolved Hide resolved

seq_names_in_genome = set(extract_fasta_seq_names(fasta))
pinin4fjords marked this conversation as resolved.
Show resolved Hide resolved
logger.info(f"Extracted chromosome sequence names from {fasta}")
pinin4fjords marked this conversation as resolved.
Show resolved Hide resolved
logger.info("All chromosome names: " + ", ".join(sorted(seq_names_in_genome)))

with open(gtf_in) as gtf, open(gtf_out, "w") as out:
seq_names_in_gtf = {line.split("\t")[0] for line in gtf if line.strip()}
overlap = seq_names_in_genome & seq_names_in_gtf
if not overlap:
raise ValueError("No overlapping scaffolds found.")

gtf.seek(0) # Reset file pointer to the start of the file
for line in gtf:
if line.split("\t")[0] in overlap:
out.write(line)

logger.info(f"Extracted {len(overlap)} matching sequences from {gtf_in} into {gtf_out}")
logger.info("All sequence IDs from GTF: " + ", ".join(sorted(seq_names_in_gtf)))
logger.info(f"Wrote matching lines to {gtf_out}")


def remove_features_without_transcript_id(gtf_in: str, gtf_out: str) -> None:
"""
Removes gene rows with absent or empty transcript_id attributes from a GTF file.

Args:
gtf_in: Path to the input GTF file.
gtf_out: The path to the output GTF file.
"""

with open(gtf_in, "r") as f_in, open(gtf_out, "w") as f_out:
for line in f_in:
transcript_id_match = re.search(r'transcript_id "([^"]+)"', line)
if transcript_id_match:
f_out.write(line)
pinin4fjords marked this conversation as resolved.
Show resolved Hide resolved


if __name__ == "__main__":
parser = argparse.ArgumentParser(description="""Filter GTF for various reasons""")
parser.add_argument("--gtf", type=str, help="GTF file")
parser.add_argument("--fasta", type=str, help="Genome fasta file")
parser.add_argument(
"-p",
"--prefix",
dest="prefix",
default="genes",
type=str,
help="Prefix for output GTF files",
)

args = parser.parse_args()
extract_genes_in_genome(args.fasta, args.gtf, args.prefix + "_in_genome.gtf")
remove_features_without_transcript_id(args.prefix + "_in_genome.gtf", args.prefix + "_with_transcript_ids.gtf")
78 changes: 0 additions & 78 deletions bin/filter_gtf_for_genes_in_genome.py

This file was deleted.

4 changes: 3 additions & 1 deletion conf/modules.config
Original file line number Diff line number Diff line change
Expand Up @@ -108,7 +108,7 @@ process {
]
}

withName: 'GTF_GENE_FILTER' {
withName: 'GTF_FILTER' {
publishDir = [
path: { "${params.outdir}/genome" },
mode: params.publish_dir_mode,
Expand Down Expand Up @@ -155,13 +155,15 @@ process {
ext.args = '--record-count 1000000 --seed 1'
ext.prefix = { "${meta.id}.subsampled" }
publishDir = [
mode: params.publish_dir_mode,
pinin4fjords marked this conversation as resolved.
Show resolved Hide resolved
enabled: false
]
}

withName: '.*:FASTQ_SUBSAMPLE_FQ_SALMON:SALMON_QUANT' {
ext.args = '--skipQuant'
publishDir = [
mode: params.publish_dir_mode,
enabled: false
]
}
Expand Down
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
process GTF_GENE_FILTER {
process GTF_FILTER {
tag "$fasta"

conda "conda-forge::python=3.9.5"
Expand All @@ -11,18 +11,19 @@ process GTF_GENE_FILTER {
path gtf

output:
path "*.gtf" , emit: gtf
path "versions.yml", emit: versions
path "*_in_genome.gtf" , emit: genome_gtf
path "*_with_transcript_ids.gtf" , emit: transcript_id_gtf
path "versions.yml" , emit: versions

when:
task.ext.when == null || task.ext.when

script: // filter_gtf_for_genes_in_genome.py is bundled with the pipeline, in nf-core/rnaseq/bin/
"""
filter_gtf_for_genes_in_genome.py \\
filter_gtf.py \\
--gtf $gtf \\
--fasta $fasta \\
-o ${fasta.baseName}_genes.gtf
--prefix ${fasta.baseName}

cat <<-END_VERSIONS > versions.yml
"${task.process}":
Expand Down
6 changes: 3 additions & 3 deletions modules/local/multiqc/main.nf
Original file line number Diff line number Diff line change
@@ -1,10 +1,10 @@
process MULTIQC {
label 'process_medium'

conda "bioconda::multiqc=1.15"
conda "bioconda::multiqc=1.17"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/multiqc:1.15--pyhdfd78af_0' :
'biocontainers/multiqc:1.15--pyhdfd78af_0' }"
'https://depot.galaxyproject.org/singularity/multiqc:1.17--pyhdfd78af_0' :
'biocontainers/multiqc:1.17--pyhdfd78af_0' }"
pinin4fjords marked this conversation as resolved.
Show resolved Hide resolved

input:
path multiqc_config
Expand Down
73 changes: 41 additions & 32 deletions subworkflows/local/prepare_genome/main.nf
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@ include { RSEM_PREPAREREFERENCE as MAKE_TRANSCRIPTS_FASTA } from '../../..
include { PREPROCESS_TRANSCRIPTS_FASTA_GENCODE } from '../../../modules/local/preprocess_transcripts_fasta_gencode'
include { GTF2BED } from '../../../modules/local/gtf2bed'
include { CAT_ADDITIONAL_FASTA } from '../../../modules/local/cat_additional_fasta'
include { GTF_GENE_FILTER } from '../../../modules/local/gtf_gene_filter'
include { GTF_FILTER } from '../../../modules/local/gtf_filter'
include { STAR_GENOMEGENERATE_IGENOMES } from '../../../modules/local/star_genomegenerate_igenomes'

workflow PREPARE_GENOME {
Expand Down Expand Up @@ -68,22 +68,31 @@ workflow PREPARE_GENOME {
//
// Uncompress GTF annotation file or create from GFF3 if required
//
if (gtf) {
if (gtf.endsWith('.gz')) {
ch_gtf = GUNZIP_GTF ( [ [:], gtf ] ).gunzip.map { it[1] }
ch_versions = ch_versions.mix(GUNZIP_GTF.out.versions)
} else {
ch_gtf = Channel.value(file(gtf))
}
} else if (gff) {
if (gff.endsWith('.gz')) {
ch_gff = GUNZIP_GFF ( [ [:], gff ] ).gunzip.map { it[1] }
ch_versions = ch_versions.mix(GUNZIP_GFF.out.versions)
} else {
ch_gff = Channel.value(file(gff))
if (gtf || gff) {
if (gtf) {
if (gtf.endsWith('.gz')) {
ch_gtf = GUNZIP_GTF ( [ [:], gtf ] ).gunzip.map { it[1] }
ch_versions = ch_versions.mix(GUNZIP_GTF.out.versions)
} else {
ch_gtf = Channel.value(file(gtf))
}
} else if (gff) {
if (gff.endsWith('.gz')) {
ch_gff = GUNZIP_GFF ( [ [:], gff ] ).gunzip.map { it[1] }
ch_versions = ch_versions.mix(GUNZIP_GFF.out.versions)
} else {
ch_gff = Channel.value(file(gff))
}
ch_gtf = GFFREAD ( ch_gff ).gtf
ch_versions = ch_versions.mix(GFFREAD.out.versions)
}
ch_gtf = GFFREAD ( ch_gff ).gtf
ch_versions = ch_versions.mix(GFFREAD.out.versions)

//
// Apply filtering we may need for GTFs
//
GTF_FILTER ( ch_fasta, ch_gtf )
ch_gtf_with_transcript_ids = GTF_FILTER.out.transcript_id_gtf
ch_gtf_genome = GTF_FILTER.out.genome_gtf
}

//
Expand Down Expand Up @@ -133,9 +142,8 @@ workflow PREPARE_GENOME {
ch_versions = ch_versions.mix(PREPROCESS_TRANSCRIPTS_FASTA_GENCODE.out.versions)
}
} else {
ch_filter_gtf = GTF_GENE_FILTER ( ch_fasta, ch_gtf ).gtf
ch_transcript_fasta = MAKE_TRANSCRIPTS_FASTA ( ch_fasta, ch_filter_gtf ).transcript_fasta
ch_versions = ch_versions.mix(GTF_GENE_FILTER.out.versions)
ch_transcript_fasta = MAKE_TRANSCRIPTS_FASTA ( ch_fasta, ch_gtf_genome ).transcript_fasta
ch_versions = ch_versions.mix(GTF_FILTER.out.versions)
ch_versions = ch_versions.mix(MAKE_TRANSCRIPTS_FASTA.out.versions)
}

Expand Down Expand Up @@ -259,18 +267,19 @@ workflow PREPARE_GENOME {
}

emit:
fasta = ch_fasta // channel: path(genome.fasta)
gtf = ch_gtf // channel: path(genome.gtf)
fai = ch_fai // channel: path(genome.fai)
gene_bed = ch_gene_bed // channel: path(gene.bed)
transcript_fasta = ch_transcript_fasta // channel: path(transcript.fasta)
chrom_sizes = ch_chrom_sizes // channel: path(genome.sizes)
splicesites = ch_splicesites // channel: path(genome.splicesites.txt)
bbsplit_index = ch_bbsplit_index // channel: path(bbsplit/index/)
star_index = ch_star_index // channel: path(star/index/)
rsem_index = ch_rsem_index // channel: path(rsem/index/)
hisat2_index = ch_hisat2_index // channel: path(hisat2/index/)
salmon_index = ch_salmon_index // channel: path(salmon/index/)
fasta = ch_fasta // channel: path(genome.fasta)
gtf = ch_gtf_genome // channel: path(genome.gtf)
fai = ch_fai // channel: path(genome.fai)
gene_bed = ch_gene_bed // channel: path(gene.bed)
gtf_with_transcript_ids = ch_gtf_with_transcript_ids // channel: path(gtf)
transcript_fasta = ch_transcript_fasta // channel: path(transcript.fasta)
chrom_sizes = ch_chrom_sizes // channel: path(genome.sizes)
splicesites = ch_splicesites // channel: path(genome.splicesites.txt)
bbsplit_index = ch_bbsplit_index // channel: path(bbsplit/index/)
star_index = ch_star_index // channel: path(star/index/)
rsem_index = ch_rsem_index // channel: path(rsem/index/)
hisat2_index = ch_hisat2_index // channel: path(hisat2/index/)
salmon_index = ch_salmon_index // channel: path(salmon/index/)

versions = ch_versions.ifEmpty(null) // channel: [ versions.yml ]
versions = ch_versions.ifEmpty(null) // channel: [ versions.yml ]
}
Loading
Loading