Skip to content

Commit

Permalink
v0.2.0 (#39)
Browse files Browse the repository at this point in the history
* filter_tx_by_cond - filters individual GTF files. Currently untested

* initial commit new script to extract candidate novel last exons

* get_novel_last_exons - _df_add not _df_annotate (script still untested

* functions to id extension & spliced events (untested)

* updates to get script working (internal seem to be missing?)

* Fix logic of extract_format_exons_introns. Fix event type classification (don't add region rank to novel obj)

* use read_gtf_restricted -saves ~30s reading in ref GTF vs full

* if tx has multiple event types assigned, collapse to ',' sep string

* output spliced last exons not last SJs

* output spliced last exons not last SJs

* skeleton of merge last exons script

* working script to merge last_exon GTFs, assign sample & last_exon IDs

* Script starts with last exons GTF, remove input match stats TSV. Report motif dist from 3'end. Select 3'end with min deviation from exptd position for motif only. Works at CL

* select repr atlas site. Update 3'end to nearest atlas. Output updated/selected les to gtf, summary stats are per last exon ID

* update function name for clarity of task

* fnc to read gtf with extraction of custom GTF attributes only

* initial commit get_quant_gtf - getting annoying read_gtf error. _fetch_attributes regex needs fixing (match space)

* update conda env

* get le ids for ext & no ext genes. Internal extension unique regions be weird...

* output GTFs and tx2le, le2gene dfs

* fix regex in _fetch_attributes - extracts correctly if has suffixed cols

* add command line args

* add option to not add last exon ID col

* fixes to filter_tx_by_condition_tpm.py

* I give up

* fix check_stranded to ensure outputs stranded gr

* list of conditions without duplicates

* fix check_per_sample_mean_tpm_filtered. Some log name corrections

* output 'not_found' string if no motifs founds - empty str attributes in GTF cause read error in future step (bug)

* check_concat to ensure all dfs of gr have same columns

* manually set strandedness for pr.subtract (bug in PyRanges)

* Report n IDs dropped due to ref containment. sort assign dfs by le_id

* tx_to_polya_quant.R - update CL help strings

* rename ref_gene_id to gene_id in tx2gene, le2gene output

* remove hardcoding of expected distance from 3'end

* get_novel_last_exons.py - enforce strandedness="same" for pr.join calls

* scripts/tx_to_polya_quant.R - fix variable name reference

* Reorganise output so consistent x folders. Label stie tx_ids with sample ID (prevent edge case of same ID x samples (causes salmon index error))

* update cluster & rule group assignments

* Output PPAU & gene total TPM matrices. Update R conda env. Set all count matrices as pipeline targets

* options to switch on/off 5'end matching (exts & spliced). Exclude extension failing filter events from check for spliced events

* output last exons not last 100nt. More strict on retained cols. Add 3p dist for min deviation motif as attribute. Report pairing counts for atlas/motif filters.

* add flags to switch on/off 5'end matching (SJs) & 3'end matching

* add new ward i3 2022 sample table

* add ward i3 2022 sample table with reprocessed bams

* remove duplicate Start -1 in read_gtf_specific (main repo PR 260)

* correct config key for point features file

* add option to make point features from standard BED

* add function to collapse metadata cols for given id column

* collapse_metadata - add check for collapse_cols found in individual df

* get_novel_last_exons - try to fill NAs before output as GTF (causes parsing errors downstream)

* get_novel_last_exons - Stop empty strings being output in attribute column

* add option to collapse metadata cols whilst dropping duplicates

* Add check for n column consistency in all dfs of PyRanges

* drop 3'end filtering group rule...

* fix bug with ','-sep novel ref_gene_ids strings being treated as separate genes to ref. Output info df with event types, coordinates, annot status etc.

* Add checks for empty gr (for multi-ref gene matching)

* drop 1-isoform genes with ref_gene_id not gene_id (novel isos dropped)

* update cli to le2gene (and help message)

* command line script to run basic diff usage analysis

* quantify single-end files with Salmon

* Optionally treat specifically labelled reference tx IDs as novel extensions

* remove refs to multiple combos of stringtie parameters

* initial commit generate test GTFs and genome FASTA

* ensure txipts have/don't have PAS motifs as req. Output txipts FASTA

* tidy up tx names etc. for test trs

* script to generate simulated FQs for test transcripts

* Generate test PAS atlas BED file

* add tiny test data to repo (successful test run). Add check for consistent ncols in filter_tx_by_three_end

* read_gtf_specific checks order of keys before extracting from attribute (prevents some parsing bugs)

* report gene name in output GTF (as 'gene_name_ref')

* report gene name in combined quant GTF. Also output le2genename TSV. Remove a few unnecessary keys from attribute output

* quick note in config on hardcoded label

* report summary dbrn of distances between predicted 3'ends and nearest atlas PAS

* classify event types for annotated last exons. Report coords of full LE in .info.txt. Format with black

* add diff usage with saturn script/option to pipeline (works with test data)

* update gitignore to ignore all suffixed (default) test data output dirs

* initial commit script to tidy up saturn tbl & add annot & ppau info

* add process saturn results tbl script to pipeline (works with test data)

* (incomplete) updates to readme & other documentation

* initial commit - untested script to generate quant GTF from reference only. Add option to pipeline to only quantify

* bug fix in annotate_le_ids_ref to get successful test run

* fix annotation of event_types for ref events

* add gene with just ref txipts to test data. Add a tx with a 'ref_extension_string'

* update test data (reads & alignment) to include gene with just ref ALEs.

* update test data - alignment mistake missed out chr3 reads. Also update FCs

* option to skip identifying novel LEs & running differential testing. fix treating specific ref txs as separate les. All runs successful with test data

* fix check for no extensions with provided string to raise exception properly

* fix collapsing annotation info for le_ids bug. le_ids with inconsistent unique vals are not collapsed

* update gitignore

* fix gene_id_ref collapsing in case of NaNs

* fix reporting count of events with ref_extension_string

* Add option to use input GTF of last exons to combine with ref GTF. Extra checks on boolean flags for pipeline specification. Successful dry run

* add script to combine & annotate predicted last exons from multiple experiments into a single gtf

* Add an explicit check for hyphen characters in sample_name IDs from sample_tbl before declaring pipeline run

* document requirement for no hyphens in sample_name column of sample table

* optionally pass bias/other cl flags to salmon quant

* dryrun - option to use precomputed salmon index and skip redundant construction

* pass config file as command line arg

* provide conda prefix as command line arg

* correct salmon index path in params if use precomputed index

* example of saturn output

* Replace SatuRn with DEXSeq (#47)

* switch saturn for dexseq in differential_apa

* remove mentions to saturn and associated scripts

* add dexseq_apa to cluster.yaml

* Clean up README

* update readme

* update docs on quant & differential output files

* add note on specifying base condition in sample table

* add docs on final combined novel GTF

* add docs on combined GTFs

* notes on ID mapping files. Remove redundancy for event_type descriptions

* remove sge profile (not working)

* minor updates to README

* remove deprecated parameters

* link to example files

* remove empty md

* remove deprecated scripts

* add license information

* tiny formatting change
  • Loading branch information
SamBryce-Smith authored Jan 19, 2024
1 parent e2f7a81 commit 04a10a0
Show file tree
Hide file tree
Showing 92 changed files with 13,827 additions and 13,932 deletions.
11 changes: 10 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -1,4 +1,3 @@

*.log
.snakemake/.DS_Store
.DS_Store
Expand All @@ -12,3 +11,13 @@ data/annotation/
data/Cont*
data/TDP43*
scripts/.ipynb_checkpoints/
.Rproj.user
test_data_output*
tests/test_aligned/STAR_aligned/*.out
tests/test_aligned/STAR_aligned/*.out.t*
tests/test_aligned/STAR_aligned/*_STAR*
tests/test_aligned/fastp_trimmed/*
tests/test_aligned/logs/*
tests/test_aligned/qc/*
tests/test_aligned/salmon_quant/*
debug*
621 changes: 621 additions & 0 deletions LICENSE

Large diffs are not rendered by default.

13 changes: 13 additions & 0 deletions PAPA.Rproj
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
Version: 1.0

RestoreWorkspace: Default
SaveWorkspace: Default
AlwaysSaveHistory: Default

EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8

RnwWeave: Sweave
LaTeX: pdfLaTeX
150 changes: 108 additions & 42 deletions README.md

Large diffs are not rendered by default.

182 changes: 71 additions & 111 deletions Snakefile
Original file line number Diff line number Diff line change
@@ -1,149 +1,109 @@
import os
import pandas as pd
# PAPA Snakemake pipeline
# Copyright (C) 2024 Sam Bryce-Smith [email protected]

configfile: "config/config.yaml"
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.

def param_list(param):
'''
Return list of all param values converted to string
If param is not a list/iterable, coerced to a single value list
'''
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.

try:
param = list(param)
out = [str(p) for p in param]
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.

except TypeError:
# Not an iterable
out = [str(param)]

return out

import pandas as pd
import os
import sys

assert config["expression_merge_by"] in ["polyA", "last_exon"], f"'expression_merge_by' must be one of 'polyA' or 'last_exon' - {config['expression_merge_by']} was passed"
configfile: "config/config.yaml"

include: "rules/parse_config.py"

sample_tbl = pd.read_csv(config["sample_tbl"], index_col="sample_name")

# Check fastq2, if all empty then dataset is single end
if sample_tbl['fastq2'].isna().all():
single_end = True
else:
single_end = False

# Now double check that sample_name fields do not contain hyphens ('-') - will break column selection with R
if sample_tbl.index.str.contains("-", regex=False).any():
raise Exception(f"Values in 'sample_name' column of sample table must not contain hyphen characters ('-')")

SAMPLES = sample_tbl.index.tolist()
CONDITIONS = sample_tbl["condition"].tolist()
CONDITIONS = sample_tbl["condition"].drop_duplicates().tolist()
OPTIONS = sample_tbl.to_dict(orient="index")

GTF = config["annotation_gtf"]

# Make sure it has a slash at end of path
OUTPUT_DIR = os.path.join(config["main_output_dir"], "")
STRINGTIE_SUBDIR = os.path.join(OUTPUT_DIR, config["stringtie_subdir_name"], "")
TX_FILT_SUBDIR = os.path.join(OUTPUT_DIR, config["tx_filtering_subdir_name"], "")
SALMON_SUBDIR = os.path.join(OUTPUT_DIR, config["salmon_subdir_name"], "")
LOG_SUBDIR = os.path.join(OUTPUT_DIR, config["logs_subdir_name"], "")
BMARK_SUBDIR = os.path.join(OUTPUT_DIR, config["benchmarks_subdir_name"], "")
DAPA_SUBDIR = os.path.join(OUTPUT_DIR, config["diff_apa_subdir_name"], "")

min_frac_vals = param_list(config["min_isoform_fraction_abundance"])
min_jnc_vals = param_list(config["min_junction_reads"])
min_cov_vals = param_list(config["min_txipt_coverage"])

# print(OPTIONS)
# print(min_frac_vals)
# print(min_jnc_vals)
# print(min_cov_vals)
## Double check switches for workflow control are valid

# First check all are True/False booleans
assert isinstance(config["run_identification"], bool), f"'run_identification' must be True/False boolean, {config['run_identification']} (type {type(config['run_identification'])}) was provided"
assert isinstance(config["run_differential"], bool), f"'run_differential' must be True/False boolean, {config['run_differential']} (type {type(config['run_differential'])}) was provided"
assert isinstance(config["filter_ref_gtf"], bool), f"'filter_ref_gtf' must be True/False boolean, {config['filter_ref_gtf']} (type {type(config['filter_ref_gtf'])}) was provided"
assert isinstance(config["use_provided_novel_les"], bool), f"'use_provided_novel_les' must be True/False boolean, {config['use_provided_novel_les']} (type {type(config['use_provided_novel_les'])}) was provided"
assert isinstance(config["use_precomputed_salmon_index"], bool), f"'use_precomputed_salmon_index' must be True/False boolean, {config['use_precomputed_salmon_index']} (type {type(config['use_precomputed_salmon_index'])}) was provided"

# Now double check that valid combination of use_provided_novel_les & run_identification is provided (if applicable)
if config["use_provided_novel_les"]:
assert config["use_provided_novel_les"] and config["run_identification"], f"'use_provided_novel_les' is set to True but 'run_identification' is not. To continue using input novel last exons please set 'run_identification' to True"

# make a note if run_identification & use_precomputed_salmon_index are both True - run_id will be overrided and provided files used
if config["run_identification"] and config["use_precomputed_salmon_index"]:
sys.stderr.write("run_identification will be overridden and pre-provided salmon index, id and info tables will be used\n")


# If differential, make sure that sample table only has two conditions & to set a contrast name
if config["run_differential"]:
assert sample_tbl["condition"].nunique() == 2, f"condition column in sample table must only contain two distinct conditions, following n found - {sample_tbl['condition'].nunique()}"
# firs key in sample table condition column = base_key
BASE_KEY = sample_tbl["condition"][0]
CONTRAST_KEY = sample_tbl.loc[sample_tbl["condition"] != BASE_KEY, "condition"][0]
CONTRAST_NAME = CONTRAST_KEY + "vs" + BASE_KEY
sys.stderr.write(f"Inferred base key for condition - {BASE_KEY}\n")
sys.stderr.write(f"Inferred contrast key for condition - {CONTRAST_KEY}\n")
sys.stderr.write(f"Constructed contrast name - {CONTRAST_NAME}\n")


include: "rules/filter_gtf.smk"
include: "rules/stringtie.smk"
include: "rules/salmon.smk"
include: "rules/tx_filtering.smk"
include: "rules/salmon.smk"
include: "rules/differential_apa.smk"

# sys.stderr.write(OPTIONS + "\n")

localrules: all, gtf_list_by_condition, gtf_list_all_tpm_filtered
localrules: all, gtf_list_by_condition, gtf_list_all_tpm_filtered, check_per_sample_mean_tpm_filtered, make_formulas_txt

wildcard_constraints:
sample = "|".join(SAMPLES),
condition = "|".join(CONDITIONS)

rule all:
input:
# expand(os.path.join(STRINGTIE_SUBDIR,
# "min_jnc_{min_jnc}",
# "min_frac_{min_frac}",
# "min_cov_{min_cov}",
# # "{condition}",
# "{condition}.min_mean_tpm_filtered.gtf"),
# condition=CONDITIONS,
# min_jnc=min_jnc_vals,
# min_frac=min_frac_vals,
# min_cov=min_cov_vals),
expand(os.path.join(SALMON_SUBDIR,
"pas_quant",
"min_jnc_{min_jnc}",
"min_frac_{min_frac}",
"min_cov_{min_cov}",
"summarised_pas_quantification.tsv"),
min_jnc=min_jnc_vals,
min_frac=min_frac_vals,
min_cov=min_cov_vals),
# expand(os.path.join(SALMON_SUBDIR, "quant", "min_jnc_{min_jnc}", "min_frac_{min_frac}", "min_cov_{min_cov}", "{sample}","quant.sf"),
rules.process_dexseq_tbl.output if config["run_differential"] else rules.tx_to_le_quant.output.ppau,
rules.tx_to_le_quant.output.counts,
os.path.join(DAPA_SUBDIR,
"summarised_pas_quantification.tpm.tsv"),
os.path.join(DAPA_SUBDIR,
"summarised_pas_quantification.gene_tpm.tsv"),
# expand(os.path.join(SALMON_SUBDIR, "quant", "{sample}", "quant.sf"),
# sample=SAMPLES,
# min_jnc=min_jnc_vals,
# min_frac=min_frac_vals,
# min_cov=min_cov_vals
# ),
# expand(os.path.join(STRINGTIE_SUBDIR,
# "min_jnc_{min_jnc}",
# "min_frac_{min_frac}",
# "min_cov_{min_cov}",
# "tpm_filtered.intron_chain_filtered.3p_end_filtered.all_samples.combined.gtf"),
# min_jnc=min_jnc_vals,
# min_frac=min_frac_vals,
# min_cov=min_cov_vals)


def get_bam(sample, options, output_dir):
'''
Returns path to input bam file for given sample
If sample will undergo additional processing (not yet implemented), path will be output_dir/<processing_step>/{sample}.bam
If sample will not go additional processing, returns the path provided in the sample table/options
params:
sample <str>
name of sample (in pipeline context should usually pass as wildcards.sample)
options <dict>
dict of {sample: {param1: x, param2: y}} generated from input sample table
output_dir <str>
path to main output directory (results for each sample stored within here)
'''

if config["pre_stringtie_processing"] == "none":
return options[sample]["path"]

else:
raise ValueError("{} is invalid value for 'pre_stringtie_processing' option - please use 'none'".format(config["pre_stringtie_processing"]))

def get_sample_condition(sample, options):
'''
Return condition for given sample from options dict (sample table)
'''

return options[sample]["condition"]

def get_condition_samples(condition, options):

return [sample for sample in options.keys() if options[sample]["condition"] == condition]



# def get_stringtie_assembled(sample, output_dir):
# '''
# Return path to target StringTie transcriptome assembly
#
# Want functionality to provide a range of parameter values in same pipeline
# and Snakemake's Paramspace docs aren't quite cutting it right now...
#
# If provide a list for given parameter, will perform assembly for each combo of values
# min_isoform_fraction_abundance (-f)
# min_junction_reads (-j)
# min_transcript_coverage (-c) (minimum reads per bp coverage)
# To be added: disable_end_trimming (-t), point-features (--ptf)
# '''
#
# if isinstance(list(), config["min_isoform_fraction_abundance"]):

58 changes: 47 additions & 11 deletions config/cluster.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -4,33 +4,39 @@ __default__:
h_rt: 2:00:00
submission_string: " "

filter_ref_gtf:
h_vmem: 8G
tmem: 8G
h_rt: 4:00:00
submission_string: " "

stringtie:
h_vmem: 14G
tmem: 14G
h_rt: 8:00:00
submission_string: " "

intron_chain_filter:
get_novel_last_exons:
h_vmem: 10G
tmem: 10G
h_rt: 8:00:00
submission_string: " "

transcript_filtering_tpm:
h_vmem: 8G
tmem: 8G
h_vmem: 12G
tmem: 12G
h_rt: 8:00:00
submission_string: " "

transcript_filtering_chain_3p:
h_vmem: 14G
tmem: 14G
transcript_filtering_3p:
h_vmem: 12G
tmem: 12G
h_rt: 8:00:00
submission_string: " "

three_end_filter:
h_vmem: 6G
tmem: 6G
h_vmem: 8G
tmem: 8G
h_rt: 4:00:00
submission_string: " "

Expand All @@ -40,9 +46,21 @@ merge_by_condition:
h_rt: 4:00:00
submission_string: " "

merge_filtered_with_ref:
h_vmem: 6G
tmem: 6G
combine_novel_by_condition:
h_vmem: 8G
tmem: 8G
h_rt: 4:00:00
submission_string: " "

combine_novel_filtered_by_condition:
h_vmem: 8G
tmem: 8G
h_rt: 4:00:00
submission_string: " "

get_combined_quant_gtf:
h_vmem: 8G
tmem: 8G
h_rt: 4:00:00
submission_string: " "

Expand All @@ -69,3 +87,21 @@ salmon_quant_pe:
tmem: 14G
submission_string: "-pe smp 2 -R y"
h_rt: 12:00:00

salmon_quant_se:
h_vmem: 12G
tmem: 12G
submission_string: "-pe smp 2 -R y"
h_rt: 12:00:00

tx_to_le_quant:
h_vmem: 8G
tmem: 8G
h_rt: 4:00:00
submission_string: " "

dexseq_apa:
h_vmem: 10G
tmem: 10G
submission_string: "-pe smp 2 -R y"
h_rt: 12:00:00
Loading

0 comments on commit 04a10a0

Please sign in to comment.