v0.2.0 (#39)

* filter_tx_by_cond - filters individual GTF files. Currently untested * initial commit new script to extract candidate novel last exons * get_novel_last_exons - _df_add not _df_annotate (script still untested * functions to id extension & spliced events (untested) * updates to get script working (internal seem to be missing?) * Fix logic of extract_format_exons_introns. Fix event type classification (don't add region rank to novel obj) * use read_gtf_restricted -saves ~30s reading in ref GTF vs full * if tx has multiple event types assigned, collapse to ',' sep string * output spliced last exons not last SJs * output spliced last exons not last SJs * skeleton of merge last exons script * working script to merge last_exon GTFs, assign sample & last_exon IDs * Script starts with last exons GTF, remove input match stats TSV. Report motif dist from 3'end. Select 3'end with min deviation from exptd position for motif only. Works at CL * select repr atlas site. Update 3'end to nearest atlas. Output updated/selected les to gtf, summary stats are per last exon ID * update function name for clarity of task * fnc to read gtf with extraction of custom GTF attributes only * initial commit get_quant_gtf - getting annoying read_gtf error. _fetch_attributes regex needs fixing (match space) * update conda env * get le ids for ext & no ext genes. Internal extension unique regions be weird... * output GTFs and tx2le, le2gene dfs * fix regex in _fetch_attributes - extracts correctly if has suffixed cols * add command line args * add option to not add last exon ID col * fixes to filter_tx_by_condition_tpm.py * I give up * fix check_stranded to ensure outputs stranded gr * list of conditions without duplicates * fix check_per_sample_mean_tpm_filtered. Some log name corrections * output 'not_found' string if no motifs founds - empty str attributes in GTF cause read error in future step (bug) * check_concat to ensure all dfs of gr have same columns * manually set strandedness for pr.subtract (bug in PyRanges) * Report n IDs dropped due to ref containment. sort assign dfs by le_id * tx_to_polya_quant.R - update CL help strings * rename ref_gene_id to gene_id in tx2gene, le2gene output * remove hardcoding of expected distance from 3'end * get_novel_last_exons.py - enforce strandedness="same" for pr.join calls * scripts/tx_to_polya_quant.R - fix variable name reference * Reorganise output so consistent x folders. Label stie tx_ids with sample ID (prevent edge case of same ID x samples (causes salmon index error)) * update cluster & rule group assignments * Output PPAU & gene total TPM matrices. Update R conda env. Set all count matrices as pipeline targets * options to switch on/off 5'end matching (exts & spliced). Exclude extension failing filter events from check for spliced events * output last exons not last 100nt. More strict on retained cols. Add 3p dist for min deviation motif as attribute. Report pairing counts for atlas/motif filters. * add flags to switch on/off 5'end matching (SJs) & 3'end matching * add new ward i3 2022 sample table * add ward i3 2022 sample table with reprocessed bams * remove duplicate Start -1 in read_gtf_specific (main repo PR 260) * correct config key for point features file * add option to make point features from standard BED * add function to collapse metadata cols for given id column * collapse_metadata - add check for collapse_cols found in individual df * get_novel_last_exons - try to fill NAs before output as GTF (causes parsing errors downstream) * get_novel_last_exons - Stop empty strings being output in attribute column * add option to collapse metadata cols whilst dropping duplicates * Add check for n column consistency in all dfs of PyRanges * drop 3'end filtering group rule... * fix bug with ','-sep novel ref_gene_ids strings being treated as separate genes to ref. Output info df with event types, coordinates, annot status etc. * Add checks for empty gr (for multi-ref gene matching) * drop 1-isoform genes with ref_gene_id not gene_id (novel isos dropped) * update cli to le2gene (and help message) * command line script to run basic diff usage analysis * quantify single-end files with Salmon * Optionally treat specifically labelled reference tx IDs as novel extensions * remove refs to multiple combos of stringtie parameters * initial commit generate test GTFs and genome FASTA * ensure txipts have/don't have PAS motifs as req. Output txipts FASTA * tidy up tx names etc. for test trs * script to generate simulated FQs for test transcripts * Generate test PAS atlas BED file * add tiny test data to repo (successful test run). Add check for consistent ncols in filter_tx_by_three_end * read_gtf_specific checks order of keys before extracting from attribute (prevents some parsing bugs) * report gene name in output GTF (as 'gene_name_ref') * report gene name in combined quant GTF. Also output le2genename TSV. Remove a few unnecessary keys from attribute output * quick note in config on hardcoded label * report summary dbrn of distances between predicted 3'ends and nearest atlas PAS * classify event types for annotated last exons. Report coords of full LE in .info.txt. Format with black * add diff usage with saturn script/option to pipeline (works with test data) * update gitignore to ignore all suffixed (default) test data output dirs * initial commit script to tidy up saturn tbl & add annot & ppau info * add process saturn results tbl script to pipeline (works with test data) * (incomplete) updates to readme & other documentation * initial commit - untested script to generate quant GTF from reference only. Add option to pipeline to only quantify * bug fix in annotate_le_ids_ref to get successful test run * fix annotation of event_types for ref events * add gene with just ref txipts to test data. Add a tx with a 'ref_extension_string' * update test data (reads & alignment) to include gene with just ref ALEs. * update test data - alignment mistake missed out chr3 reads. Also update FCs * option to skip identifying novel LEs & running differential testing. fix treating specific ref txs as separate les. All runs successful with test data * fix check for no extensions with provided string to raise exception properly * fix collapsing annotation info for le_ids bug. le_ids with inconsistent unique vals are not collapsed * update gitignore * fix gene_id_ref collapsing in case of NaNs * fix reporting count of events with ref_extension_string * Add option to use input GTF of last exons to combine with ref GTF. Extra checks on boolean flags for pipeline specification. Successful dry run * add script to combine & annotate predicted last exons from multiple experiments into a single gtf * Add an explicit check for hyphen characters in sample_name IDs from sample_tbl before declaring pipeline run * document requirement for no hyphens in sample_name column of sample table * optionally pass bias/other cl flags to salmon quant * dryrun - option to use precomputed salmon index and skip redundant construction * pass config file as command line arg * provide conda prefix as command line arg * correct salmon index path in params if use precomputed index * example of saturn output * Replace SatuRn with DEXSeq (#47) * switch saturn for dexseq in differential_apa * remove mentions to saturn and associated scripts * add dexseq_apa to cluster.yaml * Clean up README * update readme * update docs on quant & differential output files * add note on specifying base condition in sample table * add docs on final combined novel GTF * add docs on combined GTFs * notes on ID mapping files. Remove redundancy for event_type descriptions * remove sge profile (not working) * minor updates to README * remove deprecated parameters * link to example files * remove empty md * remove deprecated scripts * add license information * tiny formatting change
frattalab · Jan 19, 2024 · 04a10a0 · 04a10a0
1 parent e2f7a81
commit 04a10a0
Show file tree

Hide file tree

Showing 92 changed files with 13,827 additions and 13,932 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,4 +1,3 @@
-
 *.log
 .snakemake/.DS_Store
 .DS_Store
@@ -12,3 +11,13 @@ data/annotation/
 data/Cont*
 data/TDP43*
 scripts/.ipynb_checkpoints/
+.Rproj.user
+test_data_output*
+tests/test_aligned/STAR_aligned/*.out
+tests/test_aligned/STAR_aligned/*.out.t*
+tests/test_aligned/STAR_aligned/*_STAR*
+tests/test_aligned/fastp_trimmed/*
+tests/test_aligned/logs/*
+tests/test_aligned/qc/*
+tests/test_aligned/salmon_quant/*
+debug*
diff --git a/LICENSE b/LICENSE
diff --git a/PAPA.Rproj b/PAPA.Rproj
@@ -0,0 +1,13 @@
+Version: 1.0
+
+RestoreWorkspace: Default
+SaveWorkspace: Default
+AlwaysSaveHistory: Default
+
+EnableCodeIndexing: Yes
+UseSpacesForTab: Yes
+NumSpacesForTab: 2
+Encoding: UTF-8
+
+RnwWeave: Sweave
+LaTeX: pdfLaTeX
diff --git a/README.md b/README.md
diff --git a/Snakefile b/Snakefile
@@ -1,149 +1,109 @@
-import os
-import pandas as pd
+#     PAPA Snakemake pipeline
+#     Copyright (C) 2024  Sam Bryce-Smith [email protected]
 
-configfile: "config/config.yaml"
+#     This program is free software: you can redistribute it and/or modify
+#     it under the terms of the GNU General Public License as published by
+#     the Free Software Foundation, either version 3 of the License, or
+#     (at your option) any later version.
 
-def param_list(param):
-    '''
-    Return list of all param values converted to string
-    If param is not a list/iterable, coerced to a single value list
-    '''
+#     This program is distributed in the hope that it will be useful,
+#     but WITHOUT ANY WARRANTY; without even the implied warranty of
+#     MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#     GNU General Public License for more details.
 
-    try:
-        param = list(param)
-        out = [str(p) for p in param]
+#     You should have received a copy of the GNU General Public License
+#     along with this program.  If not, see <https://www.gnu.org/licenses/>.
 
-    except TypeError:
-        # Not an iterable
-        out = [str(param)]
-
-    return out
 
+import pandas as pd
+import os
+import sys
 
-assert config["expression_merge_by"] in ["polyA", "last_exon"], f"'expression_merge_by' must be one of 'polyA' or 'last_exon' - {config['expression_merge_by']} was passed"
+configfile: "config/config.yaml"
 
+include: "rules/parse_config.py"
 
 sample_tbl = pd.read_csv(config["sample_tbl"], index_col="sample_name")
 
+# Check fastq2, if all empty then dataset is single end
+if sample_tbl['fastq2'].isna().all():
+    single_end = True
+else:
+    single_end = False
+
+# Now double check that sample_name fields do not contain hyphens ('-') - will break column selection with R
+if sample_tbl.index.str.contains("-", regex=False).any():
+    raise Exception(f"Values in 'sample_name' column of sample table must not contain hyphen characters ('-')")
+
 SAMPLES = sample_tbl.index.tolist()
-CONDITIONS = sample_tbl["condition"].tolist()
+CONDITIONS = sample_tbl["condition"].drop_duplicates().tolist()
 OPTIONS = sample_tbl.to_dict(orient="index")
 
 GTF = config["annotation_gtf"]
 
 # Make sure it has a slash at end of path
 OUTPUT_DIR = os.path.join(config["main_output_dir"], "")
 STRINGTIE_SUBDIR = os.path.join(OUTPUT_DIR, config["stringtie_subdir_name"], "")
+TX_FILT_SUBDIR = os.path.join(OUTPUT_DIR, config["tx_filtering_subdir_name"], "")
 SALMON_SUBDIR = os.path.join(OUTPUT_DIR, config["salmon_subdir_name"], "")
 LOG_SUBDIR = os.path.join(OUTPUT_DIR, config["logs_subdir_name"], "")
+BMARK_SUBDIR = os.path.join(OUTPUT_DIR, config["benchmarks_subdir_name"], "")
 DAPA_SUBDIR = os.path.join(OUTPUT_DIR, config["diff_apa_subdir_name"], "")
 
-min_frac_vals = param_list(config["min_isoform_fraction_abundance"])
-min_jnc_vals = param_list(config["min_junction_reads"])
-min_cov_vals = param_list(config["min_txipt_coverage"])
 
-# print(OPTIONS)
-# print(min_frac_vals)
-# print(min_jnc_vals)
-# print(min_cov_vals)
+## Double check switches for workflow control are valid
 
+# First check all are True/False booleans
+assert isinstance(config["run_identification"], bool), f"'run_identification' must be True/False boolean, {config['run_identification']} (type {type(config['run_identification'])}) was provided"
+assert isinstance(config["run_differential"], bool), f"'run_differential' must be True/False boolean, {config['run_differential']} (type {type(config['run_differential'])}) was provided"
+assert isinstance(config["filter_ref_gtf"], bool), f"'filter_ref_gtf' must be True/False boolean, {config['filter_ref_gtf']} (type {type(config['filter_ref_gtf'])}) was provided"
+assert isinstance(config["use_provided_novel_les"], bool), f"'use_provided_novel_les' must be True/False boolean, {config['use_provided_novel_les']} (type {type(config['use_provided_novel_les'])}) was provided"
+assert isinstance(config["use_precomputed_salmon_index"], bool), f"'use_precomputed_salmon_index' must be True/False boolean, {config['use_precomputed_salmon_index']} (type {type(config['use_precomputed_salmon_index'])}) was provided"
+
+# Now double check that valid combination of use_provided_novel_les & run_identification is provided (if applicable)
+if config["use_provided_novel_les"]:
+    assert config["use_provided_novel_les"] and config["run_identification"], f"'use_provided_novel_les' is set to True but 'run_identification' is not. To continue using input novel last exons please set 'run_identification' to True"
+
+# make a note if run_identification & use_precomputed_salmon_index are both True - run_id will be overrided and provided files used
+if config["run_identification"] and config["use_precomputed_salmon_index"]:
+    sys.stderr.write("run_identification will be overridden and pre-provided salmon index, id and info tables will be used\n")
+
+
+# If differential, make sure that sample table only has two conditions & to set a contrast name
+if config["run_differential"]:
+    assert sample_tbl["condition"].nunique() == 2, f"condition column in sample table must only contain two distinct conditions, following n found - {sample_tbl['condition'].nunique()}"
+    # firs key in sample table condition column = base_key
+    BASE_KEY = sample_tbl["condition"][0]
+    CONTRAST_KEY = sample_tbl.loc[sample_tbl["condition"] != BASE_KEY, "condition"][0]
+    CONTRAST_NAME = CONTRAST_KEY + "vs" + BASE_KEY
+    sys.stderr.write(f"Inferred base key for condition - {BASE_KEY}\n")
+    sys.stderr.write(f"Inferred contrast key for condition - {CONTRAST_KEY}\n")
+    sys.stderr.write(f"Constructed contrast name - {CONTRAST_NAME}\n")
+
+
+include: "rules/filter_gtf.smk"
 include: "rules/stringtie.smk"
-include: "rules/salmon.smk"
 include: "rules/tx_filtering.smk"
+include: "rules/salmon.smk"
 include: "rules/differential_apa.smk"
 
+# sys.stderr.write(OPTIONS + "\n")
 
-localrules: all, gtf_list_by_condition, gtf_list_all_tpm_filtered
+localrules: all, gtf_list_by_condition, gtf_list_all_tpm_filtered, check_per_sample_mean_tpm_filtered, make_formulas_txt
 
 wildcard_constraints:
     sample = "|".join(SAMPLES),
     condition = "|".join(CONDITIONS)
 
 rule all:
     input:
-        # expand(os.path.join(STRINGTIE_SUBDIR,
-        #                     "min_jnc_{min_jnc}",
-        #                     "min_frac_{min_frac}",
-        #                     "min_cov_{min_cov}",
-        #                     # "{condition}",
-        #                     "{condition}.min_mean_tpm_filtered.gtf"),
-        #        condition=CONDITIONS,
-        #        min_jnc=min_jnc_vals,
-        #        min_frac=min_frac_vals,
-        #        min_cov=min_cov_vals),
-        expand(os.path.join(SALMON_SUBDIR,
-                             "pas_quant",
-                             "min_jnc_{min_jnc}",
-                             "min_frac_{min_frac}",
-                             "min_cov_{min_cov}",
-                             "summarised_pas_quantification.tsv"),
-               min_jnc=min_jnc_vals,
-               min_frac=min_frac_vals,
-               min_cov=min_cov_vals),
-        # expand(os.path.join(SALMON_SUBDIR, "quant", "min_jnc_{min_jnc}", "min_frac_{min_frac}", "min_cov_{min_cov}", "{sample}","quant.sf"),
+        rules.process_dexseq_tbl.output if config["run_differential"] else rules.tx_to_le_quant.output.ppau,
+        rules.tx_to_le_quant.output.counts,
+        os.path.join(DAPA_SUBDIR,
+                     "summarised_pas_quantification.tpm.tsv"),
+        os.path.join(DAPA_SUBDIR,
+                     "summarised_pas_quantification.gene_tpm.tsv"),
+        # expand(os.path.join(SALMON_SUBDIR, "quant", "{sample}", "quant.sf"),
         #        sample=SAMPLES,
-        #        min_jnc=min_jnc_vals,
-        #        min_frac=min_frac_vals,
-        #        min_cov=min_cov_vals
         #        ),
-        # expand(os.path.join(STRINGTIE_SUBDIR,
-        #                     "min_jnc_{min_jnc}",
-        #                     "min_frac_{min_frac}",
-        #                     "min_cov_{min_cov}",
-        #                     "tpm_filtered.intron_chain_filtered.3p_end_filtered.all_samples.combined.gtf"),
-        #        min_jnc=min_jnc_vals,
-        #        min_frac=min_frac_vals,
-        #        min_cov=min_cov_vals)
-
-
-def get_bam(sample, options, output_dir):
-    '''
-    Returns path to input bam file for given sample
-    If sample will undergo additional processing (not yet implemented), path will be output_dir/<processing_step>/{sample}.bam
-    If sample will not go additional processing, returns the path provided in the sample table/options
-
-    params:
-        sample <str>
-        name of sample (in pipeline context should usually pass as wildcards.sample)
-
-        options <dict>
-        dict of {sample: {param1: x, param2: y}} generated from input sample table
-
-        output_dir <str>
-        path to main output directory (results for each sample stored within here)
-    '''
-
-    if config["pre_stringtie_processing"] == "none":
-        return options[sample]["path"]
-
-    else:
-        raise ValueError("{} is invalid value for 'pre_stringtie_processing' option - please use 'none'".format(config["pre_stringtie_processing"]))
-
-def get_sample_condition(sample, options):
-    '''
-    Return condition for given sample from options dict (sample table)
-    '''
-
-    return options[sample]["condition"]
-
-def get_condition_samples(condition, options):
-
-    return [sample for sample in options.keys() if options[sample]["condition"] == condition]
-
-
-
-# def get_stringtie_assembled(sample, output_dir):
-#     '''
-#     Return path to target StringTie transcriptome assembly
-#
-#     Want functionality to provide a range of parameter values in same pipeline
-#     and Snakemake's Paramspace docs aren't quite cutting it right now...
-#
-#     If provide a list for given parameter, will perform assembly for each combo of values
-#     min_isoform_fraction_abundance (-f)
-#     min_junction_reads (-j)
-#     min_transcript_coverage (-c) (minimum reads per bp coverage)
-#     To be added: disable_end_trimming (-t), point-features (--ptf)
-#     '''
-#
-#     if isinstance(list(), config["min_isoform_fraction_abundance"]):
+
diff --git a/config/cluster.yaml b/config/cluster.yaml
@@ -4,33 +4,39 @@ __default__:
   h_rt:  2:00:00
   submission_string: " "
 
+filter_ref_gtf:
+  h_vmem: 8G
+  tmem: 8G
+  h_rt:  4:00:00
+  submission_string: " "
+
 stringtie:
   h_vmem: 14G
   tmem: 14G
   h_rt:  8:00:00
   submission_string: " "
 
-intron_chain_filter:
+get_novel_last_exons:
   h_vmem: 10G
   tmem: 10G
   h_rt:  8:00:00
   submission_string: " "
 
 transcript_filtering_tpm:
-  h_vmem: 8G
-  tmem: 8G
+  h_vmem: 12G
+  tmem: 12G
   h_rt:  8:00:00
   submission_string: " "
 
-transcript_filtering_chain_3p:
-  h_vmem: 14G
-  tmem: 14G
+transcript_filtering_3p:
+  h_vmem: 12G
+  tmem: 12G
   h_rt:  8:00:00
   submission_string: " "
 
 three_end_filter:
-  h_vmem: 6G
-  tmem: 6G
+  h_vmem: 8G
+  tmem: 8G
   h_rt:  4:00:00
   submission_string: " "
 
@@ -40,9 +46,21 @@ merge_by_condition:
   h_rt:  4:00:00
   submission_string: " "
 
-merge_filtered_with_ref:
-  h_vmem: 6G
-  tmem: 6G
+combine_novel_by_condition:
+  h_vmem: 8G
+  tmem: 8G
+  h_rt:  4:00:00
+  submission_string: " "
+
+combine_novel_filtered_by_condition:
+  h_vmem: 8G
+  tmem: 8G
+  h_rt:  4:00:00
+  submission_string: " "
+
+get_combined_quant_gtf:
+  h_vmem: 8G
+  tmem: 8G
   h_rt:  4:00:00
   submission_string: " "
 
@@ -69,3 +87,21 @@ salmon_quant_pe:
   tmem: 14G
   submission_string: "-pe smp 2 -R y"
   h_rt: 12:00:00
+
+salmon_quant_se:
+  h_vmem: 12G
+  tmem: 12G
+  submission_string: "-pe smp 2 -R y"
+  h_rt: 12:00:00
+
+tx_to_le_quant:
+  h_vmem: 8G
+  tmem: 8G
+  h_rt:  4:00:00
+  submission_string: " "
+
+dexseq_apa:
+  h_vmem: 10G
+  tmem: 10G
+  submission_string: "-pe smp 2 -R y"
+  h_rt: 12:00:00