Commit
* filter_tx_by_cond - filters individual GTF files. Currently untested
* initial commit new script to extract candidate novel last exons
* get_novel_last_exons - _df_add not _df_annotate (script still untested)
* functions to id extension & spliced events (untested)
* updates to get script working (internal seem to be missing?)
* Fix logic of extract_format_exons_introns. Fix event type classification (don't add region rank to novel obj)
* use read_gtf_restricted - saves ~30s reading in ref GTF vs full
* if tx has multiple event types assigned, collapse to ',' sep string
* output spliced last exons not last SJs
* output spliced last exons not last SJs
* skeleton of merge last exons script
* working script to merge last_exon GTFs, assign sample & last_exon IDs
* Script starts with last exons GTF, remove input match stats TSV. Report motif dist from 3'end. Select 3'end with min deviation from exptd position for motif only. Works at CL
* select repr atlas site. Update 3'end to nearest atlas. Output updated/selected les to gtf, summary stats are per last exon ID
* update function name for clarity of task
* fnc to read gtf with extraction of custom GTF attributes only
* initial commit get_quant_gtf - getting annoying read_gtf error. _fetch_attributes regex needs fixing (match space)
* update conda env
* get le ids for ext & no ext genes. Internal extension unique regions be weird...
* output GTFs and tx2le, le2gene dfs
* fix regex in _fetch_attributes - extracts correctly if has suffixed cols
* add command line args
* add option to not add last exon ID col
* fixes to filter_tx_by_condition_tpm.py
* I give up
* fix check_stranded to ensure outputs stranded gr
* list of conditions without duplicates
* fix check_per_sample_mean_tpm_filtered. Some log name corrections
* output 'not_found' string if no motifs found - empty str attributes in GTF cause read error in future step (bug)
* check_concat to ensure all dfs of gr have same columns
* manually set strandedness for pr.subtract (bug in PyRanges)
* Report n IDs dropped due to ref containment. sort assign dfs by le_id
* tx_to_polya_quant.R - update CL help strings
* rename ref_gene_id to gene_id in tx2gene, le2gene output
* remove hardcoding of expected distance from 3'end
* get_novel_last_exons.py - enforce strandedness="same" for pr.join calls
* scripts/tx_to_polya_quant.R - fix variable name reference
* Reorganise output so consistent x folders. Label stie tx_ids with sample ID (prevent edge case of same ID x samples (causes salmon index error))
* update cluster & rule group assignments
* Output PPAU & gene total TPM matrices. Update R conda env. Set all count matrices as pipeline targets
* options to switch on/off 5'end matching (exts & spliced). Exclude extension failing filter events from check for spliced events
* output last exons not last 100nt. More strict on retained cols. Add 3p dist for min deviation motif as attribute. Report pairing counts for atlas/motif filters.
* add flags to switch on/off 5'end matching (SJs) & 3'end matching
* add new ward i3 2022 sample table
* add ward i3 2022 sample table with reprocessed bams
* remove duplicate Start -1 in read_gtf_specific (main repo PR 260)
* correct config key for point features file
* add option to make point features from standard BED
* add function to collapse metadata cols for given id column
* collapse_metadata - add check for collapse_cols found in individual df
* get_novel_last_exons - try to fill NAs before output as GTF (causes parsing errors downstream)
* get_novel_last_exons - Stop empty strings being output in attribute column
* add option to collapse metadata cols whilst dropping duplicates
* Add check for n column consistency in all dfs of PyRanges
* drop 3'end filtering group rule...
* fix bug with ','-sep novel ref_gene_ids strings being treated as separate genes to ref. Output info df with event types, coordinates, annot status etc.
* Add checks for empty gr (for multi-ref gene matching)
* drop 1-isoform genes with ref_gene_id not gene_id (novel isos dropped)
* update cli to le2gene (and help message)
* command line script to run basic diff usage analysis
* quantify single-end files with Salmon
* Optionally treat specifically labelled reference tx IDs as novel extensions
* remove refs to multiple combos of stringtie parameters
* initial commit generate test GTFs and genome FASTA
* ensure txipts have/don't have PAS motifs as req. Output txipts FASTA
* tidy up tx names etc. for test trs
* script to generate simulated FQs for test transcripts
* Generate test PAS atlas BED file
* add tiny test data to repo (successful test run). Add check for consistent ncols in filter_tx_by_three_end
* read_gtf_specific checks order of keys before extracting from attribute (prevents some parsing bugs)
* report gene name in output GTF (as 'gene_name_ref')
* report gene name in combined quant GTF. Also output le2genename TSV. Remove a few unnecessary keys from attribute output
* quick note in config on hardcoded label
* report summary dbrn of distances between predicted 3'ends and nearest atlas PAS
* classify event types for annotated last exons. Report coords of full LE in .info.txt. Format with black
* add diff usage with saturn script/option to pipeline (works with test data)
* update gitignore to ignore all suffixed (default) test data output dirs
* initial commit script to tidy up saturn tbl & add annot & ppau info
* add process saturn results tbl script to pipeline (works with test data)
* (incomplete) updates to readme & other documentation
* initial commit - untested script to generate quant GTF from reference only. Add option to pipeline to only quantify
* bug fix in annotate_le_ids_ref to get successful test run
* fix annotation of event_types for ref events
* add gene with just ref txipts to test data. Add a tx with a 'ref_extension_string'
* update test data (reads & alignment) to include gene with just ref ALEs.
* update test data - alignment mistake missed out chr3 reads. Also update FCs
* option to skip identifying novel LEs & running differential testing. fix treating specific ref txs as separate les. All runs successful with test data
* fix check for no extensions with provided string to raise exception properly
* fix collapsing annotation info for le_ids bug. le_ids with inconsistent unique vals are not collapsed
* update gitignore
* fix gene_id_ref collapsing in case of NaNs
* fix reporting count of events with ref_extension_string
* Add option to use input GTF of last exons to combine with ref GTF. Extra checks on boolean flags for pipeline specification. Successful dry run
* add script to combine & annotate predicted last exons from multiple experiments into a single gtf
* Add an explicit check for hyphen characters in sample_name IDs from sample_tbl before declaring pipeline run
* document requirement for no hyphens in sample_name column of sample table
* optionally pass bias/other cl flags to salmon quant
* dryrun - option to use precomputed salmon index and skip redundant construction
* pass config file as command line arg
* provide conda prefix as command line arg
* correct salmon index path in params if use precomputed index
* example of saturn output
* Replace SatuRn with DEXSeq (#47)
* switch saturn for dexseq in differential_apa
* remove mentions to saturn and associated scripts
* add dexseq_apa to cluster.yaml
* Clean up README
* update readme
* update docs on quant & differential output files
* add note on specifying base condition in sample table
* add docs on final combined novel GTF
* add docs on combined GTFs
* notes on ID mapping files. Remove redundancy for event_type descriptions
* remove sge profile (not working)
* minor updates to README
* remove deprecated parameters
* link to example files
* remove empty md
* remove deprecated scripts
* add license information
* tiny formatting change