Skip to content

A pipeline to find & extract 4-fold degenerate sites on a SLURM cluster

Notifications You must be signed in to change notification settings

mattheatley/extract_4fds

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 

Repository files navigation

READ ME

This software finds 4-fold degenerate sites (4fds) within an annotated fasta sequence & extracts them from a vcf file on a SLURM cluster.



Requirements
  
  - python 3.7.3; os, sys, subprocess, shutil, re, argparse, math
  - samtools 1.9
  - gatk4 4.1.4.0
  
N.B. 
 If software is installed within a conda environment the environment name must be supplied at each stage (see below).



SETUP

Create the initial default/user-specified working & sub-directories.

  python pipe_4fds.py -setup [-p </path/to>] [-d <directory_name>]



PREPARE FILES

Move the fasta, gff & (if also extracting sites) vcf files to the relevant sub-directories.

i.e.

  /ref.gen
    sequence.fasta
    
  /gff
    annotation.gff

  /vcf
    variants.vcf.gz



INDEX

Index reference genome.

  python pipe_4fds.py -index



FIND SITES

Find 4fds coordinates within the fasta sequence. 

  python pipe_4fds.py -find
  
N.B. 
 Two 4fds categories are reported; (i) all 4fds & (ii) those 4fds within all transcripts of a given gene.
 4fds within codons disrupted by introns (i.e. phase-1 & phase-2) are also identified.
 Any unassigned CDS/mRNA (i.e. parent not found) or ambiguous transcripts (i.e. indiscernible codons) are ignored.
 (see unassigned_CDS, unassigned_mRNA & ignore_dtranscripts respectively in the output directory)


EXTRACT SITES

Extract 4fds for individual scaffolds from a vcf file. 

  python pipe_4fds.py -extract {all|consistent}




MERGE EXTRACTED SITES

Merge extracted per-scaffold 4fds vcfs into a genome-wide 4fds vcf. 

  python pipe_4fds.py -merge {all|consistent}
  
  

ADDITIONAL SETTINGS

  Flag                    Default             Description

  -pa <partition>         stage-specific      specify SLURM partition
  -no <nodes>             stage-specific      specify SLURM nodes
  -nt <ntasks>            stage-specific      specify SLURM ntasks
  -me <memory[units]>     stage-specific      specify SLURM memory
  -wt <HH:MM:SS>          stage-specific      specify SLURM walltime
 
  -d <diretory_name>      ngs_pipe            specify working directory name
  -p </path/to>           home path           specify working directory path
  -u <user_name>          current user        specify SLURM user name
  -m <email_address>      None                specify email address to receive SLURM notifications
  -e <environment_name>   ngs_env             specify conda environment with software installed 
  -l <limit>              100                 specify concurrent task submission limit
  
  -pipe                                       submit pipeline itself; submits new tasks up to the limit from the hpcc 
  

About

A pipeline to find & extract 4-fold degenerate sites on a SLURM cluster

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages