A repository to store scripts identifying features such as repeats and editing in nucleotide sequences.
A script to identify anchors for which all targets 2...n are differentiated from target 1 via a single kind of single-nucleotide variation, i.e. only A to G mutations or only C to T mutations.
Usage: sbatch edit_caller.sh ${inputDt} ${intermediateOutDt} ${outDt}
1. inputDt - a file having columns anchor, target or extendor, dataset, anchor_count, target_count.
2. intermediateOutDt - intermediate output reporting, for each (anchor, dataset) pair having edit calls: anchor, dataset, target, the edited base (edited_from), the resulting base (edited_to), and the number of edited positions for each target (num_edits).
3. outDt - final output, reporting the fields in intermediateOutDt, left-merged onto the input.
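The edit-calling criterion described above can be sketched in Python as follows. This is a minimal illustration and not the logic inside edit_caller.sh: the function name call_edits is hypothetical, and the sketch assumes that all targets of an anchor have the same length and are compared position by position.

```python
def call_edits(targets):
    """Return (edited_from, edited_to, num_edits) when every target 2...n
    differs from target 1 by one shared kind of substitution, else None.

    Assumption (not from the original scripts): targets are equal-length
    strings and are compared position by position.
    """
    reference, others = targets[0], targets[1:]
    kinds = set()      # distinct (from_base, to_base) substitutions observed
    num_edits = []     # number of edited positions for each target 2...n
    for target in others:
        if len(target) != len(reference):
            return None
        mismatches = [(r, t) for r, t in zip(reference, target) if r != t]
        if not mismatches:
            return None  # identical to target 1, so not differentiated
        kinds.update(mismatches)
        num_edits.append(len(mismatches))
    if len(kinds) != 1:
        return None      # no targets, or more than one kind of substitution
    edited_from, edited_to = next(iter(kinds))
    return edited_from, edited_to, num_edits
```

For example, call_edits(["ACAGA", "GCAGA", "ACGGG"]) reports an A to G edit with one and two edited positions, while a target set mixing A to G and C to T changes returns None.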
A script to identify nonoverlapping repeats occurring in nucleotide sequences. The largest and left-most k-mer is reported such that, without overlapping, the k-mer appears at least once elsewhere in the compactor sequence with Hamming distance <= 2 to its left-most occurrence. If there are multiple such k-mers, those having the lowest Hamming distance and furthest to the left are selected and their sequence coordinates 'blocked out'. This continues until no further k-mers can be added to the reported set without introducing overlaps or exceeding the Hamming distance threshold of 2.
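A simplified sketch of this search, in Python, may help make the description concrete. It is not the code behind generic_repeat.sh: the function names and the min_k parameter are hypothetical, occurrences are accepted greedily from left to right, and ties are not broken by lowest Hamming distance as the full procedure describes.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_repeat(seq, max_dist=2, min_k=3):
    """Return (kmer, positions, distances) for the largest, then left-most,
    k-mer with at least one nonoverlapping downstream occurrence within
    max_dist mismatches, or None if there is no such k-mer.

    Assumptions (hypothetical): min_k is an illustrative lower bound on k,
    and downstream occurrences are taken greedily, left to right.
    """
    n = len(seq)
    for k in range(n // 2, min_k - 1, -1):       # largest k first
        for start in range(n - 2 * k + 1):       # left-most start first
            kmer = seq[start:start + k]
            positions, distances = [start], []
            pos = start + k                      # skip the k-mer itself
            while pos + k <= n:
                d = hamming(kmer, seq[pos:pos + k])
                if d <= max_dist:
                    positions.append(pos)
                    distances.append(d)
                    pos += k                     # block out this occurrence
                else:
                    pos += 1
            if distances:
                return kmer, positions, distances
    return None
```

With the sequence ACGTTTACGT, this sketch reports the 4-mer ACGT at 0-indexed positions 0 and 6 with a Hamming distance of 0. As in the output described for the script, the left-most occurrence is not compared to itself.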
Usage: sbatch generic_repeat.sh ${inputDt} ${outputDt} ${sequence_column_name}
1. inputDt - A file having at minimum the column sequence_column_name.
2. outputDt - inputDt, modified to have the following columns:
a. generic_repeat: the 'left-most' repeat sequence;
b. generic_repeat_size: the repeat's k-mer size;
c. generic_repeat_positions: the 0-indexed sequence coordinates at which the repeat appears;
d. generic_repeat_Hamming_distances: the Hamming distances of the repeats whose positions are reported in generic_repeat_positions;
e. generic_repeat_mean_Hamming_distance: the mean of the reported Hamming distances. Note that Hamming distances are not reported circularly, or in comparing the left-most repeat occurrence to itself;
f. generic_repeat_Xmer_entropy: for X in [3,4,5], the Shannon entropy of the distribution of k-mer counts when tiled from the repeat.
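The entropy column can be illustrated with a short Python sketch. The function name kmer_entropy is hypothetical, and two details are assumptions rather than statements about the script: k-mers are tiled as overlapping windows with a step of one base, and entropy is measured in bits (log base 2).

```python
import math
from collections import Counter


def kmer_entropy(seq, k):
    """Shannon entropy (in bits) of the k-mer count distribution of seq.

    Assumptions (hypothetical): k-mers are overlapping windows with a step
    of 1, and a sequence shorter than k has an entropy of 0.0.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Entropy for X in [3, 4, 5], mirroring generic_repeat_Xmer_entropy.
repeat = "ACGTACGTAC"
entropies = {x: kmer_entropy(repeat, x) for x in (3, 4, 5)}
```

A homopolymer such as AAAAAA has a single distinct 3-mer and therefore an entropy of 0, while a repeat whose k-mers are all distinct reaches the maximum for its length.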
A script to identify nonoverlapping repeats occurring at a fixed interval in nucleotide sequences. The largest and left-most k-mer is reported such that, at a fixed interval, the k-mer occurs downstream in the sequence at least 3 times contiguously (meaning repeat-interval-repeat-interval-repeat) where the repeat's occurrences are Hamming distance <= 2 to its left-most occurrence. The search is performed such that the k-mer size and interval are maximized.
Usage: sbatch periodic_repeat.sh ${inputDt} ${outputDt} ${sequence_column_name}
1. inputDt - A file having at minimum the column sequence_column_name.
2. outputDt - inputDt, modified to have the following columns:
a. periodic_repeat: the 'left-most' repeat sequence;
b. repeat period: the interval between which repeats appear;
c. periodic_repeat_positions: the 0-indexed sequence coordinates at which the repeat appears;
d. periodic_repeat_Hamming_distances: the Hamming distances of the repeats whose positions are reported in periodic_repeat_positions;
e. periodic_repeat_mean_Hamming_distance: the mean of the reported Hamming distances. Note that Hamming distances are not reported circularly, or in comparing the left-most repeat occurrence to itself;
f. periodic_repeat_Xmer_entropy: for X in [3,4,5], the Shannon entropy of the distribution of k-mer counts when tiled from the repeat;
g. periodic_repeat_length: the length of periodic_repeat.
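The repeat-interval-repeat-interval-repeat condition can be sketched in Python for one candidate k-mer. This is an illustration only, not the code in periodic_repeat.sh: the function names are hypothetical, and the sketch assumes that the interval (gap) is the number of bases between the end of one copy and the start of the next. The maximization over k-mer size and interval is left out.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def periodic_occurrences(seq, start, k, gap, max_dist=2):
    """Positions and Hamming distances of contiguous copies of
    seq[start:start + k], each separated from the previous copy by gap bases.

    Assumption (hypothetical): gap counts the bases between copies, so
    consecutive copies start k + gap positions apart.
    """
    kmer = seq[start:start + k]
    positions, distances = [start], []
    pos = start + k + gap
    while pos + k <= len(seq):
        d = hamming(kmer, seq[pos:pos + k])
        if d > max_dist:
            break                    # the run of copies must be contiguous
        positions.append(pos)
        distances.append(d)
        pos += k + gap
    return positions, distances


def is_periodic_repeat(seq, start, k, gap, min_copies=3, max_dist=2):
    """True if at least min_copies contiguous copies occur at the fixed gap."""
    positions, _ = periodic_occurrences(seq, start, k, gap, max_dist)
    return len(positions) >= min_copies
```

For a sequence built as ACGTAA, TT, ACGTAA, GG, ACGAAA, the 6-mer at position 0 with a gap of 2 yields positions 0, 8 and 16 with distances 0 and 1, which satisfies the three-copy requirement.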
Usage here is simple. Make and move to a working directory specific to this instance of the job, then run sbatch mmseqs_search.sh ${input_fasta} ${input_fasta_database_name}. MMseqs2 will produce a database having this name from this FASTA, and it will search that database against my installation of the UniProtKB protein database.
Simply move to a specific working directory and run sbatch run_orfipy.sh ${input_fasta}. The script will extract ORFs of minimum length 36 nt, where partial 3' or 5' fragments are permitted. This means that an ORF will still be reported even if it is not closed at one end (for example, if the read ends before the ORF is closed).
Move to a specific working directory and run sbatch mmseqs2_build_and_cluster.sh ${input_protein_fasta} ${clustered_database_name (output)} ${fraction_coverage}. Fraction coverage sets how much of a sequence the alignment between a query and a target must cover for cluster membership. In this code, I use --cov-mode 0 (bidirectional coverage), which means that, with a fraction coverage of 0.8 for example, the alignment must cover at least 80% of both the query and the target when considering membership.