Various workflows, scripts and functions

himap.R

Rscript: bin/himap.R
This is a wrapper around the HiMAP package for 16S rDNA analysis. It processes fastq files to Operational Strain Units (OSU) abundance. The script is intended to be run from the command line. Try himap.R -h for help.

Usage:
Download the himap.R file and put it in your $PATH (e.g. $HOME/bin). For example in a terminal:

wget https://raw.githubusercontent.com/angelovangel/etc/master/bin/himap.R
mv himap.R $HOME/bin
chmod a+x $HOME/bin/himap.R

After that, the script can be run directly from terminal (but adjust the shebang to your system). Try himap.R --help as a start, it will give you an idea of what you need to run it. The script will also attempt to install the required R packages, if missing.

Merge lane-splitted `fastq` files

Function: bin/mergefq_name.R

Required: stringr
Sometimes you get fastq files which are split by lane, that is, the sequences from one sample are split in several different directories and files. This is the default behaviour of Basespace, or if the bcl2fastq was executed without the --no-lane-splitting option. The file structure in such cases might look like this:

├── 201903MW-125089967
│   └── FASTQ_Generation_2019-04-26_21_28_46Z-176397177
│       ├── MW2_100_L001-ds.a02e85d3e8f84a8ba6ef8909ea3fec30
│       │   ├── MW2-100_S4_L001_R1_001.fastq.gz
│       │   └── MW2-100_S4_L001_R2_001.fastq.gz
│       ├── MW2_100_L002-ds.3f91c8aa99f44b159d621bdbaecd5611
│       │   ├── MW2-100_S4_L002_R1_001.fastq.gz
│       │   └── MW2-100_S4_L002_R2_001.fastq.gz
...

This function merges the fastq files belonging to one sample, keeping the forward and reverse reads separate.

Usage:
Source the file and use the function in an R session. The mergefq_name() function takes these arguments:

fqdir - where to start looking for fastq files (the search is recursive). Default is the current directory.
pattern - regex pattern matching the sampleID part of the file.
dryrun - logical, whether to actually do the merge. Default is FALSE

devtools::source_url("https://raw.githubusercontent.com/angelovangel/etc/master/bin/mergefq_name.R") 
mergefq_name(pattern = "sampleID_regex_pattern")

Merge re-sequenced Illumina libraries

Function: bin/mergefq_reseq.R

Required: stringr magrittr

For example, sample (or library) named AA_01 was resequenced and the reseq was named AA2_01.
The original and the reseq files are (they will be usually in different folders):

 | old_seqs
    | AA_01_S4_R1_001.fastq
    | AA_01_S4_R2_001.fastq
    | AA_02_S5_R1_001.fastq
    | AA_02_S5_R2_00 1.fastq
 | new_seqs
    | AA2_01_S7_R1_001.fastq
    | AA2_01_S7_R2_001.fastq
    | AA2_02_S8_R1_001.fastq
    | AA2_02_S8_R2_001.fastq

Usage:

Download bacterial genomes from NCBI

Rscript: bin/get_refseq_genomes.R

Required: data.table, dplyr, purrr, optparse, stringr
This R script downloads bacterial genomes from ncbi, using an assembly_summary.txt to filter and get ftp paths to genomes of interest. System calls to rsync are used, so rsync has to be available on the system.

Usage: In a terminal, do bin/get_refseq_genomes.R -h to get an idea how to use it. For example, to download all representative and complete genomes, using a fresh assembly_summary.txt from ncbi :

get_refseq_genomes.R -urcd

The script supports filtering by taxid, i.e. get_refseq_genomes.R -u -d -t 1423 will download all Bacillus subtilis genomes. A great program for working with taxonomy data (and to get all taxids for a certain phylogenetic rank) is taxonkit.

Find the edit distance in two sets of sequences

Function: bin/edist.R
Function to count the number of differences between two strings of equal length, to be used for e.g. index pool designs in Illumina seq

Required: stringr

Usage:
INPUT
a = string1, b = string2, string2 can be a string vector!!!

OUTPUT
integer, edit distance (number of changes needed to get from a to b). If b is a string vector, a list is returned with position(minpos), edit distance (minedist) and sequence (minseq) of the most similar string in the vector

# for a single a vector, b can be a list:
edist(a, b)

# for a list of a arguments:
map_df(a, edist, b) 

# or even 
cbind(celero, map_df(celero$index1, edist, xt$i7_bases))

Advanced usage:
To check one character vector (charvec) of indices for the next closest index:

map_df(1:length(charvec), function(x) { edist(charvec[x], charvec[-x]) })

Subset protein `fasta` file

Function: bin/subset_proteins.R
This R function takes a string vector of protein fasta headers and a protein fasta file and returns the protein sequences with matching headers. Partial or exact match is supported (partial means that the string from beginning of the line upto the next whitespace is used in the search).

Required: Biostrings, stringr

Usage:
In an R session, do:

devtools::source_url("https://raw.githubusercontent.com/angelovangel/etc/master/bin/subset_proteins.R")

Then the function subset_proteins() should be available in your session.

Name		Name	Last commit message	Last commit date
Latest commit History 150 Commits
.vscode		.vscode
bin		bin
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Various workflows, scripts and functions

himap.R

Merge lane-splitted `fastq` files

Merge re-sequenced Illumina libraries

Download bacterial genomes from NCBI

Find the edit distance in two sets of sequences

Subset protein `fasta` file

About

Releases

Packages

Languages

angelovangel/etc

Folders and files

Latest commit

History

Repository files navigation

Various workflows, scripts and functions

himap.R

Merge lane-splitted fastq files

Merge re-sequenced Illumina libraries

Download bacterial genomes from NCBI

Find the edit distance in two sets of sequences

Subset protein fasta file

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Merge lane-splitted `fastq` files

Subset protein `fasta` file

Packages