Scripts used for the analysis of the P. cactorum isolate 10300. Note - all this work was performed in the directory: /home/groups/harrisonlab/project_files/idris . As such this script relies upon adhering to a specific directory structure.
The following is a summary of the work presented in this Readme.
The following processes were applied to the P. cactorum 10300 genome prior to analysis:
Repeatmasking Gene prediction Functional annotation
Analyses performed on these genomes involved BLAST searching for: Known Avr genes Ab initio prediction of putative RxLR, CRN and SSC effectors.
#Making local repositories for data
Data was downloaded for P. infestans isoalte T30-4 from ... This data was moved into the following directories:
Organism=P.infestans
Strain=T30-4
ProjDir=/home/groups/harrisonlab/project_files/idris
cd $ProjDir
mkdir -p assembly/external_group/$Organism/$Strain/cdna
mkdir -p assembly/external_group/$Organism/$Strain/cds
mkdir -p assembly/external_group/$Organism/$Strain/dna
mkdir -p assembly/external_group/$Organism/$Strain/ncrna
mkdir -p assembly/external_group/$Organism/$Strain/pep
Data was downloaded for P. parasitica isolate 310 from ... This data was moved into the following directories:
Organism=P.parasitica
Strain=310
ProjDir=/home/groups/harrisonlab/project_files/idris
cd $ProjDir
mkdir -p assembly/external_group/$Organism/$Strain/cdna
mkdir -p assembly/external_group/$Organism/$Strain/cds
mkdir -p assembly/external_group/$Organism/$Strain/dna
mkdir -p assembly/external_group/$Organism/$Strain/ncrna
mkdir -p assembly/external_group/$Organism/$Strain/pep
Data was downloaded for P. capsici isolate LT1534 from: http://genome.jgi.doe.gov/Phyca11/Phyca11.download.ftp.html This data was moved into the following directories:
Organism=P.capsici
Strain=LT1534
ProjDir=/home/groups/harrisonlab/project_files/idris
cd $ProjDir
mkdir -p assembly/external_group/$Organism/$Strain/dna
mkdir -p assembly/external_group/$Organism/$Strain/pep
mkdir -p assembly/external_group/$Organism/$Strain/annot
Data was downloaded for P. sojae isolate from ... This data was moved into the following directories:
Organism=P.sojae
Strain=67593
ProjDir=/home/groups/harrisonlab/project_files/idris
cd $ProjDir
mkdir -p assembly/external_group/$Organism/$Strain/cdna
mkdir -p assembly/external_group/$Organism/$Strain/cds
mkdir -p assembly/external_group/$Organism/$Strain/dna
mkdir -p assembly/external_group/$Organism/$Strain/ncrna
mkdir -p assembly/external_group/$Organism/$Strain/pep
To run the gene prediction and functional annotation scripts all spaces and pipe symbols had to be removed from the headers of assembly fasta files. This was performed using the following commands:
for File in $(ls assembly/external_group/P.*/*/dna/*.genome.fa); do
OutFile=$(echo $File | sed 's/.fa/.parsed.fa/g');
echo $OutFile; cat $File | sed 's/ /_/g' | sed 's/|/_/g' > $OutFile;
done
Repeat masking was performed on published genomes to allow comparison of repetative between species, including P. cactorum.
Pinf_ass=assembly/external_group/P.infestans/T30-4/dna/Phytophthora_infestans.ASM14294v1.26.dna.genome.parsed.fa
Ppar_ass=assembly/external_group/P.parisitica/310/dna/phytophthora_parasitica_inra_310.i2.scaffolds.genome.parsed.fa
Pcap_ass=assembly/external_group/P.capsici/LT1534/dna/Phyca11_unmasked_genomic_scaffolds.fasta
Psoj_ass=assembly/external_group/P.sojae/67593/dna/Phytophthora_sojae.ASM14975v1.26.dna.genome.parsed.fa
ProgDir=/home/armita/git_repos/emr_repos/tools/gene_prediction/cegma
for Genome in $Pinf_ass $Ppar_ass $Pcap_ass $Psoj_ass; do
ProgDir=/home/armita/git_repos/emr_repos/tools/seq_tools/repeat_masking
qsub $ProgDir/rep_modeling.sh $Genome
qsub $ProgDir/transposonPSI.sh $Genome
done
Gene prediction followed three steps: Pre-gene prediction - Quality of genome assemblies were assessed using Cegma to see how many core eukaryotic genes can be identified. Gene model training - Gene models were trained for the 10300 repeatmasked geneome using assembled RNAseq data and predicted CEGMA genes. Gene prediction - Gene models were used to predict genes in the 103033 genome. This used RNAseq data as hints for gene models.
Quality of genome assemblies was assessed by looking for the gene space in the assemblies.
This was first performed on the 10300 unmasked assembly:
Pinf_ass=assembly/external_group/P.infestans/T30-4/dna/Phytophthora_infestans.ASM14294v1.26.dna.genome.parsed.fa
Ppar_ass=assembly/external_group/P.parisitica/310/dna/phytophthora_parasitica_inra_310.i2.scaffolds.genome.parsed.fa
Pcap_ass=assembly/external_group/P.capsici/LT1534/dna/Phyca11_unmasked_genomic_scaffolds.fasta
Psoj_ass=assembly/external_group/P.sojae/67593/dna/Phytophthora_sojae.ASM14975v1.26.dna.genome.parsed.fa
ProgDir=/home/armita/git_repos/emr_repos/tools/gene_prediction/cegma
for Genome in $Pinf_ass $Ppar_ass $Pcap_ass $Psoj_ass; do
qsub $ProgDir/sub_cegma.sh $Genome dna
done
Open reading frame predictions were made using the atg.pl script as part of the path_pipe.sh pipeline. This pipeline also identifies open reading frames containing Signal peptide sequences and RxLRs. This pipeline was run with the following commands:
Pinf_ass=assembly/external_group/P.infestans/T30-4/dna/Phytophthora_infestans.ASM14294v1.26.dna.genome.parsed.fa
Ppar_ass=assembly/external_group/P.parisitica/310/dna/phytophthora_parasitica_inra_310.i2.scaffolds.genome.parsed.fa
Pcap_ass=assembly/external_group/P.capsici/LT1534/dna/Phyca11_unmasked_genomic_scaffolds.fasta
Psoj_ass=assembly/external_group/P.sojae/67593/dna/Phytophthora_sojae.ASM14975v1.26.dna.genome.parsed.fa
for Genome in $Pinf_ass $Ppar_ass $Pcap_ass $Psoj_ass; do
ProgDir=/home/armita/git_repos/emr_repos/tools/gene_prediction/ORF_finder
qsub $ProgDir/run_ORF_finder.sh $Genome
done
The Gff files from the the ORF finder are not in true Gff3 format. These were corrected using the following commands:
for ORF_Gff in $(ls gene_pred/ORF_finder/P.*/*/*_ORF.gff | grep -v '_atg_'); do
Strain=$(echo $ORF_Gff | rev | cut -f2 -d '/' | rev)
Organism=$(echo $ORF_Gff | rev | cut -f3 -d '/' | rev)
ProgDir=~/git_repos/emr_repos/tools/seq_tools/feature_annotation
ORF_Gff_mod=gene_pred/ORF_finder/$Organism/$Strain/"$Strain"_ORF_corrected.gff3
$ProgDir/gff_corrector.pl $ORF_Gff > $ORF_Gff_mod
done
#Functional annotation
#Genomic analysis
Putative RxLR genes were identified within Augustus gene models using a number of approaches:
- A) From Augustus gene models - Signal peptide & RxLR motif
- B) From Augustus gene models - Hmm evidence of WY domains
- C) From Augustus gene models - Hmm evidence of RxLR effectors
- D) From Augustus gene models - Hmm evidence of CRN effectors
- E) From ORF fragments - Signal peptide & RxLR motif
- F) From ORF fragments - Hmm evidence of WY domains
- G) From ORF fragments - Hmm evidence of RxLR effectors
Required programs:
- SigP
- biopython
Proteins that were predicted to contain signal peptides were identified using the following commands:
Pinf_pep=assembly/external_group/P.infestans/T30-4/pep/Phytophthora_infestans.ASM14294v1.26.pep.all.fa
Ppar_pep=assembly/external_group/P.parisitica/310/pep/phytophthora_parasitica_inra-310_2_proteins.pep.all.fa
Pcap_pep=assembly/external_group/P.capsici/LT1534/pep/Phyca11_filtered_proteins.fasta
Psoj_pep=assembly/external_group/P.sojae/67593/pep/Phytophthora_sojae.ASM14975v1.26.pep.all.fa
for Proteome in $Pinf_pep $Ppar_pep $Pcap_pep $Psoj_pep; do
echo "$Proteome"
SplitfileDir=/home/armita/git_repos/emr_repos/tools/seq_tools/feature_annotation/signal_peptides
ProgDir=/home/armita/git_repos/emr_repos/tools/seq_tools/feature_annotation/signal_peptides
Strain=$(echo $Proteome | rev | cut -f3 -d '/' | rev)
Organism=$(echo $Proteome | rev | cut -f4 -d '/' | rev)
SplitDir=gene_pred/published_split/$Organism/$Strain
mkdir -p $SplitDir
BaseName="$Organism""_$Strain"_published_preds
$SplitfileDir/splitfile_500.py --inp_fasta $Proteome --out_dir $SplitDir --out_base $BaseName
for File in $(ls $SplitDir/*_published_preds_*); do
Jobs=$(qstat | grep 'pred_sigP' | wc -l)
while [ $Jobs -ge 32 ]; do
sleep 10
printf "."
Jobs=$(qstat | grep 'pred_sigP' | wc -l)
done
printf "\n"
echo $File
qsub $ProgDir/pred_sigP.sh $File
# qsub $ProgDir/pred_sigP.sh $File signalp-4.1
done
done
The batch files of predicted secreted proteins needed to be combined into a single file for each strain. This was done with the following commands:
for SplitDir in $(ls -d gene_pred/published_split/P.*/*); do
Strain=$(echo $SplitDir | rev | cut -d '/' -f1 | rev)
Organism=$(echo $SplitDir | rev | cut -d '/' -f2 | rev)
InStringAA=''
InStringNeg=''
InStringTab=''
InStringTxt=''
for GRP in $(ls -l $SplitDir/*_published_preds_*.fa | rev | cut -d '_' -f1 | rev | sort -n); do
InStringAA="$InStringAA gene_pred/published_sigP/$Organism/$Strain/split/"$Organism"_"$Strain"_published_preds_$GRP""_sp.aa";
InStringNeg="$InStringNeg gene_pred/published_sigP/$Organism/$Strain/split/"$Organism"_"$Strain"_published_preds_$GRP""_sp_neg.aa";
InStringTab="$InStringTab gene_pred/published_sigP/$Organism/$Strain/split/"$Organism"_"$Strain"_published_preds_$GRP""_sp.tab";
InStringTxt="$InStringTxt gene_pred/published_sigP/$Organism/$Strain/split/"$Organism"_"$Strain"_published_preds_$GRP""_sp.txt";
done
cat $InStringAA > gene_pred/published_sigP/$Organism/$Strain/"$Strain"_pub_sp.aa
cat $InStringNeg > gene_pred/published_sigP/$Organism/$Strain/"$Strain"_pub_neg_sp.aa
tail -n +2 -q $InStringTab > gene_pred/published_sigP/$Organism/$Strain/"$Strain"_pub_sp.tab
cat $InStringTxt > gene_pred/published_sigP/$Organism/$Strain/"$Strain"_pub_sp.txt
done
The regular expression R.LR.{,40}[ED][ED][KR] has previously been used to identify RxLR effectors. The addition of an EER motif is significant as it has been shown as required for host uptake of the protein.
The RxLR_EER_regex_finder.py script was used to search for this regular expression and annotate the EER domain where present.
for Secretome in $(ls gene_pred/published_sigP/*/*/*pub_sp.aa); do
ProgDir=~/git_repos/emr_repos/tools/pathogen/RxLR_effectors;
Strain=$(echo $Secretome | rev | cut -d '/' -f2 | rev);
Organism=$(echo $Secretome | rev | cut -d '/' -f3 | rev) ;
OutDir=analysis/RxLR_effectors/RxLR_EER_regex_finder/"$Organism"/"$Strain";
mkdir -p $OutDir;
printf "\nstrain: $Strain\tspecies: $Organism\n";
printf "the number of SigP gene is:\t";
cat $Secretome | grep '>' | wc -l;
printf "the number of SigP-RxLR genes are:\t";
$ProgDir/RxLR_EER_regex_finder.py $Secretome > $OutDir/"$Strain"_pub_RxLR_regex.fa;
cat $OutDir/"$Strain"_pub_RxLR_regex.fa | grep '>' | cut -f1 | tr -d '>' | sed -r 's/\.t.*//' > $OutDir/"$Strain"_pub_RxLR_regex.txt
cat $OutDir/"$Strain"_pub_RxLR_regex.txt | wc -l
printf "the number of SigP-RxLR-EER genes are:\t";
cat $OutDir/"$Strain"_pub_RxLR_regex.fa | grep '>' | grep 'EER_motif_start' | cut -f1 | tr -d '>' | sed -r 's/\.t.*//' > $OutDir/"$Strain"_pub_RxLR_EER_regex.txt
cat $OutDir/"$Strain"_pub_RxLR_EER_regex.txt | wc -l
printf "\n"
ProgDir=/home/armita/git_repos/emr_repos/tools/gene_prediction/ORF_finder
# $ProgDir/extract_from_fasta.py --fasta $OutDir/"$Strain"_pub_RxLR_regex.fa --headers $OutDir/"$Strain"_pub_RxLR_EER_regex.txt > $OutDir/"$Strain"_pub_RxLR_EER_regex.fa
# GeneModels=$(ls assembly/external_group/P.*/$Strain/pep/*.gff*)
# cat $GeneModels | grep -w -f $OutDir/"$Strain"_pub_RxLR_regex.txt > $OutDir/"$Strain"_pub_RxLR_regex.gff3
# cat $GeneModels | grep -w -f $OutDir/"$Strain"_pub_RxLR_EER_regex.txt > $OutDir/"$Strain"_pub_RxLR_EER_regex.gff3
done
strain: LT1534 species: P.capsici
the number of SigP gene is: 1650
the number of SigP-RxLR genes are: 161
the number of SigP-RxLR-EER genes are: 108
strain: T30-4 species: P.infestans
the number of SigP gene is: 2187
the number of SigP-RxLR genes are: 510
the number of SigP-RxLR-EER genes are: 357
strain: 310 species: P.parisitica
the number of SigP gene is: 2316
the number of SigP-RxLR genes are: 312
the number of SigP-RxLR-EER genes are: 206
strain: 67593 species: P.sojae
the number of SigP gene is: 2515
the number of SigP-RxLR genes are: 231
the number of SigP-RxLR-EER genes are: 120
Hmm models for the WY domain contained in many RxLRs were used to search gene models predicted with Braker1. These were run with the following commands:
ProgDir=/home/armita/git_repos/emr_repos/scripts/phytophthora/pathogen/hmmer
HmmModel=/home/armita/git_repos/emr_repos/scripts/phytophthora/pathogen/hmmer/WY_motif.hmm
for Proteome in $Pinf_pep $Ppar_pep $Pcap_pep $Psoj_pep; do
Strain=$(echo $Proteome | rev | cut -f3 -d '/' | rev)
Organism=$(echo $Proteome | rev | cut -f4 -d '/' | rev)
OutDir=analysis/RxLR_effectors/hmmer_WY/$Organism/$Strain
mkdir -p $OutDir
HmmResults="$Strain"_pub_WY_hmmer.txt
hmmsearch -T 0 $HmmModel $Proteome > $OutDir/$HmmResults
echo "$Organism $Strain"
cat $OutDir/$HmmResults | grep 'Initial search space'
cat $OutDir/$HmmResults | grep 'number of targets reported over threshold'
HmmFasta="$Strain"_pub_WY_hmmer.fa
$ProgDir/hmmer2fasta.pl $OutDir/$HmmResults $Proteome > $OutDir/$HmmFasta
Headers="$Strain"_pub_WY_hmmer_headers.txt
cat $OutDir/$HmmFasta | grep '>' | cut -f1 | tr -d '>' | sed -r 's/\.t.*//' > $OutDir/$Headers
# GeneModels=$(ls assembly/external_group/P.*/$Strain/pep/*.gff*)
# cat $GeneModels | grep -w -f $OutDir/$Headers > $OutDir/"$Strain"_pub_WY_hmmer.gff3
done
P.infestans T30-4
Initial search space (Z): 17787 [actual number of targets]
Domain search space (domZ): 267 [number of targets reported over threshold]
P.parisitica 310
Initial search space (Z): 20822 [actual number of targets]
Domain search space (domZ): 257 [number of targets reported over threshold]
P.capsici LT1534
Initial search space (Z): 19805 [actual number of targets]
Domain search space (domZ): 106 [number of targets reported over threshold]
P.sojae 67593
Initial search space (Z): 18969 [actual number of targets]
Domain search space (domZ): 147 [number of targets reported over threshold]
for Proteome in $Pinf_pep $Ppar_pep $Pcap_pep $Psoj_pep; do
ProgDir=/home/armita/git_repos/emr_repos/scripts/phytophthora/pathogen/hmmer
HmmModel=/home/armita/git_repos/emr_repos/SI_Whisson_et_al_2007/cropped.hmm
Strain=$(echo $Proteome | rev | cut -f3 -d '/' | rev)
Organism=$(echo $Proteome | rev | cut -f4 -d '/' | rev)
OutDir=analysis/RxLR_effectors/hmmer_RxLR/$Organism/$Strain
mkdir -p $OutDir
HmmResults="$Strain"_pub_RxLR_hmmer.txt
hmmsearch -T 0 $HmmModel $Proteome > $OutDir/$HmmResults
echo "$Organism $Strain"
cat $OutDir/$HmmResults | grep 'Initial search space'
cat $OutDir/$HmmResults | grep 'number of targets reported over threshold'
HmmFasta="$Strain"_pub_RxLR_hmmer.fa
$ProgDir/hmmer2fasta.pl $OutDir/$HmmResults $Proteome > $OutDir/$HmmFasta
Headers="$Strain"_pub_RxLR_hmmer_headers.txt
cat $OutDir/$HmmFasta | grep '>' | cut -f1 | tr -d '>' > $OutDir/$Headers
# # ProgDir=/home/armita/git_repos/emr_repos/tools/seq_tools/feature_annotation
# Col2=cropped.hmm
# GeneModels=$(ls assembly/external_group/P.*/$Strain/pep/*.gff*)
# # $ProgDir/gene_list_to_gff.pl $OutDir/$Headers $GeneModels $Col2 Name > $OutDir/"$Strain"_pub_RxLR_hmmer.gff3
# cat $GeneModels | grep -w -f $OutDir/$Headers > $OutDir/"$Strain"_pub_RxLR_hmmer.gff3
done
P.infestans T30-4
Initial search space (Z): 17787 [actual number of targets]
Domain search space (domZ): 290 [number of targets reported over threshold]
P.parisitica 310
Initial search space (Z): 20822 [actual number of targets]
Domain search space (domZ): 217 [number of targets reported over threshold]
P.capsici LT1534
Initial search space (Z): 19805 [actual number of targets]
Domain search space (domZ): 84 [number of targets reported over threshold]
P.sojae 67593
Initial search space (Z): 18969 [actual number of targets]
Domain search space (domZ): 127 [number of targets reported over threshold]
A hmm model relating to crinkler domains was used to identify putative crinklers in Augustus gene models. This was done with the following commands:
ProgDir=/home/armita/git_repos/emr_repos/scripts/phytophthora/pathogen/hmmer
HmmModel=/home/armita/git_repos/emr_repos/scripts/phytophthora/pathogen/hmmer/Phyt_annot_CRNs_D1.hmm
for Proteome in $Pinf_pep $Ppar_pep $Pcap_pep $Psoj_pep; do
Strain=$(echo $Proteome | rev | cut -f3 -d '/' | rev)
Organism=$(echo $Proteome | rev | cut -f4 -d '/' | rev)
OutDir=analysis/CRN_effectors/hmmer_CRN/$Organism/$Strain
mkdir -p $OutDir
HmmResults="$Strain"_pub_CRN_hmmer.txt
hmmsearch -T 0 $HmmModel $Proteome > $OutDir/$HmmResults
echo "$Organism $Strain"
cat $OutDir/$HmmResults | grep 'Initial search space'
cat $OutDir/$HmmResults | grep 'number of targets reported over threshold'
HmmFasta="$Strain"_pub_CRN_hmmer_out.fa
$ProgDir/hmmer2fasta.pl $OutDir/$HmmResults $Proteome > $OutDir/$HmmFasta
# Headers="$Strain"_pub_RxLR_hmmer_headers.txt
# cat $OutDir/$HmmFasta | grep '>' | cut -f1 | tr -d '>' | sed -r 's/\.t.*//' > $OutDir/$Headers
# GeneModels=$(ls assembly/external_group/P.*/$Strain/pep/*.gff*)
# cat $GeneModels | grep -w -f $OutDir/$Headers > $OutDir/"$Strain"_pub_CRN_hmmer.gff3
done
P.infestans T30-4
Initial search space (Z): 17787 [actual number of targets]
Domain search space (domZ): 187 [number of targets reported over threshold]
P.parisitica 310
Initial search space (Z): 20822 [actual number of targets]
Domain search space (domZ): 42 [number of targets reported over threshold]
P.capsici LT1534
Initial search space (Z): 19805 [actual number of targets]
Domain search space (domZ): 106 [number of targets reported over threshold]
P.sojae 67593
Initial search space (Z): 18969 [actual number of targets]
Domain search space (domZ): 81 [number of targets reported over threshold]
Required programs:
- SigP
- biopython
Proteins that were predicted to contain signal peptides were identified using the following commands:
Pinf_ORF=gene_pred/ORF_finder/P.infestans/T30-4/T30-4.aa_cat.fa
Ppar_ORF=gene_pred/ORF_finder/P.parisitica/310/310.aa_cat.fa
Pcap_ORF=gene_pred/ORF_finder/P.capsici/LT1534/LT1534.aa_cat.fa
Psoj_ORF=gene_pred/ORF_finder/P.sojae/67593/67593.aa_cat.fa
for Proteome in $Pinf_ORF $Ppar_ORF $Pcap_ORF $Psoj_ORF; do
echo "$Proteome"
SplitfileDir=/home/armita/git_repos/emr_repos/tools/seq_tools/feature_annotation/signal_peptides
ProgDir=/home/armita/git_repos/emr_repos/tools/seq_tools/feature_annotation/signal_peptides
Strain=$(echo $Proteome | rev | cut -f2 -d '/' | rev)
Organism=$(echo $Proteome | rev | cut -f3 -d '/' | rev)
SplitDir=gene_pred/ORF_split/$Organism/$Strain
mkdir -p $SplitDir
BaseName="$Organism""_$Strain"_ORF_preds
$SplitfileDir/splitfile_500.py --inp_fasta $Proteome --out_dir $SplitDir --out_base $BaseName
for File in $(ls $SplitDir/*_ORF_preds_*); do
Jobs=$(qstat | grep 'pred_sigP' | wc -l)
while [ $Jobs -ge 80 ]; do
sleep 10
printf "."
Jobs=$(qstat | grep 'pred_sigP' | wc -l)
done
printf "\n"
echo $File
qsub $ProgDir/pred_sigP.sh $File
# qsub $ProgDir/pred_sigP.sh $File signalp-4.1
done
done
The batch files of predicted secreted proteins needed to be combined into a single file for each strain. This was done with the following commands:
for SplitDir in $(ls -d gene_pred/ORF_split/P.*/*); do
Strain=$(echo $SplitDir | rev | cut -d '/' -f1 | rev)
Organism=$(echo $SplitDir | rev | cut -d '/' -f2 | rev)
InStringAA=''
InStringNeg=''
InStringTab=''
InStringTxt=''
for GRP in $(ls -l $SplitDir/*_ORF_preds_*.fa | rev | cut -d '_' -f1 | rev | sort -n); do
InStringAA="$InStringAA gene_pred/ORF_sigP/$Organism/$Strain/split/"$Organism"_"$Strain"_ORF_preds_$GRP""_sp.aa";
InStringNeg="$InStringNeg gene_pred/ORF_sigP/$Organism/$Strain/split/"$Organism"_"$Strain"_ORF_preds_$GRP""_sp_neg.aa";
InStringTab="$InStringTab gene_pred/ORF_sigP/$Organism/$Strain/split/"$Organism"_"$Strain"_ORF_preds_$GRP""_sp.tab";
InStringTxt="$InStringTxt gene_pred/ORF_sigP/$Organism/$Strain/split/"$Organism"_"$Strain"_ORF_preds_$GRP""_sp.txt";
done
cat $InStringAA > gene_pred/ORF_sigP/$Organism/$Strain/"$Strain"_ORF_sp.aa
cat $InStringNeg > gene_pred/ORF_sigP/$Organism/$Strain/"$Strain"_ORF_neg_sp.aa
tail -n +2 -q $InStringTab > gene_pred/ORF_sigP/$Organism/$Strain/"$Strain"_ORF_sp.tab
cat $InStringTxt > gene_pred/ORF_sigP/$Organism/$Strain/"$Strain"_ORF_sp.txt
done
Names of ORFs containing signal peptides were extracted from fasta files. This included information on the position and hmm score of RxLRs.
for FastaFile in $(ls gene_pred/ORF_sigP/*/*/*_ORF_sp.aa); do
Strain=$(echo $FastaFile | rev | cut -d '/' -f2 | rev)
Organism=$(echo $FastaFile | rev | cut -d '/' -f3 | rev)
echo "$Strain"
SigP_headers=gene_pred/ORF_sigP/$Organism/$Strain/"$Strain"_ORF_sp_names.txt
cat $FastaFile | grep '>' | sed -r 's/>//g' | sed -r 's/\s+/\t/g'| sed 's/=\t/=/g' | sed 's/--//g' > $SigP_headers
done
for FastaFile in $(ls gene_pred/ORF_sigP/*/*/*_ORF_sp.aa | grep -e 'sojae' -e 'infestans'); do
Strain=$(echo $FastaFile | rev | cut -d '/' -f2 | rev)
Organism=$(echo $FastaFile | rev | cut -d '/' -f3 | rev)
echo "$Strain"
SigP_headers=gene_pred/ORF_sigP/$Organism/$Strain/"$Strain"_ORF_sp_names.txt
cat $FastaFile | grep '>' | sed -r 's/>//g' | sed -r 's/\s+/\t/g'| sed 's/=\t/=/g' | sed 's/--//g' | sed 's/:/_a_/g' | sed 's/supercont1./supercont1_b_/g'> $SigP_headers
done
Due to the nature of predicting ORFs, some features overlapped with one another. A single ORF was selected from each set of overlapped ORFs. This was was selected on the basis of its SignalP Hmm score. Biopython was used to identify overlapps and identify the ORF with the best signalP score.
for SigP_fasta in $(ls gene_pred/ORF_sigP/P.infestans/*/*_ORF_sp.aa ); do
Strain=$(echo $SigP_fasta | rev | cut -d '/' -f2 | rev)
Organism=$(echo $SigP_fasta | rev | cut -d '/' -f3 | rev)
echo "$Strain"
ORF_Gff=gene_pred/ORF_finder/$Organism/$Strain/"$Strain"_ORF_corrected.gff3
SigP_fasta=gene_pred/ORF_sigP/$Organism/$Strain/"$Strain"_ORF_sp.aa
SigP_headers=gene_pred/ORF_sigP/$Organism/$Strain/"$Strain"_ORF_sp_names.txt
SigP_Gff=gene_pred/ORF_sigP/$Organism/$Strain/"$Strain"_ORF_sp_unmerged.gff
SigP_Merged_Gff=gene_pred/ORF_sigP/$Organism/$Strain/"$Strain"_ORF_sp_merged.gff
SigP_Merged_txt=gene_pred/ORF_sigP/$Organism/$Strain/"$Strain"_ORF_sp_merged.txt
SigP_Merged_AA=gene_pred/ORF_sigP/$Organism/$Strain/"$Strain"_ORF_sp_merged.aa
ProgDir=/home/armita/git_repos/emr_repos/tools/gene_prediction/ORF_finder
$ProgDir/extract_gff_for_sigP_hits.pl $SigP_headers $ORF_Gff SigP Name > $SigP_Gff
ProgDir=~/git_repos/emr_repos/scripts/phytophthora/pathogen/merge_gff
$ProgDir/make_gff_database.py --inp $SigP_Gff --db sigP_ORF.db
ProgDir=/home/armita/git_repos/emr_repos/tools/gene_prediction/ORF_finder
$ProgDir/merge_sigP_ORFs.py --inp sigP_ORF.db --id sigP_ORF --out sigP_ORF_merged.db --gff > $SigP_Merged_Gff
cat $SigP_Merged_Gff | grep 'transcript' | rev | cut -f1 -d'=' | rev > $SigP_Merged_txt
$ProgDir/extract_from_fasta.py --fasta $SigP_fasta --headers $SigP_Merged_txt > $SigP_Merged_AA
done
for SigP_fasta in $(ls gene_pred/ORF_sigP/P.*/*/*_ORF_sp.aa | grep -e 'sojae' -e 'infestans'); do
Strain=$(echo $SigP_fasta | rev | cut -d '/' -f2 | rev)
Organism=$(echo $SigP_fasta | rev | cut -d '/' -f3 | rev)
echo "$Strain"
ORF_Gff=gene_pred/ORF_finder/$Organism/$Strain/"$Strain"_ORF_corrected.gff3
SigP_fasta=gene_pred/ORF_sigP/$Organism/$Strain/"$Strain"_ORF_sp.aa
SigP_headers=gene_pred/ORF_sigP/$Organism/$Strain/"$Strain"_ORF_sp_names.txt
SigP_Gff=gene_pred/ORF_sigP/$Organism/$Strain/"$Strain"_ORF_sp_unmerged.gff
SigP_Merged_Gff=gene_pred/ORF_sigP/$Organism/$Strain/"$Strain"_ORF_sp_merged.gff
SigP_Merged_txt=gene_pred/ORF_sigP/$Organism/$Strain/"$Strain"_ORF_sp_merged.txt
SigP_Merged_AA=gene_pred/ORF_sigP/$Organism/$Strain/"$Strain"_ORF_sp_merged.aa
cat $ORF_Gff | sed 's/:/_a_/g' | sed 's/supercont1./supercont1_b_/g' > tmp.gff
ProgDir=/home/armita/git_repos/emr_repos/tools/gene_prediction/ORF_finder
$ProgDir/extract_gff_for_sigP_hits.pl $SigP_headers tmp.gff SigP Name > $SigP_Gff
ProgDir=~/git_repos/emr_repos/scripts/phytophthora/pathogen/merge_gff
$ProgDir/make_gff_database.py --inp $SigP_Gff --db sigP_ORF.db
ProgDir=/home/armita/git_repos/emr_repos/tools/gene_prediction/ORF_finder
$ProgDir/merge_sigP_ORFs.py --inp sigP_ORF.db --id sigP_ORF --out sigP_ORF_merged.db --gff | sed 's/_a_/:/g' | sed 's/supercont1_b_/supercont1./g' > $SigP_Merged_Gff
cat $SigP_Merged_Gff | grep 'transcript' | rev | cut -f1 -d'=' | rev > $SigP_Merged_txt
$ProgDir/extract_from_fasta.py --fasta $SigP_fasta --headers $SigP_Merged_txt > $SigP_Merged_AA
done
The regular expression R.LR.{,40}[ED][ED][KR] has previously been used to identify RxLR effectors. The addition of an EER motif is significant as it has been shown as required for host uptake of the protein.
The RxLR_EER_regex_finder.py script was used to search for this regular expression and annotate the EER domain where present.
for Secretome in $(ls gene_pred/ORF_sigP/P.*/*/*_ORF_sp_merged.aa); do
ProgDir=~/git_repos/emr_repos/tools/pathogen/RxLR_effectors
Strain=$(echo $Secretome | rev | cut -d '/' -f2 | rev);
Organism=$(echo $Secretome | rev | cut -d '/' -f3 | rev) ;
OutDir=analysis/RxLR_effectors/RxLR_EER_regex_finder/"$Organism"/"$Strain";
mkdir -p $OutDir;
printf "\nstrain: $Strain\tspecies: $Organism\n";
printf "the number of SigP gene is:\t";
cat $Secretome | grep '>' | wc -l;
printf "the number of SigP-RxLR genes are:\t";
$ProgDir/RxLR_EER_regex_finder.py $Secretome > $OutDir/"$Strain"_ORF_RxLR_EER_regex.fa;
cat $OutDir/"$Strain"_ORF_RxLR_EER_regex.fa | grep '>' | cut -f1 | tr -d '>' | sed -r 's/\.t.*//' | tr -d ' ' > $OutDir/"$Strain"_ORF_RxLR_regex.txt
cat $OutDir/"$Strain"_ORF_RxLR_regex.txt | wc -l
printf "the number of SigP-RxLR-EER genes are:\t";
cat $OutDir/"$Strain"_ORF_RxLR_EER_regex.fa | grep '>' | grep 'EER_motif_start' | cut -f1 | tr -d '>' | sed -r 's/\.t.*//' | tr -d ' '> $OutDir/"$Strain"_ORF_RxLR_EER_regex.txt
cat $OutDir/"$Strain"_ORF_RxLR_EER_regex.txt | wc -l
printf "\n"
ProgDir=/home/armita/git_repos/emr_repos/tools/seq_tools/feature_annotation
$ProgDir/gene_list_to_gff.pl $OutDir/"$Strain"_ORF_RxLR_regex.txt $SigP_Merged_Gff RxLR_EER_regex_finder.py Name Augustus > $OutDir/"$Strain"_ORF_RxLR_regex.gff
ProgDir=/home/armita/git_repos/emr_repos/tools/seq_tools/feature_annotation
$ProgDir/gene_list_to_gff.pl $OutDir/"$Strain"_ORF_RxLR_EER_regex.txt $SigP_Merged_Gff RxLR_EER_regex_finder.py Name Augustus > $OutDir/"$Strain"_ORF_RxLR_EER_regex.gff
done
strain: 10300 species: P.cactorum
the number of SigP gene is: 15271
the number of SigP-RxLR genes are: 935
the number of SigP-RxLR-EER genes are: 170
strain: LT1534 species: P.capsici
the number of SigP gene is: 14865
the number of SigP-RxLR genes are: 1004
the number of SigP-RxLR-EER genes are: 254
strain: T30-4 species: P.infestans
the number of SigP gene is: 37637
the number of SigP-RxLR genes are: 2825
the number of SigP-RxLR-EER genes are: 417
strain: 310 species: P.parisitica
the number of SigP gene is: 14526
the number of SigP-RxLR genes are: 949
the number of SigP-RxLR-EER genes are: 309
strain: 67593 species: P.sojae
the number of SigP gene is: 24222
the number of SigP-RxLR genes are: 1727
the number of SigP-RxLR-EER genes are: 282
Hmm models for the WY domain contained in many RxLRs were used to search ORFs predicted with atg.pl. These were run with the following commands:
for Secretome in $(ls gene_pred/ORF_sigP/P.*/*/*_ORF_sp_merged.aa); do
ProgDir=/home/armita/git_repos/emr_repos/scripts/phytophthora/pathogen/hmmer
HmmModel=/home/armita/git_repos/emr_repos/scripts/phytophthora/pathogen/hmmer/WY_motif.hmm
Strain=$(echo $Secretome | rev | cut -f2 -d '/' | rev)
Organism=$(echo $Secretome | rev | cut -f3 -d '/' | rev)
OutDir=analysis/RxLR_effectors/hmmer_WY/$Organism/$Strain
mkdir -p $OutDir
HmmResults="$Strain"_ORF_WY_hmmer.txt
hmmsearch -T 0 $HmmModel $Secretome > $OutDir/$HmmResults
echo "$Organism $Strain"
cat $OutDir/$HmmResults | grep 'Initial search space'
cat $OutDir/$HmmResults | grep 'number of targets reported over threshold'
HmmFasta="$Strain"_ORF_WY_hmmer.fa
$ProgDir/hmmer2fasta.pl $OutDir/$HmmResults $Secretome > $OutDir/$HmmFasta
Headers="$Strain"_ORF_WY_hmmer_headers.txt
cat $OutDir/$HmmFasta | grep '>' | cut -f1 | tr -d '>' | sed -r 's/\.t.*//' | tr -d ' ' > $OutDir/$Headers
SigP_Merged_Gff=gene_pred/ORF_sigP/$Organism/$Strain/"$Strain"_ORF_sp_merged.gff
ProgDir=/home/armita/git_repos/emr_repos/tools/seq_tools/feature_annotation
$ProgDir/gene_list_to_gff.pl $OutDir/$Headers $SigP_Merged_Gff $HmmModel Name Augustus > $OutDir/"$Strain"_ORF_WY_hmmer.gff
done
P.cactorum 10300
Initial search space (Z): 15271 [actual number of targets]
Domain search space (domZ): 113 [number of targets reported over threshold]
P.capsici LT1534
Initial search space (Z): 14865 [actual number of targets]
Domain search space (domZ): 117 [number of targets reported over threshold]
P.infestans T30-4
Initial search space (Z): 37637 [actual number of targets]
Domain search space (domZ): 244 [number of targets reported over threshold]
P.parisitica 310
Initial search space (Z): 14526 [actual number of targets]
Domain search space (domZ): 175 [number of targets reported over threshold]
P.sojae 67593
Initial search space (Z): 24222 [actual number of targets]
Domain search space (domZ): 159 [number of targets reported over threshold]
for Secretome in $(ls gene_pred/ORF_sigP/P.*/*/*_ORF_sp_merged.aa); do
ProgDir=/home/armita/git_repos/emr_repos/scripts/phytophthora/pathogen/hmmer
HmmModel=/home/armita/git_repos/emr_repos/SI_Whisson_et_al_2007/cropped.hmm
Strain=$(echo $Secretome | rev | cut -f2 -d '/' | rev)
Organism=$(echo $Secretome | rev | cut -f3 -d '/' | rev)
OutDir=analysis/RxLR_effectors/hmmer_RxLR/$Organism/$Strain
mkdir -p $OutDir
HmmResults="$Strain"_ORF_RxLR_hmmer.txt
hmmsearch -T 0 $HmmModel $Secretome > $OutDir/$HmmResults
echo "$Organism $Strain"
cat $OutDir/$HmmResults | grep 'Initial search space'
cat $OutDir/$HmmResults | grep 'number of targets reported over threshold'
HmmFasta="$Strain"_ORF_RxLR_hmmer.fa
$ProgDir/hmmer2fasta.pl $OutDir/$HmmResults $Secretome > $OutDir/$HmmFasta
Headers="$Strain"_ORF_RxLR_hmmer_headers.txt
cat $OutDir/$HmmFasta | grep '>' | cut -f1 | tr -d '>' | sed -r 's/\.t.*//' | tr -d ' ' > $OutDir/$Headers
SigP_Merged_Gff=gene_pred/ORF_sigP/$Organism/$Strain/"$Strain"_ORF_sp_merged.gff
ProgDir=/home/armita/git_repos/emr_repos/tools/seq_tools/feature_annotation
$ProgDir/gene_list_to_gff.pl $OutDir/$Headers $SigP_Merged_Gff $HmmModel Name Augustus > $OutDir/"$Strain"_ORF_RxLR_hmmer.gff3
done
P.cactorum 10300
Initial search space (Z): 15271 [actual number of targets]
Domain search space (domZ): 145 [number of targets reported over threshold]
P.capsici LT1534
Initial search space (Z): 14865 [actual number of targets]
Domain search space (domZ): 165 [number of targets reported over threshold]
P.infestans T30-4
Initial search space (Z): 37637 [actual number of targets]
Domain search space (domZ): 284 [number of targets reported over threshold]
P.parisitica 310
Initial search space (Z): 14526 [actual number of targets]
Domain search space (domZ): 244 [number of targets reported over threshold]
P.sojae 67593
Initial search space (Z): 24222 [actual number of targets]
Domain search space (domZ): 228 [number of targets reported over threshold]
A hmm model relating to crinkler domains was used to identify putative crinklers in ORF gene models. This was done with the following commands:
for Proteome in $(ls gene_pred/ORF_finder/*/*/*.aa_cat.fa); do
ProgDir=/home/armita/git_repos/emr_repos/scripts/phytophthora/pathogen/hmmer
HmmModel=/home/armita/git_repos/emr_repos/scripts/phytophthora/pathogen/hmmer/Phyt_annot_CRNs_D1.hmm
Strain=$(echo $Proteome | rev | cut -f2 -d '/' | rev)
Organism=$(echo $Proteome | rev | cut -f3 -d '/' | rev)
OutDir=analysis/CRN_effectors/hmmer_CRN/$Organism/$Strain
mkdir -p $OutDir
HmmResults="$Strain"_ORF_CRN_unmerged_hmmer.txt
hmmsearch -T 0 $HmmModel $Proteome > $OutDir/$HmmResults
echo "$Organism $Strain"
cat $OutDir/$HmmResults | grep 'Initial search space'
cat $OutDir/$HmmResults | grep 'number of targets reported over threshold'
HmmFasta="$Strain"_ORF_CRN_hmmer_unmerged_out.fa
$ProgDir/hmmer2fasta.pl $OutDir/$HmmResults $Proteome > $OutDir/$HmmFasta
Headers="$Strain"_CRN_hmmer_unmerged_headers.txt
cat $OutDir/$HmmFasta | grep '>' | tr -d '>' | sed -r 's/\s+/\t/g'| sed 's/=\t/=/g' | tr -d '-' | sed 's/hmm_score/HMM_score/g' > $OutDir/$Headers
cat $OutDir/$Headers | sed 's/:/_a_/g' | sed 's/supercont1./supercont1_b_/g' > tmp.txt
ORF_Gff=$(ls gene_pred/ORF_finder/$Organism/$Strain/*_ORF_corrected.gff3)
cat $ORF_Gff | sed 's/:/_a_/g' | sed 's/supercont1./supercont1_b_/g' > tmp.gff
CRN_unmerged_Gff=$OutDir/"$Strain"_CRN_unmerged_hmmer.gff3
ProgDir=/home/armita/git_repos/emr_repos/tools/gene_prediction/ORF_finder
$ProgDir/extract_gff_for_sigP_hits.pl tmp.txt tmp.gff CRN_HMM Name > $CRN_unmerged_Gff
DbDir=analysis/databases/$Organism/$Strain
mkdir -p $DbDir
ProgDir=~/git_repos/emr_repos/scripts/phytophthora/pathogen/merge_gff
$ProgDir/make_gff_database.py --inp $CRN_unmerged_Gff --db $DbDir/CRN_ORF.db
CRN_Merged_Gff=$OutDir/"$Strain"_CRN_merged_hmmer.gff3
ProgDir=/home/armita/git_repos/emr_repos/tools/gene_prediction/ORF_finder
$ProgDir/merge_sigP_ORFs.py --inp $DbDir/CRN_ORF.db --id $HmmModel --out $DbDir/CRN_ORF_merged.db --gff > $CRN_Merged_Gff
sed -i 's/_a_/:/g' $CRN_Merged_Gff
sed -i 's/supercont1_b_/supercont1./g' $CRN_Merged_Gff
echo "Number of CRN ORFs after merging:"
cat $CRN_Merged_Gff | grep 'gene' | wc -l
rm tmp.txt
rm tmp.gff
done
P.cactorum 10300 Initial search space (Z): 459307 [actual number of targets] Domain search space (domZ): 225 [number of targets reported over threshold]
Due to RxLR effectors being predicted from a number of sources the number of unique RxLRs were identified from motif and Hmm searches within gene models. 221 RxLR effectors were predicted in total from the P.cactorum genome. Of these, 90 were shared between both datasets.
for InDir in $(ls -d analysis/RxLR_effectors/RxLR_EER_regex_finder/*/*); do
Strain=$(echo "$InDir" | rev | cut -f1 -d '/' | rev)
Species=$(echo "$InDir" | rev | cut -f2 -d '/' | rev)
RxLR_motif=$(ls analysis/RxLR_effectors/RxLR_EER_regex_finder/$Species/$Strain/"$Strain"*_RxLR_EER_regex.fa | grep -v 'ORF')
RxLR_hmm=$(ls analysis/RxLR_effectors/hmmer_RxLR/$Species/$Strain/"$Strain"*_RxLR_hmmer.fa | grep -v 'ORF')
echo "$Species - $Strain"
echo "Total number of RxLRs in predicted genes:"
cat $RxLR_motif $RxLR_hmm | grep '>' | cut -f1 | sort | uniq | wc -l
echo "Total number of RxLRs shared between prediciton sources:"
cat $RxLR_motif $RxLR_hmm | grep '>' | cut -f1 | sort | uniq -d | wc -l
echo ""
done
P.cactorum - 10300
Total number of RxLRs in predicted genes:
291
Total number of RxLRs shared between prediciton sources:
90
P.capsici - LT1534
Total number of RxLRs in predicted genes:
245
Total number of RxLRs shared between prediciton sources:
0
P.infestans - T30-4
Total number of RxLRs in predicted genes:
870
Total number of RxLRs shared between prediciton sources:
70
P.parisitica - 310
Total number of RxLRs in predicted genes:
609
Total number of RxLRs shared between prediciton sources:
80
P.sojae - 67593
Total number of RxLRs in predicted genes:
531
Total number of RxLRs shared between prediciton sources:
173