Work performed in directory
/home/groups/harrisonlab/project_files/venturia
After calling SNPs and stuctural variants, it may be useful to contrast variant population frequencies among different groups of individuals (populations) to e.g. further zoom in on candidate effector genes likely involved in resistance in different cultivars.
First, create a cut-down VCF file containing only individuals of interest (Worcester and Bramley populations) and then filter to remove SNPs with too many missing genotypes.
source /home/sobczm/bin/marias_profile
vcflib=/home/sobczm/bin/vcflib/bin
$vcflib/vcfremovesamples SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3.vcf 083 096 097 098 101 106 119 >Ash_farm_172_pacbio_contigs_unmasked_3_bw.vcf
$vcflib/vcfremovesamples SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3.vcf 049 172 173 182 190 196 197 202 >Ash_farm_172_pacbio_contigs_unmasked_3_bc.vcf
$vcflib/vcfremovesamples SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3.vcf 007 024 025 030 044 199 >Ash_farm_172_pacbio_contigs_unmasked_3_cw.vcf
mv Ash_farm_172_pacbio_contigs_unmasked_3_bw.vcf SNP_calling/
mv Ash_farm_172_pacbio_contigs_unmasked_3_bc.vcf SNP_calling/
mv Ash_farm_172_pacbio_contigs_unmasked_3_cw.vcf SNP_calling/
vcftools=/home/sobczm/bin/vcftools/bin
$vcftools/vcftools --vcf SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_bw.vcf --max-missing 0.95 --recode --out SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_bw_filtered
$vcftools/vcftools --vcf SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_bc.vcf --max-missing 0.95 --recode --out SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_bc_filtered
$vcftools/vcftools --vcf SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_cw.vcf --max-missing 0.95 --recode --out SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_cw_filtered
SnpEff=/home/sobczm/bin/snpEff
nano $SnpEff/snpEff.config
Add the following lines to the section with databases:
#---
# EMR Databases
#----
# Fus2 genome
Fus2v1.0.genome : Fus2
# Bc16 genome
Bc16v1.0.genome: BC-16
# P414 genome
P414v1.0.genome: 414
# 172_pacbio genome
172_pacbiov1.0.genome: 172_pacbio
Reference=$(ls repeat_masked/v.inaequalis/172_pacbio/filtered_contigs_repmask/172_pacbio_contigs_unmasked.fa)
Gff=$(ls gene_pred/codingquary/v.inaequalis/172_pacbio/final/final_genes_appended.gff3)
SnpEff=/home/sobczm/bin/snpEff
mkdir $SnpEff/data/172_pacbiov1.0
cp $Reference $SnpEff/data/172_pacbiov1.0/sequences.fa
cp $Gff $SnpEff/data/172_pacbiov1.0/genes.gff
java -jar $SnpEff/snpEff.jar build -gff3 -v 172_pacbiov1.0
for a in $(ls SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_*_filtered.recode.vcf); do
echo $a
filename=$(basename "$a")
SnpEff=/home/sobczm/bin/snpEff
java -Xmx4g -jar $SnpEff/snpEff.jar -v -ud 0 172_pacbiov1.0 $a > ${filename%.vcf}_annotated.vcf
mv snpEff_genes.txt SNP_calling/snpEff_genes_${filename%.vcf}.txt
mv snpEff_summary.html SNP_calling/snpEff_summary_${filename%.vcf}.html
mv *_filtered* SNP_calling/.
done
for a in $(ls SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_*_filtered.recode.vcf | grep -v -e 'bw'); do
echo $a
filename=$(basename "$a")
SnpEff=/home/sobczm/bin/snpEff
java -Xmx4g -jar $SnpEff/snpEff.jar -v -ud 0 172_pacbiov1.0 $a > ${filename%.vcf}_annotated.vcf
mv snpEff_genes.txt SNP_calling/snpEff_genes_${filename%.vcf}.txt
mv snpEff_summary.html SNP_calling/snpEff_summary_${filename%.vcf}.html
mv *_filtered* SNP_calling/.
done
Due to re-assembly of genome for submission to NCBI need to re-run the annotation of VCF files.
Reference=$(ls repeat_masked/v.inaequalis/172_pacbio/filtered_contigs_repmask/172_pacbio_contigs_unmasked.fa)
Gff=$(ls gene_pred/final/v.inaequalis/172_pacbio/final_2018/final_genes_appended_renamed.gff3)
SnpEff=/home/sobczm/bin/snpEff
mkdir $SnpEff/data/172_pacbiov1.0
cp $Reference $SnpEff/data/172_pacbiov1.0/sequences.fa
cp $Gff $SnpEff/data/172_pacbiov1.0/genes.gff
mkdir SNP_calling/2018
java -jar $SnpEff/snpEff.jar build -gff3 -v 172_pacbiov1.0
for a in $(ls SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_*_filtered.recode.vcf); do
echo $a
filename=$(basename "$a")
SnpEff=/home/sobczm/bin/snpEff
java -Xmx4g -jar $SnpEff/snpEff.jar -v -ud 0 172_pacbiov1.0 $a > ${filename%.vcf}_annotated.vcf
mv snpEff_genes.txt SNP_calling/snpEff_genes_${filename%.vcf}.txt
mv snpEff_summary.html SNP_calling/snpEff_summary_${filename%.vcf}.html
mv *_filtered* SNP_calling/2018/.
done
Groups of isolates from different cultivars described, 8 isolates for Worcester (pop1 below) and 6 isolates for Bramley (pop2 below); two Bramley isolates lost due to poor sequencing (036 and 057)
Important: check script options below, in order to use the correct ones in a given analysis (e.g. "ply" argument for ploidy).
scripts=/home/sobczm/bin/popgen/summary_stats
python $scripts/vcf_find_difference_pop.py --vcf SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_bw_filtered.recode.vcf --out SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_bw_filtered_fixed.vcf --ply 1 --pop1 202,,182,,173,,190,,172,,197,,196,,049 --pop2 024,,030,,007,,025,,044,,199 --thr 0.95
Groups of isolates from different cultivars described, 7 isolates for Cox (pop1 below) and 6 isolates for Bramley (pop2 below); one Cox (118) and two Bramley (036 and 057) isolates lost due to poor sequencing
scripts=/home/sobczm/bin/popgen/summary_stats
python $scripts/vcf_find_difference_pop.py --vcf SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_bc_filtered.recode.vcf --out SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_bc_filtered_fixed.vcf --ply 1 --pop1 083,,096,,097,,098,,101,,106,,119 --pop2 024,,030,,007,,025,,044,,199 --thr 0.95
Groups of isolates from different cultivars described, 7 isolates for Cox (pop1 below) and 8 isolates for Worcester (pop2 below); one Cox (118) isolate lost due to poor sequencing
scripts=/home/sobczm/bin/popgen/summary_stats
python $scripts/vcf_find_difference_pop.py --vcf SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_cw_filtered.recode.vcf --out SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_cw_filtered_fixed.vcf --ply 1 --pop1 083,,096,,097,,098,,101,,106,,119 --pop2 202,,182,,173,,190,,172,,197,,196,,049 --thr 0.95
input=SNP_calling/2018
scripts=/home/passet/git_repos/scripts/popgen
cp /home/sobczm/bin/snpEff/data/172_pacbiov1.0 $input
cp /home/groups/harrisonlab/project_files/venturia/SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_bw_filtered_fixed.vcf $input
cd $input
#Annotate the file with fixed SNPs to filter out only non-synonymous variants
$scripts/summary_stats/annotate_snps_genome.sh Ash_farm_172_pacbio_contigs_unmasked_3_bw_filtered_fixed.vcf 172_pacbiov1.0
mkdir vcf_files
mv *.vcf vcf_files
cd ../..
#Prefilter the GFF files for different classes of genes to only keep lines showing gene annotation, and not individual exons etc.
awk '$3=="gene"' gene_pred/final/v.inaequalis/172_pacbio/final_2018/final_genes_appended_renamed.gff3 >$input/final_genes_appended_renamed_gene.gff
#Simple intersect of different classes of genes with fixed SNPs and SVs.
#
for vcf_file in $input/vcf_files/Ash_farm_172_pacbio_contigs_unmasked_3_bw_filtered_fixed_gene.vcf $input/vcf_files/Ash_farm_172_pacbio_contigs_unmasked_3_bw_filtered_fixed_coding.vcf $input/vcf_files/Ash_farm_172_pacbio_contigs_unmasked_3_bw_filtered_fixed_nonsyn.vcf
do
for gff_file in $input/final_genes_appended_renamed_gene.gff
do
vcf=$(basename $vcf_file)
intersectBed -wb -a $vcf_file -b $gff_file > ${gff_file}_${vcf%.vcf}_overlap
done
done