Skip to content

Latest commit

 

History

History
223 lines (176 loc) · 9.8 KB

establish_variant_differences.md

File metadata and controls

223 lines (176 loc) · 9.8 KB

Work performed in directory

/home/groups/harrisonlab/project_files/venturia

After calling SNPs and stuctural variants, it may be useful to contrast variant population frequencies among different groups of individuals (populations) to e.g. further zoom in on candidate effector genes likely involved in resistance in different cultivars.

V. inequalis SNPs

First, create a cut-down VCF file containing only individuals of interest (Worcester and Bramley populations) and then filter to remove SNPs with too many missing genotypes.

source /home/sobczm/bin/marias_profile

vcflib=/home/sobczm/bin/vcflib/bin

$vcflib/vcfremovesamples SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3.vcf 083 096 097 098 101 106 119 >Ash_farm_172_pacbio_contigs_unmasked_3_bw.vcf 
$vcflib/vcfremovesamples SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3.vcf 049 172 173 182 190 196 197 202 >Ash_farm_172_pacbio_contigs_unmasked_3_bc.vcf
$vcflib/vcfremovesamples SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3.vcf 007 024 025 030 044 199 >Ash_farm_172_pacbio_contigs_unmasked_3_cw.vcf

mv Ash_farm_172_pacbio_contigs_unmasked_3_bw.vcf SNP_calling/
mv Ash_farm_172_pacbio_contigs_unmasked_3_bc.vcf SNP_calling/
mv Ash_farm_172_pacbio_contigs_unmasked_3_cw.vcf SNP_calling/

vcftools=/home/sobczm/bin/vcftools/bin

$vcftools/vcftools --vcf SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_bw.vcf  --max-missing 0.95 --recode --out SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_bw_filtered
$vcftools/vcftools --vcf SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_bc.vcf  --max-missing 0.95 --recode --out SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_bc_filtered
$vcftools/vcftools --vcf SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_cw.vcf  --max-missing 0.95 --recode --out SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_cw_filtered

Create custom SnpEff genome database

SnpEff=/home/sobczm/bin/snpEff
nano $SnpEff/snpEff.config

Add the following lines to the section with databases:

#---
# EMR Databases
#----
# Fus2 genome
Fus2v1.0.genome : Fus2
# Bc16 genome
Bc16v1.0.genome: BC-16
# P414 genome
P414v1.0.genome: 414
# 172_pacbio genome
172_pacbiov1.0.genome: 172_pacbio

Collect input files

Reference=$(ls repeat_masked/v.inaequalis/172_pacbio/filtered_contigs_repmask/172_pacbio_contigs_unmasked.fa)
Gff=$(ls gene_pred/codingquary/v.inaequalis/172_pacbio/final/final_genes_appended.gff3)
SnpEff=/home/sobczm/bin/snpEff
mkdir $SnpEff/data/172_pacbiov1.0
cp $Reference $SnpEff/data/172_pacbiov1.0/sequences.fa
cp $Gff $SnpEff/data/172_pacbiov1.0/genes.gff

Build database using GFF3 annotation

java -jar $SnpEff/snpEff.jar build -gff3 -v 172_pacbiov1.0

Annotate VCF files

for a in $(ls SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_*_filtered.recode.vcf); do
    echo $a
    filename=$(basename "$a")
    SnpEff=/home/sobczm/bin/snpEff
    java -Xmx4g -jar $SnpEff/snpEff.jar -v -ud 0 172_pacbiov1.0 $a > ${filename%.vcf}_annotated.vcf
    mv snpEff_genes.txt SNP_calling/snpEff_genes_${filename%.vcf}.txt
    mv snpEff_summary.html SNP_calling/snpEff_summary_${filename%.vcf}.html
    mv *_filtered* SNP_calling/.
done
for a in $(ls SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_*_filtered.recode.vcf | grep -v -e 'bw'); do
    echo $a
    filename=$(basename "$a")
    SnpEff=/home/sobczm/bin/snpEff
    java -Xmx4g -jar $SnpEff/snpEff.jar -v -ud 0 172_pacbiov1.0 $a > ${filename%.vcf}_annotated.vcf
    mv snpEff_genes.txt SNP_calling/snpEff_genes_${filename%.vcf}.txt
    mv snpEff_summary.html SNP_calling/snpEff_summary_${filename%.vcf}.html
    mv *_filtered* SNP_calling/.
done

2018 Assembly re-run

Due to re-assembly of genome for submission to NCBI need to re-run the annotation of VCF files.

Build new genome database

Collect input files

Reference=$(ls repeat_masked/v.inaequalis/172_pacbio/filtered_contigs_repmask/172_pacbio_contigs_unmasked.fa)
Gff=$(ls gene_pred/final/v.inaequalis/172_pacbio/final_2018/final_genes_appended_renamed.gff3)
SnpEff=/home/sobczm/bin/snpEff
mkdir $SnpEff/data/172_pacbiov1.0
cp $Reference $SnpEff/data/172_pacbiov1.0/sequences.fa
cp $Gff $SnpEff/data/172_pacbiov1.0/genes.gff
mkdir SNP_calling/2018

Build database using GFF3 annotation

java -jar $SnpEff/snpEff.jar build -gff3 -v 172_pacbiov1.0

Annotate VCF files

for a in $(ls SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_*_filtered.recode.vcf); do
    echo $a
    filename=$(basename "$a")
    SnpEff=/home/sobczm/bin/snpEff
    java -Xmx4g -jar $SnpEff/snpEff.jar -v -ud 0 172_pacbiov1.0 $a > ${filename%.vcf}_annotated.vcf
    mv snpEff_genes.txt SNP_calling/snpEff_genes_${filename%.vcf}.txt
    mv snpEff_summary.html SNP_calling/snpEff_summary_${filename%.vcf}.html
    mv *_filtered* SNP_calling/2018/.
done

Groups of isolates from different cultivars described, 8 isolates for Worcester (pop1 below) and 6 isolates for Bramley (pop2 below); two Bramley isolates lost due to poor sequencing (036 and 057)

Important: check script options below, in order to use the correct ones in a given analysis (e.g. "ply" argument for ploidy).

scripts=/home/sobczm/bin/popgen/summary_stats
python $scripts/vcf_find_difference_pop.py --vcf SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_bw_filtered.recode.vcf --out SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_bw_filtered_fixed.vcf --ply 1 --pop1 202,,182,,173,,190,,172,,197,,196,,049 --pop2 024,,030,,007,,025,,044,,199 --thr 0.95

Groups of isolates from different cultivars described, 7 isolates for Cox (pop1 below) and 6 isolates for Bramley (pop2 below); one Cox (118) and two Bramley (036 and 057) isolates lost due to poor sequencing

scripts=/home/sobczm/bin/popgen/summary_stats
python $scripts/vcf_find_difference_pop.py --vcf SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_bc_filtered.recode.vcf --out SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_bc_filtered_fixed.vcf --ply 1 --pop1 083,,096,,097,,098,,101,,106,,119 --pop2 024,,030,,007,,025,,044,,199 --thr 0.95

Groups of isolates from different cultivars described, 7 isolates for Cox (pop1 below) and 8 isolates for Worcester (pop2 below); one Cox (118) isolate lost due to poor sequencing

scripts=/home/sobczm/bin/popgen/summary_stats
python $scripts/vcf_find_difference_pop.py --vcf SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_cw_filtered.recode.vcf --out SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_cw_filtered_fixed.vcf --ply 1 --pop1 083,,096,,097,,098,,101,,106,,119 --pop2 202,,182,,173,,190,,172,,197,,196,,049 --thr 0.95

2018 re-annotation SNPs in genes and non-synonymous

input=SNP_calling/2018
scripts=/home/passet/git_repos/scripts/popgen

cp /home/sobczm/bin/snpEff/data/172_pacbiov1.0 $input
cp /home/groups/harrisonlab/project_files/venturia/SNP_calling/Ash_farm_172_pacbio_contigs_unmasked_3_bw_filtered_fixed.vcf $input

cd $input
#Annotate the file with fixed SNPs to filter out only non-synonymous variants
$scripts/summary_stats/annotate_snps_genome.sh Ash_farm_172_pacbio_contigs_unmasked_3_bw_filtered_fixed.vcf 172_pacbiov1.0

mkdir vcf_files
mv *.vcf vcf_files

cd ../..

#Prefilter the GFF files for different classes of genes to only keep lines showing gene annotation, and not individual exons etc.
awk '$3=="gene"' gene_pred/final/v.inaequalis/172_pacbio/final_2018/final_genes_appended_renamed.gff3 >$input/final_genes_appended_renamed_gene.gff

#Simple intersect of different classes of genes with fixed SNPs and SVs.
#
for vcf_file in $input/vcf_files/Ash_farm_172_pacbio_contigs_unmasked_3_bw_filtered_fixed_gene.vcf $input/vcf_files/Ash_farm_172_pacbio_contigs_unmasked_3_bw_filtered_fixed_coding.vcf $input/vcf_files/Ash_farm_172_pacbio_contigs_unmasked_3_bw_filtered_fixed_nonsyn.vcf
do
for gff_file in $input/final_genes_appended_renamed_gene.gff 
do
vcf=$(basename $vcf_file)
intersectBed -wb -a $vcf_file -b $gff_file > ${gff_file}_${vcf%.vcf}_overlap
done
done