-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathHOWTOXtX_BAYPASS.Rmd
192 lines (141 loc) · 8.13 KB
/
HOWTOXtX_BAYPASS.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
---
title: "XtX subpopulation differentiation (BayPass)"
output:
html_document:
fig_caption: yes
highlight: zenburn
theme: cerulean
toc: yes
toc_depth: 4
pdf_document:
fig_caption: yes
highlight: zenburn
toc: yes
toc_depth: 4
word_document: default
date: '`r format(Sys.Date(),"%d/%b/%Y")`'
---
```{r knitr setup, include=FALSE, eval=TRUE, echo=FALSE, warning=FALSE}
library(knitr)
knitr::opts_chunk$set(eval=TRUE, cache=FALSE, message=FALSE, warning=FALSE,
comment = "", results="markup")
```
This document describes how SNP genotypes from the Spanish Barley Core Collection (SBCC) were checked and processed for further use with [BayPass v2.1](http://www1.montpellier.inra.fr/CBGP/software/baypass)
to obtain XtX statistics about subpopulation differentiation.
In the next steps we convert the SNPs to fit the appropriate formats for downstream analyses.
SNPs conserve sample order of climate data.
We shall write a **SNPSFILE** containing allele counts across populations/barleys, where each SNP is represented by two lines in the file, with the counts of allele 1 on the first line and the counts for allele 2 on the second, and son on. The counts of allele 1 and allele 2 are assumed to sum to the sample size typed at this SNP in this population (i.e. the total sample size excluding missing data).
This means that **PAV markers must be converted to SNP-like markers** if they are to be used.
First, we shall write a **genotyping file** containing allele counts across populations/barleys, where each biallelic SNP is represented by two columns per row, with the counts of allele 1 on the first line and the counts for allele 2 on the second. The counts of allele 1 and allele 2 are assumed to sum to the sample size typed at this SNP in this population (i.e. the total sample size excluding missing data).
Second, SNPs will be used to compute a **covariance matrix** which captures the background similarity between landraces.
## Formatting SNPs
We wrote a Perl script ([*SNP2BAYPASS.pl*](./SNP2BAYPASS.pl)) to carry out these formatting tasks. The initial set of 9,920 Infinium and GBS markers in file [*9920_SNPs_SBCC_50K.tsv*](raw/9920_SNPs_SBCC_50K.tsv), was converted as follows, accepting **10% missing data per position** and accepting only biallelic loci with $MAF \geq 0.05$ (n=8457):
```{r, engine='bash'}
./SNP2BAYPASS.pl raw/9920_SNPs_SBCC_50K.tsv SBCC_order.txt \
SBCC_9K_BayPass.tsv 2> SBCC_9K_BayPass.log
head SBCC_9K_BayPass.log
echo ...
tail SBCC_9K_BayPass.log
```
The resulting [*SBCC_9K_BayPass.tsv*](./SBCC_9K_BayPass.tsv) file is the **genotyping file** required by BayPass. Note that file [*SBCC_9K_BayPass.annot.tsv*](./SBCC_9K_BayPass.annot.tsv) with matching fullnames of SNPs (columns) is produced alongside.
As BayPass does not handle missing covariate (environmental) data,
produce also a subset of accessions for which complete env data is available, excluding SBCC036:
```{r complete_env, engine='bash'}
./SNP2BAYPASS.pl raw/9920_SNPs_SBCC_50K.tsv SBCC_order_complete_env.txt \
SBCC_9K_BayPass_complete_env.tsv 2> SBCC_9K_BayPass.complete_env.log
head SBCC_9K_BayPass.complete_env.log
echo ...
tail SBCC_9K_BayPass.complete_env.log
# skip column 30 in environfile, which corresponded to SBCC036
perl -lane 'print join("\t",@F[0 .. 28, 30 .. $#F])' SBCC_environfile.tsv \
> SBCC_environfile_complete_env.tsv
```
## Obtaining a covariance matrix (omega)
As mentioned earlier, we need to estimate a covariance matrix by invoking
a system-available binary of BayPass, called 'baypass'.
The following command computes a matrix based on all 9K Infinium/GBS markers:
```{r, engine='bash', eval=FALSE}
baypass -npop 135 -gfile SBCC_9K_BayPass.tsv -outprefix SBCC_9K_BayPass \
-nthreads 30 -seed 12345 > SBCCmatrix.BayPass.log
mv SBCC_9K_BayPass_* SBCCmatrix.BayPass.log BayPass
# after leaving out SBCC036 to get a complete environmental data set
baypass -npop 134 -gfile SBCC_9K_BayPass_complete_env.tsv -outprefix SBCC_9K_BayPass_complete_env \
-nthreads 30 -seed 12345 > SBCCmatrix.BayPass.complete_env.log
mv SBCC_9K_BayPass_complete_env_* SBCCmatrix.BayPass.complete_env.log BayPass
```
## Population differentiation XtX analysis
A similar Perl script ([*SNP2BAYPASSsubpops.pl*](./SNP2BAYPASSsubpops.pl)) was written to format the same SNP set for the analysis of the 4 subpopulations of SBCC barleys defined by STRUCTURE (see protocol [HOWTOstructure](./HOWTOstructure.html)). The set of 9,920 Infinium and GBS markers in file [*9920_SNPs_SBCC_50K.tsv*](raw/9920_SNPs_SBCC_50K.tsv), was converted as follows, accepting **10% missing data per position** and accepting only biallelic loci with $MAF \geq 0.05$ (n=8457):
```{r, engine='bash'}
./SNP2BAYPASSsubpops.pl raw/9920_SNPs_SBCC_50K.tsv SBCC_order.txt \
raw/SBCC_Kinship.full.tsv SBCC_9K_BayPass.subpop.tsv 2> SBCC_9K_BayPass.subpop.log
head SBCC_9K_BayPass.subpop.log
echo ...
tail SBCC_9K_BayPass.subpop.log
```
The resulting [SBCC_9K_BayPass.subpop.tsv](SBCC_9K_BayPass.subpop.tsv) file can then be further processed with BayPass. In the example 30 parallel threads are created:
```{r, engine='bash', eval=FALSE}
baypass -npop 4 -gfile SBCC_9K_BayPass.subpop.tsv -outprefix SBCC_9K_BayPass_subpop \
-nthreads 30 -seed 12345 > SBCCmatrix.BayPass.subpop.log
mv SBCC_9K_BayPass_subpop* SBCCmatrix.BayPass.subpop.log BayPass
```
This command takes only a few minutes to run with defaults params:
```{r, engine='bash'}
tail BayPass/SBCCmatrix.BayPass.subpop.log
```
We can now read the produced omega covariance matrix and plot the Xtx subpopulation differentiation estimates as follows:
```{r XtXplot}
omega_subpop = as.matrix(x=
read.table(file="BayPass/SBCC_9K_BayPass_subpop_mat_omega.out", header=F) )
subpop.names=c("1","2","3","4")
dimnames(omega_subpop)=list(subpop.names,subpop.names)
XtX_BP = read.table(file="BayPass/SBCC_9K_BayPass_subpop_summary_pi_xtx.out",
header = T)
plot(XtX_BP$M_XtX)
```
Now, we can use the same omega covariance matrix to generate theoretically neutral SNPs to help in the interpretation of results and call outliers:
```{r outliers, warning=FALSE, eval=FALSE}
source('./BayPass/baypass_utils.R')
# get estimates of the Pi Beta distribution inferred from real data
pi.beta.coef=read.table("BayPass/SBCC_9K_BayPass_subpop_summary_beta_params.out",
h=T)$Mean
#upload the real SNP dataset to obtain total allele count
real.data<-geno2YN("SBCC_9K_BayPass.subpop.tsv")
sim = simulate.baypass(omega.mat=omega_subpop,nsnp=1000,sample.size=real.data$NN,
beta.pi=pi.beta.coef,pi.maf=0,suffix="sim")
```
We shall now run BayPass in "core mode" with the resulting neutral genotypes:
```{r, engine='bash', eval=FALSE}
mv *.sim BayPass
baypass -npop 4 -gfile BayPass/G.sim -outprefix sim \
-nthreads 30 -seed 12345 > sim.subpop.log
mv sim* BayPass
```
Finally we can take XtX values of neutral, simulated SNPs to infer a significance threshold and then plot the real XtX estimates:
```{r XtXplot_threshold}
maptype = 'cM';
#maptype = 'bp';
outTSV = paste("SBCC_9K_BayPass_subpop.xtx.",maptype,".tsv", sep="")
#get the pod XtX
sim.xtx=read.table("BayPass/sim_summary_pi_xtx.out",h=T)$M_XtX
#compute the 99th percentile
sim99=quantile(sim.xtx,probs=0.99)
sim99
# read real XtX estimates
XtX_BP = read.table(file="BayPass/SBCC_9K_BayPass_subpop_summary_pi_xtx.out",
header = T)
# read in SNP names and metadata
metadata = read.table(file="SBCC_9K_BayPass.subpop.log",header=F)
colnames(metadata) = c("MRK","SNPidentifier","allele1","allele2","missing","MAF")
# read genetic OR bp positions of 9K markers
if(maptype == "bp"){
genmap = read.table(file="raw/9920_SNPs_SBCC_bp_map2017.curated.tsv",header=T)
} else{
genmap = read.table(file="raw/9920_SNPs_SBCC_cM_map2017.curated.tsv",header=T)
}
# merge and export data
XtX_BP = merge( XtX_BP, metadata, by="MRK")
XtX_BP = merge( XtX_BP, genmap, by="SNPidentifier")
sortedXtXs = XtX_BP[with(XtX_BP, order(chr, cM, decreasing = F)),]
#sortedXtXs = XtX_BP[with(XtX_BP, order(chr, bp, decreasing = F)),]
write.table(sortedXtXs[,c("SNPidentifier","M_XtX","chr",maptype)],file=outTSV,sep="\t",row.names=F,col.names=T,quote=F)
```