HOWTOXtX_BAYPASS.Rmd

---
title: "XtX subpopulation differentiation (BayPass)"
output:
  html_document:
    fig_caption: yes
    highlight: zenburn
    theme: cerulean
    toc: yes
    toc_depth: 4
  pdf_document:
    fig_caption: yes
    highlight: zenburn
    toc: yes
    toc_depth: 4
  word_document: default
date: '`r format(Sys.Date(),"%d/%b/%Y")`'
---

```{r knitr setup, include=FALSE,  eval=TRUE, echo=FALSE, warning=FALSE}
library(knitr)
knitr::opts_chunk$set(eval=TRUE, cache=FALSE, message=FALSE, warning=FALSE, 
                      comment = "", results="markup")
```

This document describes how SNP genotypes from the Spanish Barley Core Collection (SBCC) were checked and processed for further use with [BayPass v2.1](http://www1.montpellier.inra.fr/CBGP/software/baypass) 
to obtain XtX statistics about subpopulation differentiation.

In the next steps we convert the SNPs to fit the appropriate formats for downstream analyses. 
SNPs conserve sample order of climate data. 

We shall write a **SNPSFILE** containing allele counts across populations/barleys, where each SNP is represented by two lines in the file, with the counts of allele 1 on the first line and the counts for allele 2 on the second, and son on. The counts of allele 1 and allele 2 are assumed to sum to the sample size typed at this SNP in this population (i.e. the total sample size excluding missing data). 

This means that **PAV markers must be converted to SNP-like markers** if they are to be used.

First, we shall write a **genotyping file** containing allele counts across populations/barleys, where each biallelic SNP is represented by two columns per row, with the counts of allele 1 on the first line and the counts for allele 2 on the second. The counts of allele 1 and allele 2 are assumed to sum to the sample size typed at this SNP in this population (i.e. the total sample size excluding missing data). 

Second, SNPs will be used to compute a **covariance matrix** which captures the background similarity between landraces. 


## Formatting SNPs

We wrote a Perl script ([*SNP2BAYPASS.pl*](./SNP2BAYPASS.pl)) to carry out these formatting tasks. The initial set of 9,920 Infinium and GBS markers in file [*9920_SNPs_SBCC_50K.tsv*](raw/9920_SNPs_SBCC_50K.tsv), was converted as follows, accepting **10% missing data per position** and accepting only biallelic loci with $MAF \geq 0.05$ (n=8457):
```{r, engine='bash'}
./SNP2BAYPASS.pl raw/9920_SNPs_SBCC_50K.tsv SBCC_order.txt \
  SBCC_9K_BayPass.tsv 2> SBCC_9K_BayPass.log 

head SBCC_9K_BayPass.log 
echo ...
tail SBCC_9K_BayPass.log
```
The resulting [*SBCC_9K_BayPass.tsv*](./SBCC_9K_BayPass.tsv) file is the **genotyping file** required by BayPass. Note that file [*SBCC_9K_BayPass.annot.tsv*](./SBCC_9K_BayPass.annot.tsv) with matching fullnames of SNPs (columns) is produced alongside.

As BayPass does not handle missing covariate (environmental) data, 
produce also a subset of accessions for which complete env data is available, excluding SBCC036:

```{r complete_env, engine='bash'}
./SNP2BAYPASS.pl raw/9920_SNPs_SBCC_50K.tsv SBCC_order_complete_env.txt \
  SBCC_9K_BayPass_complete_env.tsv 2> SBCC_9K_BayPass.complete_env.log 

head SBCC_9K_BayPass.complete_env.log 
echo ...
tail SBCC_9K_BayPass.complete_env.log

# skip column 30 in environfile, which corresponded to SBCC036
perl -lane 'print join("\t",@F[0 .. 28, 30 .. $#F])' SBCC_environfile.tsv \
  > SBCC_environfile_complete_env.tsv
```

## Obtaining a covariance matrix (omega)

As mentioned earlier, we need to estimate a covariance matrix by invoking  
a system-available binary of BayPass, called 'baypass'. 
The following command computes a matrix based on all 9K Infinium/GBS markers:

```{r, engine='bash', eval=FALSE}
baypass -npop 135 -gfile SBCC_9K_BayPass.tsv -outprefix SBCC_9K_BayPass \
  -nthreads 30 -seed 12345 > SBCCmatrix.BayPass.log

mv SBCC_9K_BayPass_* SBCCmatrix.BayPass.log BayPass

# after leaving out SBCC036 to get a complete environmental data set
baypass -npop 134 -gfile SBCC_9K_BayPass_complete_env.tsv -outprefix SBCC_9K_BayPass_complete_env \
  -nthreads 30 -seed 12345 > SBCCmatrix.BayPass.complete_env.log

mv SBCC_9K_BayPass_complete_env_* SBCCmatrix.BayPass.complete_env.log BayPass
```

## Population differentiation XtX analysis

A similar Perl script ([*SNP2BAYPASSsubpops.pl*](./SNP2BAYPASSsubpops.pl)) was written to format the same SNP set for the analysis of the 4 subpopulations of SBCC barleys defined by STRUCTURE (see protocol [HOWTOstructure](./HOWTOstructure.html)). The set of 9,920 Infinium and GBS markers in file [*9920_SNPs_SBCC_50K.tsv*](raw/9920_SNPs_SBCC_50K.tsv), was converted as follows, accepting **10% missing data per position** and accepting only biallelic loci with $MAF \geq 0.05$ (n=8457):

```{r, engine='bash'}
./SNP2BAYPASSsubpops.pl raw/9920_SNPs_SBCC_50K.tsv SBCC_order.txt \
  raw/SBCC_Kinship.full.tsv SBCC_9K_BayPass.subpop.tsv 2> SBCC_9K_BayPass.subpop.log 

head SBCC_9K_BayPass.subpop.log 
echo ...
tail SBCC_9K_BayPass.subpop.log
```

The resulting [SBCC_9K_BayPass.subpop.tsv](SBCC_9K_BayPass.subpop.tsv) file can then be further processed with BayPass. In the example 30 parallel threads are created:

```{r, engine='bash', eval=FALSE}
baypass -npop 4 -gfile SBCC_9K_BayPass.subpop.tsv -outprefix SBCC_9K_BayPass_subpop \
  -nthreads 30 -seed 12345 > SBCCmatrix.BayPass.subpop.log

mv SBCC_9K_BayPass_subpop* SBCCmatrix.BayPass.subpop.log BayPass
```

This command takes only a few minutes to run with defaults params:
```{r, engine='bash'}
tail BayPass/SBCCmatrix.BayPass.subpop.log
```

We can now read the produced omega covariance matrix and plot the Xtx subpopulation differentiation estimates as follows:
```{r XtXplot}
omega_subpop = as.matrix(x=
  read.table(file="BayPass/SBCC_9K_BayPass_subpop_mat_omega.out", header=F) )
subpop.names=c("1","2","3","4")
dimnames(omega_subpop)=list(subpop.names,subpop.names)

XtX_BP = read.table(file="BayPass/SBCC_9K_BayPass_subpop_summary_pi_xtx.out",
                    header = T)
plot(XtX_BP$M_XtX)
```

Now, we can use the same omega covariance matrix to generate theoretically neutral SNPs to help in the interpretation of results and call outliers:

```{r outliers, warning=FALSE, eval=FALSE}
source('./BayPass/baypass_utils.R')

# get estimates of the Pi Beta distribution inferred from real data
pi.beta.coef=read.table("BayPass/SBCC_9K_BayPass_subpop_summary_beta_params.out",
                        h=T)$Mean

#upload the real SNP dataset to obtain total allele count
real.data<-geno2YN("SBCC_9K_BayPass.subpop.tsv")

sim = simulate.baypass(omega.mat=omega_subpop,nsnp=1000,sample.size=real.data$NN,
                  beta.pi=pi.beta.coef,pi.maf=0,suffix="sim")
```

We shall now run BayPass in "core mode" with the resulting neutral genotypes:

```{r, engine='bash', eval=FALSE}
mv *.sim BayPass
baypass -npop 4 -gfile BayPass/G.sim -outprefix sim \
  -nthreads 30 -seed 12345 > sim.subpop.log

mv sim* BayPass
```

Finally we can take XtX values of neutral, simulated SNPs to infer a significance threshold and then plot the real XtX estimates:

```{r XtXplot_threshold}
maptype = 'cM'; 
#maptype = 'bp';
outTSV = paste("SBCC_9K_BayPass_subpop.xtx.",maptype,".tsv", sep="")

#get the pod XtX
sim.xtx=read.table("BayPass/sim_summary_pi_xtx.out",h=T)$M_XtX

#compute the 99th percentile
sim99=quantile(sim.xtx,probs=0.99)
sim99 

# read real XtX estimates
XtX_BP = read.table(file="BayPass/SBCC_9K_BayPass_subpop_summary_pi_xtx.out",
                    header = T)

# read in SNP names and metadata
metadata = read.table(file="SBCC_9K_BayPass.subpop.log",header=F)
colnames(metadata) = c("MRK","SNPidentifier","allele1","allele2","missing","MAF")

# read genetic OR bp positions of 9K markers
if(maptype == "bp"){
  genmap = read.table(file="raw/9920_SNPs_SBCC_bp_map2017.curated.tsv",header=T)
} else{
  genmap = read.table(file="raw/9920_SNPs_SBCC_cM_map2017.curated.tsv",header=T)
}

# merge and export data
XtX_BP = merge( XtX_BP, metadata, by="MRK")
XtX_BP = merge( XtX_BP, genmap, by="SNPidentifier")
sortedXtXs = XtX_BP[with(XtX_BP, order(chr, cM, decreasing = F)),]
#sortedXtXs = XtX_BP[with(XtX_BP, order(chr, bp, decreasing = F)),]
write.table(sortedXtXs[,c("SNPidentifier","M_XtX","chr",maptype)],file=outTSV,sep="\t",row.names=F,col.names=T,quote=F)
```