-
Notifications
You must be signed in to change notification settings - Fork 13
/
Copy pathrun-BMA-AMD-n145.Rmd
247 lines (180 loc) · 9.64 KB
/
run-BMA-AMD-n145.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
---
title: "Multivariable Mendelian randomization: NMR metabolites as risk factors for age-related macular degeneration (AMD)"
author: "Verena Zuber"
date: "14/8/2021"
output: html_document
---
## Multivariable Mendelian randomization after removing outlier and influential points: NMR metabolites as risk factors for age-related macular degeneration (AMD)
## MR-BMA
```{r setup, include=FALSE}
#options for RMD
options(width = 120)
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(fig.path='figures_bma145/')
```
First we load the relevant packages and functions.
```{r echo = FALSE, message=FALSE, warning=FALSE}
library(ggplot2)
source("summary_mvMR_SSS.R")
source("summary_mvMR_BF.R")
```
## 1. Loading the data and removing missing values and outliers and influential points and outliers from the first iteration
All relevant input data is provided in the Rdata "amd_example".
```{r message=FALSE, warning=FALSE}
load("amd_example")
betaX = amd_example$betaX
amd_beta = amd_example$amd_beta
amd_se = amd_example$amd_se
rs = amd_example$annotate[,1]
genes = amd_example$annotate[,7]
dim(betaX)
colnames(betaX)
rf = colnames(betaX)
```
There was one influential variant in the LIPC gene region and two outliers in the APOE and FUT2 gene region. We are removing these 3 data points for the following analysis.
```{r message=FALSE, warning=FALSE}
LIPC = which(genes == "LIPC")
FUT2 = which(genes == "FUT2")
APOE = which(genes == "APOE")
exclude_vec = c(LIPC,FUT2,APOE)
betaX=betaX[-exclude_vec,]
amd_beta = amd_beta[-exclude_vec]
amd_se = amd_se[-exclude_vec]
rs = rs[-exclude_vec]
genes = genes[-exclude_vec]
```
Next, we perform an inverse variance weighting (IVW) based on the standard error of the amd beta effect estimates prior to subsequent analysis.
```{r message=FALSE, warning=FALSE}
betaX_ivw = betaX / amd_se
amd_beta_ivw = amd_beta / amd_se
dim(betaX_ivw)
length(amd_beta_ivw)
```
## 2. Analysis: Univariable MR
Univariate MR for each metabolite, ranked by minimum p-value.
```{r message=FALSE, warning=FALSE}
beta_coeff = rep(0, ncol(betaX_ivw))
p_val = rep(0, ncol(betaX_ivw))
for(i in 1:ncol(betaX_ivw)){
lm_out = lm(amd_beta_ivw ~ betaX_ivw[,i]-1)
beta_coeff[i] =summary(lm_out)$coeff[1]
p_adjust =summary(lm_out)$coeff[4]*49
if(p_adjust>1){p_val[i] = 1}
else{p_val[i] = p_adjust}
}
mat_out = cbind(colnames(betaX_ivw),beta_coeff, p_val)
sort_p=sort(p_val, index.return=TRUE, decreasing = FALSE)
colnames(mat_out) = c("rf","beta with AMD","p")
mat_out[sort_p$ix,]
```
## 3. Analysis: Bayesian model selection with stochastic search
We create an object of class mvMRInput and we run MR-BMA allowing for models with up to 12 risk factors and a prior probability of 0.1 reflecting an a priori model size of three risk factors. Please note for runtime reasons max_iter is set to 100, please increase to 100k for stable results.
```{r message=FALSE, warning=FALSE}
set.seed(12345)
amd_nmr_input=new("mvMRInput", betaX = as.matrix(betaX_ivw), betaY = as.matrix(amd_beta_ivw), snps=rs, exposure=rf, outcome = "amd")
BMA_output=summarymvMR_SSS(amd_nmr_input,kmin=1,kmax=12, prior_prob=0.1, max_iter=100)
```
The best model with the highest posterior evidence:
```{r message=FALSE, warning=FALSE}
best_model_ind=as.numeric(unlist(strsplit(BMA_output@BestModel, ",")))
rf[best_model_ind]
BMA_output@BestModel_Estimate
```
What are the next best individual models?
```{r message=FALSE, warning=FALSE}
best.model.out = sss.report.best.model(BMA_output, prior_sigma=0.5, top = 10, write.out = TRUE, csv.file.name="amd_best_10models_n145")
best.model.out
```
MR-BMA output: Marginal inclusion probabilities and model-averaged effect (MACE) for each risk factor (top 10)
```{r message=FALSE, warning=FALSE}
mr.bma.out = sss.report.mr.bma(BMA_output, top = 10, write.out = TRUE, csv.file.name="amd_mr_bma_n145")
mr.bma.out
```
MR-BMA permutation: Compute empirical p-values based on permutations. For illustration reasons we only perform 100 permutations here. Ideally, the permutation should be repeated 100k times, Please note this may require a long runtime.
```{r message=FALSE, warning=FALSE}
permute_bma = create.permutations(BMA_output, nrepeat = 100, save.matrix=TRUE, file.name = "permutation_mrBMA.csv")
empirical.p = calculate.p(BMA_output, permute_bma)
empirical.p
```
## 4. Model diagnostics for outliers (Cook's D) and influential points (Q) for all models with pp>0.02
For model diagnostics we consider all individual models that had a posterior probability larger than 0.02. Modify this value (diag_ppthresh) accordingly for your study.
```{r message=FALSE, warning=FALSE}
diag_ppthresh = 0.02
nr_diag = length(which(best.model.out[,2] >= diag_ppthresh))
nr_diag
model_index = names(which(best.model.out[,2] >= diag_ppthresh))
model_index
```
The following loop computes
- the predicted genetic association (hat(beta_Y)) based on the model
- Cook's distance and its threshold for each genetic variant given the model
- the heterogeneity statistic q for each genetic variant given the model
```{r message=FALSE, warning=FALSE}
#initialise
title = rep("1", nr_diag)
predicted_amd = matrix(ncol=nr_diag, nrow=length(amd_beta_ivw))
cD = matrix(ncol=nr_diag, nrow=length(amd_beta_ivw))
cD_thresh = vector(length=nr_diag)
Q = matrix(ncol=nr_diag, nrow=length(amd_beta_ivw))
# fill with loop
for(i in 1:nr_diag){
print(as.numeric(unlist(strsplit(model_index[i], ","))))
if(length(as.numeric(unlist(strsplit(model_index[i], ","))))>1){
betaX_model = do.call(cbind, betaX_ivw[,as.numeric(unlist(strsplit(model_index[i], ",")))])
}
else{
betaX_model = as.matrix(betaX_ivw[,as.numeric(unlist(strsplit(model_index[i], ",")))])
}
title[i] = paste(rf[as.numeric(unlist(strsplit(model_index[i], ",")))],collapse=' + ')
sigma_vec = rep(0.5, ncol(betaX_model))
cD[,i] = cooksD(amd_beta_ivw,betaX_model,sigma_vec)$cooksD
cD_thresh[i] = cooksD(amd_beta_ivw,betaX_model,sigma_vec)$cooksD_thresh
H_fm = betaX_model %*% solve(t(betaX_model) %*% betaX_model + sigma_vec^{-2} ) %*% t(betaX_model)
predicted_amd[,i] = H_fm %*% amd_beta_ivw
Q[,i] = (amd_beta_ivw-predicted_amd[,i])^2
}
```
## Influential points via Cook's D
In multivariable MR diagnostics we look at scatterplots the predicted genetic association (hat(beta_Y)) based on the model including possibly more than one risk factor (note that hat(beta_Y) = theta betaX is a linear combination of the risk factors in the model) against the actually observed genetic associations with the outcome. Influential points (computed by Cooks distance (cD)) are marked by the regions the SNPs fall in where we use the median of a F distribution as threshold (saved in cD_thresh).
```{r, echo=FALSE, include = TRUE, fig.height = 10, fig.width = 11}
for(i in 1:nr_diag){
df = data.frame(x=predicted_amd[,i], y =amd_beta_ivw, cD = cD[,i], genes = genes)
print(ggplot(df, aes(x, y)) + geom_point(aes(colour = cD), size =4) + scale_colour_gradientn(colours = c("white", "orange", "red", "darkred"), values=c(0,0.027,0.1,0.5,1)) + labs(x = "predicted beta amd", y="observed beta amd", colour="Cooks D") + geom_hline(yintercept = 0, linetype="dotted") + geom_vline(xintercept = 0, linetype="dotted") + geom_text(aes(label=ifelse(cD>cD_thresh[i],as.character(genes),'')),hjust=0.5, vjust=-1, size=5) + theme(axis.text.x = element_text(size = 13), axis.text.y = element_text(size = 13), axis.title.x = element_text(size = 18), axis.title.y = element_text(size = 18), legend.text=element_text(size=16),legend.title=element_text(size=18)) + ggtitle(title[i]))
}
```
In numbers, here is a table of Cook's distance of the top models, sorted by maximum Cooks Distance in any of the models considered.
```{r message=FALSE, warning=FALSE}
maxCD=apply(cD, MARGIN=1, FUN=max)
sort.ix = sort.int(maxCD, decreasing=TRUE, index.return=TRUE)
cooksD_tab=cbind(rs,genes,round(cD,digits=3), round(maxCD,digits=3))
#colnames(cooksD_tab)=c("rs","region","cooksD1","cooksD2","cooksD3","cooksD4","max cooksD")
cooksD_tab[sort.ix$ix,][1:30,]
```
Threshold of Cook's distance according to F-statistic with d and n-d degrees of freedom.
```{r message=FALSE, warning=FALSE}
cD_thresh
for(i in 1:nr_diag){
print(genes[which(cD[,i] > cD_thresh[i])])
}
```
## Outliers with Q
The same diagnostic plots as above, but here the color code shows the q-statistic for each genetic variant used as instrumental variable. Outliers (computed by q>10) are marked by the regions the SNPs fall in.
```{r, echo=FALSE, include = TRUE, fig.height = 10, fig.width = 11}
for(i in 1:nr_diag){
df = data.frame(x=predicted_amd[,i], y =amd_beta_ivw, Q = Q[,i], genes = genes)
print(ggplot(df, aes(x, y)) + geom_point(aes(colour = Q), size =4) + scale_colour_gradientn(colours = c("white", "yellow", "green", "darkgreen"), values=c(0,0.027,0.1,0.5,1)) + labs(x = "predicted beta amd", y="observed beta amd", colour="Q") + geom_hline(yintercept = 0, linetype="dotted") + geom_vline(xintercept = 0, linetype="dotted") + geom_text(aes(label=ifelse(Q>10,as.character(genes),'')),hjust=0.5, vjust=-1, size=5) + theme(axis.text.x = element_text(size = 13), axis.text.y = element_text(size = 13), axis.title.x = element_text(size = 18), axis.title.y = element_text(size = 18), legend.text=element_text(size=16),legend.title=element_text(size=18)) + ggtitle(title[i]))
}
```
In numbers, here is a table of the q-statistic of the top models, sorted by the maximum in any of the models considered.
```{r message=FALSE, warning=FALSE}
maxQ=apply(Q, MARGIN=1, FUN=max)
sort.ix = sort.int(maxQ, decreasing=TRUE, index.return=TRUE)
Q_tab=cbind(rs,genes,round(Q,digits=3), round(maxQ,digits=3))
Q_tab[sort.ix$ix,][1:30,]
```
<!--
require(knitr)
require(markdown)
knit("run-BMA-AMD-n145.Rmd")
markdownToHTML('run-BMA-AMD-n145.md', 'run-BMA-AMD-n145.html', options=c("use_xhml"))
-->