-
Notifications
You must be signed in to change notification settings - Fork 13
/
Copy pathrun-other-methods-AMD-n145.Rmd
200 lines (143 loc) · 6.94 KB
/
run-other-methods-AMD-n145.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
---
title: "Multivariable Mendelian randomization: NMR metabolites as risk factors for age-related macular degeneration (AMD)"
author: "Verena Zuber"
date: "11/4/2019"
output: html_document
---
## Multivariable Mendelian randomization: NMR metabolites as risk factors for age-related macular degeneration (AMD)
## Competing approaches n=145
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(fig.path='figures_others/')
```
```{r echo = FALSE, message=FALSE, warning=FALSE}
library(ggplot2)
library(reshape2)
library(MRChallenge2019)
source("mvMR_ivw-methods.R")
source("mvMR_regression.R")
source("mvMR_lars.R")
source("mvMR_glmnet.R")
```
## 1. Loading the data and removing missing values and outliers and influential points from the first iteration
There are two missing values in the AMD data, let's remove them, which reduces the sample size to n=148. Additionally, we remove outliers (FUT2 and APOE) and influential points (LIPC) from the first iteration which reduces the numbers of genetic variants to n=145.
```{r message=FALSE, warning=FALSE}
remove.na=which(is.na(Challenge_dat$beta_amd)==TRUE)
LIPC = which(Challenge_dat$predicted_gene == "LIPC")
FUT2 = which(Challenge_dat$predicted_gene == "FUT2")
APOE = which(Challenge_dat$predicted_gene == "APOE")
exclude_variants = c(LIPC,FUT2,APOE)
exclude_vec = sort(c(remove.na,exclude_variants))
Challenge_dat.na_rm = Challenge_dat[-exclude_vec,]
Challenge_dat.na_rm2 = Challenge_dat[-remove.na,]
```
The outcome of interest is AMD, the respective summary-level data on genetic association with AMD (from Fritsch et al http://amdgenetics.org/) and its respective standard error is stored in the vectors beta_amd and se_amd.
```{r message=FALSE, warning=FALSE}
colnames(Challenge_dat)
amd_beta = Challenge_dat.na_rm$beta_amd
amd_se = Challenge_dat.na_rm$se_amd
hist(amd_beta)
```
As risk factors we consider NMR metabolites, of which we use summary-level data for genetic associations from a large meta-analysis as described in Kettunen et al
http://computationalmedicine.fi/data#NMR_GWAS . We work with the genetic associations saved in bmat_in and their respective p-values.
```{r message=FALSE, warning=FALSE}
bmat_in = Challenge_dat.na_rm[,32:149]
pmat_in = Challenge_dat.na_rm2[,150:267]
```
One of the biggest challenges of multivariable MR is the design of a meaningful study, in particular the choice of both the genetic variants and the risk factors.
A risk factor can be included into the analysis if the following criteria (RF1-RF2) hold:
-RF1 Relevance: The risk factor needs to be strongly instrumented by at least one genetic variant included as instrumental variable.
-RF2 No multi-collinearity: The genetic associations of any risk factor included cannot be linearly explained by the genetic associations of any other risk factor or by the combinations of genetic associations of multiple other risk factors included in the analysis.
In order to avoid multi-collinearity (RF2), we have pre-select and include only lipoprotein measurements on total cholesterol content, triglyceride content, and particle diameter. For the various fatty acid measurements we only included total fatty acids.
```{r message=FALSE, warning=FALSE}
index = c(1,2 ,3,4,5,6,11,12,15,21,22,23,24,26,27,28,29,34,35,36,47,53,55,56,57,58,59,70,76,79,80,83,84,87,92,93,94,95,96,97,98,99,100,101,102,108,111,115,118)
bmat_in2 = bmat_in[,index]
pmat_in2 = pmat_in[,index]
```
Additionally we need to make sure to include only risk factor needs to be strongly instrumented by at least one genetic variant included as instrumental variable (RF1). In the manuscript we suggest to include only risk factors that have at least one genome-wide significant instrument. This reduces the risk factor set to d=30 potential risk factors we include into the analysis. Note, we do include the same risk factors as in the first iteration.
```{r message=FALSE, warning=FALSE}
minp=apply(pmat_in2, MARGIN=2, FUN=min)
which(minp>5e-8)
betaX=bmat_in2[,which(minp<5e-8)]
dim(betaX)
colnames(betaX)
rf = colnames(betaX)
```
We further save the rs identifier and the annotated gene by VEP by ENSEMBL https://www.ensembl.org/info/docs/tools/vep/index.html.
```{r message=FALSE, warning=FALSE}
rs = Challenge_dat.na_rm$rsid
genes = Challenge_dat.na_rm$predicted_gene
```
Next, we perform an inverse variance weighting (IVW) based on the standard error of the amd beta effect estimates prior to subsequent analysis.
```{r message=FALSE, warning=FALSE}
betaX_ivw = betaX / amd_se
amd_beta_ivw = amd_beta / amd_se
```
Here is the correlation structure between genetic associations in the data set.
```{r, echo=FALSE,results='hide',fig.keep='all', fig.height = 15, , fig.width = 16}
cor_beta = cor(betaX_ivw)
melted_cormat = melt(cor_beta)
print(
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Correlation \n between betas") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1))+
coord_fixed()+
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank()
)
)
```
Data input in format mvMRInput
```{r message=FALSE, warning=FALSE}
amd_nmr_input=new("mvMRInput", betaX = as.matrix(betaX_ivw), betaY = as.matrix(amd_beta_ivw), snps=rs, exposure=rf, outcome = "AMD")
```
## 2. IVW
```{r message=FALSE, warning=FALSE}
ivw=mvmr_ivw(amd_nmr_input, intercept = FALSE)
score=abs(ivw@Estimate/ivw@StdError)
sort_object=sort.int(score, index.return=TRUE, decreasing=TRUE)
cbind(rf, ivw@Estimate, score)[sort_object$ix,]
```
## 3. IVW - regression (based on the lm function)
```{r message=FALSE, warning=FALSE}
wlm = mvmr_wreg(amd_nmr_input)
score= -log10(wlm@Pval)
sort_object=sort.int(score, index.return=TRUE, decreasing=TRUE)
cbind(rf, wlm@Estimate, wlm@Pval)[sort_object$ix,]
```
## 4. Lars (based on the lars package)
```{r message=FALSE, warning=FALSE}
set.seed(123)
lars_cv=lars_mr(amd_nmr_input, cv=TRUE, intercept = FALSE)
lars_cv
score=abs(lars_cv@Estimate)
sort_object=sort.int(score, index.return=TRUE, decreasing=TRUE)
cbind(rf, lars_cv@Estimate)[sort_object$ix,]
```
## 5. Lasso (based on the glmnet package)
```{r message=FALSE, warning=FALSE}
set.seed(123)
lasso=glmnet_l1_mr(amd_nmr_input, cv=TRUE, cv.param="lambda.min")
score=abs(lasso@Estimate)
sort_object=sort.int(score, index.return=TRUE, decreasing=TRUE)
cbind(rf, lasso@Estimate)[sort_object$ix,]
```
## 6. Elastic net (based on the glmnet package)
```{r message=FALSE, warning=FALSE}
set.seed(123)
enet=glmnet_enet_mr(amd_nmr_input, cv=TRUE, cv.param="lambda.min")
score=abs(enet@Estimate)
sort_object=sort.int(score, index.return=TRUE, decreasing=TRUE)
cbind(rf, enet@Estimate)[sort_object$ix,]
```
<!--
require(knitr)
require(markdown)
knit("run-other-methods-AMD-n145.Rmd")
markdownToHTML('run-other-methods-AMD-n145.md', 'run-other-methods-AMD-n145.html', options=c("use_xhml"))
-->