<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>Notes</title>
<description></description>
<link>http://sashagusev.github.io/</link>
<atom:link href="http://sashagusev.github.io/feed.xml" rel="self" type="application/rss+xml" />
<pubDate>Thu, 16 Nov 2017 10:41:54 -0500</pubDate>
<lastBuildDate>Thu, 16 Nov 2017 10:41:54 -0500</lastBuildDate>
<generator>Jekyll v2.4.0</generator>
<item>
<title>Interpretation of TWAS and its vulnerabilities</title>
<description><p>The recent pre-print of <a href="#Wainberg:2017">(Wainberg et al. 2017)</a> examined the TWAS approach for prioritizing target genes in GWAS studies. The work has relevance to a broad class of methods for “post-GWAS” analysis and spurred interesting discussions. As an author of one of these methods <a href="#Gusev:2016">(Gusev et al. 2016)</a>, I thought it would be useful to talk about what TWAS is doing; why; and how it fits into our current understanding of complex traits.</p>
<p><strong>UPDATE: The authors respond <a href="https://medium.com/@Vulnerabilities/response-to-interpretation-of-twas-and-its-vulnerabilities-bc073e606e2c">here</a> with thoughtful points and useful additional context.</strong> Please read the whole thing, but I specifically want to echo their point that understanding how well TWAS (and other methods) prioritize causal genes is a critical question right now. I think one way to move toward an answer is by establishing clear metrics for what we consider a successful prioritization (ideally metrics that are agnostic of QTL evidence) and systematically evaluating the performance of all commonly used approaches (including manual curation).</p>
<h2 id="what-is-twas">What is TWAS?</h2>
<p>TWAS is a test for significant association between the cis component of gene expression and the GWAS trait. It is part of a broad class of approaches that identify relationships between QTL and GWAS to find target genes. These approaches have many applications: risk prediction; pathway enrichment; causal inference between traits; drug repurposing; identifying disease-associated epigenetic features; and gene prioritization. In their pre-print, Wainberg et al. (hereafter WEA) explore the application of TWAS to causal gene prioritization in a GWAS of Crohn’s disease and LDL cholesterol. First, they observe loci where TWAS identifies an association at the presumed causal gene, but also at other genes. Second, they switch to less appropriate tissues and find that TWAS often no longer observes an association for the presumed causal gene but observes associations to other genes. They conclude that the approach is therefore “invalid” for finding causal genes. This conclusion is driven by a misinterpretation of TWAS as a causal inference test rather than a test for association. Subsequent claims of “vulnerability”, “false discovery”, and “instability” stem from this misinterpretation. TWAS is neither vulnerable nor invulnerable to discovering non-causal genes because it is not a test for causality. Just as a GWAS study identifies significantly associated non-causal SNPs which need to be fine-mapped, so too a TWAS study will find significantly associated non-causal genes. WEA do not present any cases of false <em>associations</em>, which would be a true vulnerability for TWAS. Although WEA focus on our TWAS/FUSION method because it’s so easy to use [kidding!], they conclude that these vulnerabilities apply to many methods including SMR, PrediXcan, Sherlock, coloc, QTLMatch, eCAVIAR, enloc, and RTC, which I’ll refer to here collectively as QTL/GWAS methods.</p>
<h2 id="what-are-qtlgwas-methods-doing">What are QTL/GWAS methods doing?</h2>
<p>The goal of these methods is to estimate properties of the genetic relationship between gene expression and GWAS trait. They fall into two broad categories:</p>
<ol>
<li><strong>TWAS/FUSION, PrediXcan/MetaXcan, SMR</strong>: A test for <em>significant genetic correlation</em> between cis expression and GWAS. A helpful intuition is that when the cis expression of a gene is driven entirely by a single eQTL $i$, the resulting test statistic will be equal to the raw GWAS association of $i$. (Note: approximately for SMR because it incorporates the uncertainty on the eQTL … but SMR also restricts to eQTLs with very low uncertainty)</li>
<li><strong>coloc, eCAVIAR, enloc</strong>: An estimator of the <em>posterior probability of colocalization</em>, where colocalization is defined as one (or more) shared causal variants between the expression and GWAS. In the case of a single eQTL $i$, this probability will approach 1.0 only if the probability of the causal configuration containing $i$ for the GWAS is also high. In this example, TWAS association is necessary but not sufficient for colocalization. A secondary difference is that colocalization maxes out at a posterior of 1.0, regardless of the underlying effect-size, whereas TWAS also captures the strength of the genetic association. In this way the two approaches are complementary.</li>
</ol>
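<p>The single-eQTL intuition from the first category can be checked numerically. This is a minimal sketch, assuming the commonly used summary-statistic form of the test, Z = w'z / sqrt(w' LD w), where w are the expression weights, z the GWAS Z-scores, and LD the SNP correlation matrix. The function name <code>twas_z</code> and the three-SNP locus are made up for illustration and are not taken from any real data set or software package.</p>

```python
import numpy as np

def twas_z(weights, gwas_z, ld):
    """Summary-statistic TWAS Z-score: w'z / sqrt(w' LD w)."""
    weights = np.asarray(weights, dtype=float)
    gwas_z = np.asarray(gwas_z, dtype=float)
    ld = np.asarray(ld, dtype=float)
    return float(weights @ gwas_z / np.sqrt(weights @ ld @ weights))

# Hypothetical three-SNP locus: GWAS Z-scores and an LD (correlation) matrix.
gwas_z = np.array([1.2, 4.5, 3.1])
ld = np.array([
    [1.0, 0.3, 0.2],
    [0.3, 1.0, 0.6],
    [0.2, 0.6, 1.0],
])

# Expression driven entirely by the second SNP: only one non-zero weight.
single_eqtl = np.array([0.0, 0.8, 0.0])
z_single = twas_z(single_eqtl, gwas_z, ld)

# Expression driven by two SNPs: the test aggregates both GWAS signals,
# scaled by the LD between them.
multi_eqtl = np.array([0.0, 0.5, 0.5])
z_multi = twas_z(multi_eqtl, gwas_z, ld)

print(z_single, z_multi)
```

<p>With a single non-zero weight the weight cancels out of the ratio, so <code>z_single</code> is just the GWAS Z-score of that SNP (4.5 here), with the sign flipped if the eQTL weight is negative.</p>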
<p>To my knowledge, none of these methods claim to perform causal gene discovery (see footnote). Causal inference is challenging. It is especially challenging when you have a single snapshot of the phenotypes you’re interested in and a relatively small number of independent variants. So these approaches estimate support for less stringent gene-disease relationships: how strongly are the variants associated with expression also associated with the trait? what’s the probability that the same causal variants are shared between expression and the trait?</p>
<h2 id="why-do-we-care-about-these-genes-if-they-are-not-causal">Why do we care about these genes if they are not causal?</h2>
<p>While the QTL/GWAS methods do not perform causal inference, we still believe they are useful in prioritizing genes for experimental follow-up relative to other GWAS approaches. This is challenging to quantify because there are so few known examples of true-positive causal genes, and even fewer known examples of true-negatives. One nice aspect of the WEA analysis is that they tried to find such genes in the literature and then asked whether they show up as TWAS hits. In fact, most of the known genes DO come up as significant TWAS associations when evaluated in the expected tissue (though, keep in mind that many of these genes were initially identified precisely because they had eQTLs) and the number of null genes with associations is relatively low (assuming they are all truly null). I would humbly recommend the alternative title “<em>TWAS is actually pretty good at identifying association for causal genes</em>”. [<strong>UPDATE:</strong> <em>The authors point out in their <a href="https://medium.com/@Vulnerabilities/response-to-interpretation-of-twas-and-its-vulnerabilities-bc073e606e2c">response</a> that they only evaluated those genes that were TWAS associations and therefore did not assess sensitivity</em>]. That said, there are now multiple examples in the literature showing that these methods are doing something useful:</p>
<ul>
<li>In <a href="#Gusev:2016">(Gusev et al. 2016)</a> we showed that genes identified by a height TWAS were more correlated with measured height (out of sample) than the nearest gene or the best eQTL gene for a GWAS hit. That TWAS genes are more strongly associated with phenotype suggests that they are more likely to be causal than nearest genes.</li>
<li>Similarly, in <a href="#Gusev:2017">(Gusev et al. 2016)</a> we showed that a TWAS-based risk prediction model was more strongly correlated with schizophrenia than one based on corresponding top SNPs.</li>
<li>In <a href="#Mancuso:2017">(Mancuso et al. 2017)</a> we showed that the TWAS genes are more significantly associated than the nearest gene, have stronger eQTL effects at the index SNP, and yielded more consistent cross-trait genetic correlations than SNPs in the locus.</li>
<li><a href="#Fromer:2016">(Fromer et al. 2016)</a> used a combination of Sherlock/COLOC to select three genes for experimental validation in zebrafish and demonstrated that altering expression had a causal effect on neurodevelopment.</li>
<li><a href="#Barbeira:2017">(Barbeira et al. 2017)</a> showed that genes causally implicated in rare recessive diseases by ClinVar were more strongly associated with related common diseases by PrediXcan. Retrospectively, the genes identified by PrediXcan are thus more likely to be causal for rare disease than un-selected genes.</li>
<li><a href="#Marigorta:2017">(Marigorta et al. 2017)</a> showed that overall expression of COLOC-selected genes is much more strongly associated with disease and is predictive of future complications (in contrast to genetic risk scores). This is a beautiful study and, in my opinion, the clearest evidence that QTL/GWAS methods prioritize genes with a causal effect on trait and have prognostic value.</li>
</ul>
<h2 id="when-does-twas-have-more-power-than-gwas">When does TWAS have more power than GWAS?</h2>
<p>A lot of the work we did in <a href="#Gusev:2016">(Gusev et al. 2016)</a> was about understanding when the TWAS model does and does not increase power to identify new associations (again, the focus was on discovery and not fine-mapping the causal gene). So to summarize:</p>
<ul>
<li>TWAS will do better when multiple variants associated with expression are also associated with the disease. The increase in power comes from aggregating those effects into a single test. This includes the case when there is a single shared causal variant but it is partially tagged by multiple variants in the GWAS.</li>
<li>TWAS will do worse when the genetic effect on expression is independent of the trait, in which case TWAS becomes just an arbitrary transformation of the local SNPs.</li>
</ul>
<p>We specifically showed that “novel” genes identified by a TWAS of lipids (genes that did not overlap genome-wide significant SNPs) replicated better in a larger lipid GWAS study than non-TWAS loci at the same level of significance. We again demonstrated this for the educational attainment phenotype in <a href="#Mancuso:2017">(Mancuso et al. 2017)</a> where 4 out of 4 “novel” TWAS genes found in an early GWAS had genome-wide significant SNPs in a larger, more recent GWAS. I like this result a lot and I think it’s a gold-standard for showing that an association method improves power in real data. But I didn’t include it in the above list because it could still be consistent with non-causal TWAS associations (if, for example, eQTLs and causal variants tend to colocalize for uninteresting reasons such as proximity to promoters/enhancers, high minor allele frequency, high LD, etc).</p>
<p>In hindsight I think we focused a bit too much on novel locus discovery given the many other cool things people are now doing with TWAS, but there continue to be instances where TWAS yields target genes that GWAS cannot find <a href="#Gao:2017">(Gao et al. 2017; Hoffman et al. 2017)</a>.</p>
<h2 id="can-we-get-closer-to-causality">Can we get closer to causality?</h2>
<p>Language aside, WEA are correct to point out that TWAS can identify genes that do not appear to be causal in instances where GWAS SNPs regulate multiple genes. As with GWAS, this motivates fine-mapping methods that probabilistically reduce the list of putative target genes without losing the causal gene. WEA are pessimistic about this research direction: “<em>Is it possible, despite the limitations of TWAS, to somehow perform statistical fine-mapping and determine the causal gene or genes? We believe that it is not, even in principle.</em>” But I disagree: if the genetically driven causal expression effect is observed in the tested reference panel then it will have a genetic association with the disease and (in principle) be identifiable by TWAS. Together with my colleague Nick Mancuso, we recently <a href="https://ep70.eventpilotadmin.com/web/page.php?page=IntHtml&amp;project=ASHG17&amp;id=170122028">presented</a> such a fine-mapping approach, which identifies credible sets of genes that contain the causal gene at a pre-defined level (e.g. 90%), and others are working on similar ideas. As in GWAS fine-mapping, there is an assumption that the causal effect is observed in the study, but the results from WEA in LDL and Crohn’s suggest that this is true enough in real data to be worth investigating.</p>
<h2 id="how-important-is-the-right-cell-type">How important is the right cell-type?</h2>
<p>The other observation WEA describe is the phenomenon of “instability”, where TWAS associations are observed in one reference tissue but not in other reference tissues. The meaning of “instability” is fuzzy (well, negatively fuzzy), but such a TWAS association is no more unstable than an eQTL that is significant in some tissues and not others. Indeed, in <a href="#Gusev:2016">(Gusev et al. 2016)</a> we showed that genes with significant predictors can differ between studies even in the same tissue (Table S2). We believed that this was primarily due to QC/environmental differences between studies because predictors translated well across studies (Table S3). More recent work by <a href="#Liu:2017">(Liu et al. 2017)</a> using data from GTEx estimated a mean genetic correlation of 0.75 across 11 tissues. In other words, the optimal predictor built in the “wrong” tissue will, on average, have a correlation of 0.75 to the optimal predictor in the right tissue.</p>
<p>An interesting caveat demonstrated in the latest <a href="#GTEX:2017">(GTEx Consortium et al. 2017)</a> paper is that disease associated variants are enriched at tissue-specific eQTLs. However, even when we restrict the GTEx genetic correlation analysis to gene sets that are constrained in ExAC/ClinGen or near known GWAS hits, the mean genetic correlation across pairs of tissues remains high (albeit significantly lower than other genes), as shown in this (unpublished) figure across all pairs of GTEx tissues:</p>
<p><img src="http://sashagusev.github.io/images/plot_twas_rg.png" alt="genomic TWAS power" /></p>
<p>While it is important to be reminded of tissue differences, the cross-tissue genetic similarity is high enough to justify using all of the data that is available to us. The amount of genetic sharing is much lower, however, for trans/distal effects on expression, which brings us to a big question:</p>
<h2 id="does-twas-matter-in-an-omnigenic-age">Does TWAS matter in an omnigenic age?</h2>
<p>The recent perspective from <a href="#Boyle:2017">(Boyle, Li, and Pritchard 2017)</a> proposed a model of disease where tens-of-thousands of variants have minor (and biologically uninteresting) effects on disease, which cascade through a small set of “core” genes. Recent expression studies have shown that nearly every gene has a cis-eQTL in some tissue, so, optimistically, TWAS can help prioritize the core genes at large-effect GWAS loci.</p>
<p>But let’s imagine an extreme case where core genes are not under cis regulation. Does TWAS provide any benefits? My answer is “No, but wait”. Under this architecture TWAS would identify thousands of cis-regulated non-core genes, which provide no deep biological meaning when interrogated experimentally. This is the premise of futility some folks see in the omnigenic model. But wait! As expression studies increase in size they will have power to predict the <em>full component</em> of expression rather than just cis effects. For core genes, TWAS performed on these genome-wide components will aggregate the thousands of individually minuscule effects that are mediated by non-core genes. These genome-wide associations between expression and trait should be highly significant and robust. With sufficient sample size, causal inference tests across the hundreds/thousands of variants that hit core genes could lend further support to the underlying association.</p>
<p>How practical is genome-wide prediction of expression? Let’s do some simple math. Assume a trait with 500k genotyped individuals and SNP-heritability=0.60, 200 core genes and core gene expression SNP-heritability=0.15. Assuming heritability is uniformly distributed across each core gene and a very simple polygenic predictor of expression from <a href="#Daetwyler:2008">(Daetwyler, Villanueva, and Woolliams 2008)</a> we can ask how significant the resulting TWAS association would be as a function of the training size:</p>
<p><img src="http://sashagusev.github.io/images/plot_twas_power.png" alt="genomic TWAS power" /></p>
<p>The dashed line is genome-wide significance, which is overkill for this analysis, but even for that cutoff all it takes is 70,000 individuals with measured expression. For comparison, GEO currently contains 1.2M human samples. Multiple studies are gearing up for whole-genome sequencing of 100,000 samples, with costs comparable to transcriptomics. It is, at most, a matter of years before we start to have an answer.</p>
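<p>For readers who want to play with the inputs, here is a rough sketch of this kind of calculation. It is not the code behind the figure. It assumes the Daetwyler et al. accuracy formula r² = N·h² / (N·h² + M), an equal share of trait heritability per core gene, and an expected association Z-score of roughly sqrt(N_gwas × variance explained). The effective number of independent segments <code>m_eff</code> is a hypothetical value chosen for illustration; the absolute numbers depend heavily on it, so the output should not be expected to match the plot.</p>

```python
import math

def prediction_r2(n_train, h2_expr, m_eff):
    """Daetwyler et al. (2008): squared correlation between the
    polygenic predictor and the true genetic value of expression."""
    return n_train * h2_expr / (n_train * h2_expr + m_eff)

def expected_twas_z(n_train, n_gwas=500_000, h2_trait=0.60, n_core=200,
                    h2_expr=0.15, m_eff=60_000):
    """Approximate expected TWAS Z-score for one core gene.

    m_eff (independent genomic segments) is an illustrative assumption,
    not a value given in the post.
    """
    # Assumption: each core gene's genetic expression component explains
    # an equal share of the trait SNP-heritability.
    per_gene_var = h2_trait / n_core
    # The predictor captures only a fraction r2 of that component.
    explained = per_gene_var * prediction_r2(n_train, h2_expr, m_eff)
    # Expected Z-score is roughly sqrt(N_gwas * variance explained).
    return math.sqrt(n_gwas * explained)

for n_train in (1_000, 10_000, 70_000):
    print(n_train, round(expected_twas_z(n_train), 2))
```

<p>Changing <code>m_eff</code>, the number of core genes, or the expression heritability shows how sensitive the required training size is to each assumption.</p>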
<hr />
<p><em>Thanks to Bogdan Pasaniuc for helpful comments</em></p>
<p><strong>Footnote.</strong>
Statements on causality from <a href="#Gusev:2016">(Gusev et al. 2016)</a>:</p>
<ul>
<li>“We developed a new approach to identify genes whose expression is significantly associated to complex traits in individuals without directly measured expression levels (Methods) … Our approach can be conceptualized as a test for significant cis-genetic correlation between expression and trait (see Results).”</li>
<li>“Our proposed method shares conceptual similarities with 2-sample Mendelian randomization approaches that aim to identify causal relations between traits using genetic variation predictions as a randomizer. However, while Mendelian randomization is intended to quantify the total causal effect, our method has the less strict goal of identifying significant associations.”</li>
<li>“An alternative confounder arises from independent effects on phenotype and expression at the same SNP/tag (Figure 2G, Methods). Such instances could be indistinguishable from the desired causal model (Methods) without analyzing individual-level data, though we believe they are still biologically interesting cases of co-localization.”</li>
</ul>
<p><strong>References</strong></p>
<ol class="bibliography"><li><span id="Wainberg:2017">Wainberg, Michael, Nasa Sinnott-Armstrong, David Knowles, David Golan, Raili Ermel, Arno Ruusalepp, Thomas Quertermous, et al. 2017. “Vulnerabilities Of Transcriptome-Wide Association Studies.” <i>BioRxiv</i>. Cold Spring Harbor Laboratory. doi:10.1101/206961.</span></li>
<li><span id="Gusev:2016">Gusev, Alexander, Arthur Ko, Huwenbo Shi, Gaurav Bhatia, Wonil Chung, Brenda W J H Penninx, Rick Jansen, et al. 2016. “Integrative Approaches for Large-Scale Transcriptome-Wide Association Studies.” <i>Nature Genetics</i> 48 (3): 245–52. doi:10.1038/ng.3506.</span></li>
<li><span id="Gusev:2017">Gusev, Alexander, Nick Mancuso, Hilary K Finucane, Yakir Reshef, Lingyun Song, Alexias Safi, Edwin Oh, et al. 2016. “Transcriptome-Wide Association Study of Schizophrenia and Chromatin Activity Yields Mechanistic Disease Insights.” <i>BioRxiv</i>. Cold Spring Harbor Laboratory. doi:10.1101/067355.</span></li>
<li><span id="Mancuso:2017">Mancuso, Nicholas, Huwenbo Shi, Pagé Goddard, Gleb Kichaev, Alexander Gusev, and Bogdan Pasaniuc. 2017. “Integrating Gene Expression With Summary Association Statistics to Identify Genes Associated with 30 Complex Traits.” <i>American Journal of Human Genetics</i> 100 (3): 473–87. doi:10.1016/j.ajhg.2017.01.031.</span></li>
<li><span id="Fromer:2016">Fromer, Menachem, Panos Roussos, Solveig K Sieberts, Jessica S Johnson, David H Kavanagh, Thanneer M Perumal, Douglas M Ruderfer, et al. 2016. “Gene Expression Elucidates Functional Impact of Polygenic Risk for Schizophrenia.” <i>Nature Neuroscience</i> 19 (11): 1442–53. doi:10.1038/nn.4399.</span></li>
<li><span id="Barbeira:2017">Barbeira, Alvaro N., Scott P. Dickinson, Jason M. Torres, Rodrigo Bonazzola, Jiamao Zheng, Eric S. Torstenson, Heather E. Wheeler, et al. 2017. “Exploring The Phenotypic Consequences of Tissue Specific Gene Expression Variation Inferred from GWAS Summary Statistics.” <i>BioRxiv</i>. Cold Spring Harbor Laboratory. doi:10.1101/045260.</span></li>
<li><span id="Marigorta:2017">Marigorta, Urko M, Lee A Denson, Jeffrey S Hyams, Kajari Mondal, Jarod Prince, Thomas D Walters, Anne Griffiths, et al. 2017. “Transcriptional Risk Scores Link GWAS to EQTLs and Predict Complications in Crohn’s Disease.” <i>Nature Genetics</i> 49 (10): 1517–21. doi:10.1038/ng.3936.</span></li>
<li><span id="Gao:2017">Gao, Guimin, Brandon L Pierce, Olufunmilayo I Olopade, Hae Kyung Im, and Dezheng Huo. 2017. “Trans-Ethnic Predicted Expression Genome-Wide Association Analysis Identifies a Gene for Estrogen Receptor-Negative Breast Cancer.” <i>PLoS Genetics</i> 13 (9): e1006727. doi:10.1371/journal.pgen.1006727.</span></li>
<li><span id="Hoffman:2017">Hoffman, Joshua D, Rebecca E Graff, Nima C Emami, Caroline G Tai, Michael N Passarelli, Donglei Hu, Scott Huntsman, et al. 2017. “Cis-EQTL-Based Trans-Ethnic Meta-Analysis Reveals Novel Genes Associated with Breast Cancer Risk.” <i>PLoS Genetics</i> 13 (3): e1006690. doi:10.1371/journal.pgen.1006690.</span></li>
<li><span id="Liu:2017">Liu, Xuanyao, Hilary K Finucane, Alexander Gusev, Gaurav Bhatia, Steven Gazal, Luke O’Connor, Brendan Bulik-Sullivan, et al. 2017. “Functional Architectures Of Local and Distal Regulation of Gene Expression in Multiple Human Tissues.” <i>American Journal of Human Genetics</i> 100 (4): 605–16. doi:10.1016/j.ajhg.2017.03.002.</span></li>
<li><span id="GTEX:2017">GTEx Consortium, Data Analysis &amp;Coordinating Center (LDACC)—Analysis Working Group Laboratory, Statistical Methods groups—Analysis Working Group, Enhancing GTEx (eGTEx) groups, NIH Common Fund, NIH/NCI, NIH/NHGRI, et al. 2017. “Genetic Effects on Gene Expression across Human Tissues.” <i>Nature</i> 550 (7675): 204–13. doi:10.1038/nature24277.</span></li>
<li><span id="Boyle:2017">Boyle, Evan A, Yang I Li, and Jonathan K Pritchard. 2017. “An Expanded View Of Complex Traits: From Polygenic to Omnigenic.” <i>Cell</i> 169 (7): 1177–86. doi:10.1016/j.cell.2017.05.038.</span></li>
<li><span id="Daetwyler:2008">Daetwyler, Hans D, Beatriz Villanueva, and John A Woolliams. 2008. “Accuracy Of Predicting the Genetic Risk of Disease Using a Genome-Wide Approach.” <i>PloS One</i> 3 (10): e3395. doi:10.1371/journal.pone.0003395.</span></li></ol>
</description>
<pubDate>Sun, 29 Oct 2017 00:00:00 -0400</pubDate>
<link>http://sashagusev.github.io/2017-10/twas-vulnerabilities.html</link>
<guid isPermaLink="true">http://sashagusev.github.io/2017-10/twas-vulnerabilities.html</guid>
<category>papers</category>
</item>
<item>
<title>Hot takes: interesting papers from June</title>
<description><h2 id="transcriptomeproteome">Transcriptome/Proteome</h2>
<p><strong><a href="http://www.nature.com/nature/journal/v534/n7608/full/nature18270.html">Defining the consequences of genetic variation on a proteome-wide scale, Chick et al. Nature</a></strong></p>
<p>“This study quantified both protein and transcript abundance in a genetically diverse population of mice, mapping their genetic architecture … We conclude that most local pQTL affected both protein and transcript abundance, consistent with a transcriptional mode of regulation. However, distant pQTL affected protein abundance independently of the transcript, consistent with a post-transcriptional mode of regulation.”</p>
<p><strong><a href="http://science.sciencemag.org/content/352/6293/1586">Neuronal subtypes and diversity revealed by single-nucleus RNA sequencing of the human brain, Lake et al. Science</a></strong></p>
<p>“Our results demonstrate that postmortem SNS [single-nucleus RNA sequencing] can identify expected and previously unidentified neuronal subtypes that provide insight into brain function through distinct profiles of activity-defining genes”</p>
<h2 id="epigenome">Epigenome</h2>
<p><strong><a href="http://www.nature.com/nrg/journal/vaop/ncurrent/abs/nrg.2016.59.html">The molecular hallmarks of epigenetic control, Allis et al. Nat Rev Genet</a></strong></p>
<p>“Here, we provide a personal perspective on the development of epigenetics, from its historical origins to what we define as ‘the modern era of epigenetic research’.”</p>
<p><strong><a href="http://journals.plos.org/plosgenetics/article?id=10.1371%2Fjournal.pgen.1006105">Epigenome-wide Association Studies and the Interpretation of Disease -Omics, Birney et al. PLOS Genet</a></strong></p>
<p>“As EWAS have generally been only rarely performed with concurrent genotyping of the same individuals or transcriptional studies of the same cells, we have no way of knowing whether the positive results of EWAS to date are testing the starting hypothesis that genuine epigenetic changes occur within a subset of cells in the population. Instead, the results may be due to residual meta-epigenomic effects of cell subtypes or attributable to untested influences of genomic or transcriptomic variability. This being the case, and with similar caveats affecting transcriptomic studies, no EWAS to date can be said to be fully interpretable … Analytically, insights into DNA sequence variants upon DNA methylation (methylation quantitative trait loci, mQTLs) for the cell type studied will allow approaches to be developed to account for this major influence upon the epigenome. One particular approach, two-step mendelian randomization, is being applied in prospective and case/control EWAS, building on the non-modifiable nature of germline genetic variation to provide causal anchors within a causal inference setting.”</p>
<p><strong><a href="http://www.nature.com/nature/journal/vaop/ncurrent/full/nature18606.html">The landscape of accessible chromatin in mammalian preimplantation embryos, Wu et al. Nature</a></strong></p>
<p>“Here we provide a genome-wide survey of accessible chromatin in the mouse preimplantation embryos using ATAC-seq. A fundamental question in preimplantation development is to what extent gene expression is linked to epigenome reprogramming. Our data indicate that gene activation and establishment of open chromatin could occur, at least in part, through different pathways from those for epigenetic modification reprogramming.”</p>
<h2 id="heritabilitycorrelation">Heritability/correlation</h2>
<p><strong><a href="http://www.cell.com/ajhg/abstract/S0002-9297(16)30148-3">Contrasting the Genetic Architecture of 30 Complex Traits from Summary Association Data, Shi et al. AJHG</a></strong></p>
<p>“Here, we introduce methods that estimate the total trait variance explained by the typed variants at a single locus in the genome (local SNP heritability) from genome-wide association study (GWAS) summary data while accounting for linkage disequilibrium among variants. We applied our estimator to ultra-large-scale GWAS summary data of 30 common traits and diseases to gain insights into their local genetic architecture. First, we found that common SNPs have a high contribution to the heritability of all studied traits. Second, we identified traits for which the majority of the SNP heritability can be confined to a small percentage of the genome. Third, we identified GWAS risk loci where the entire locus explains significantly more variance in the trait than the GWAS reported variants. Finally, we identified loci that explain a significant amount of heritability across multiple traits.”</p>
<p><strong><a href="http://www.cell.com/ajhg/pdfExtended/S0002-9297(16)30135-5">Transethnic Genetic-Correlation Estimates from Summary Statistics, Brown et al. AJHG</a></strong></p>
<p>“We have developed transethnic genetic-effect and genetic-impact correlations and provided a method for estimating these quantities on the basis of only summary-level GWAS information and suitable reference panels … In all phenotypes analyzed, the genetic correlation was significantly different from both 0 and 1.”</p>
<p><strong><a href="http://www.cell.com/ajhg/pdf/S0002-9297(16)30132-X.pdf">Imputing Phenotypes for Genome-wide Association Studies, Hormozdiari et al. AJHG</a></strong></p>
<p>“We propose an approach called phenotype imputation that allows one to perform a GWAS on a phenotype that is difficult to collect. Our approach leverages the correlation structure between multiple phenotypes to impute the uncollected phenotype … Our method can impute the summary statistic of the target phenotype as a weighted linear combination of the summary statistics of related phenotypes.”</p>
<p><strong><a href="http://genome.cshlp.org/content/early/2016/06/14/gr.201996.115.abstract">Multikernel linear mixed models for complex phenotype prediction, Weissbrod et al. Gen Res</a></strong></p>
<p>“Here, we present multikernel linear mixed model (MKLMM), a flexible modeling framework that allows for both global and local high-order interactions modeled via RKHS, as well as modeling of a heterogeneous effect-size distribution … MKLMM-Adapt held a statistically significant advantage over AMB [adaptive multi-BLUP] in prediction of Crohn’s disease (CD), type 1 diabetes (T1D), and ulcerative colitis (UC), whereas AMB did not hold a statistically significant advantage over MKLMM-Adapt in any data set”</p>
<h2 id="computational-phasing">Computational phasing</h2>
<p><strong><a href="http://www.nature.com/ng/journal/v48/n7/full/ng.3571.html">Fast and accurate long-range phasing in a UK Biobank cohort, Loh et al. Nat Genet</a></strong></p>
<p>“The basic idea of our approach is to harness IBD from distant relatedness (up to ~12 generations from a common ancestor) that is pervasive within very large cohorts … we observed that Eagle analysis of all N ≈ 150,000 samples together completed three times faster than SHAPEIT2 10 × 15,000 analysis while achieving a 67% (1%) decrease in switch error rate: switch error rate of 0.30% (0.01%) for Eagle 1 × 150,000 versus 0.90% (0.06%) for SHAPEIT2 10 × 15,000 … corresponding to perfect phase in a majority of 10-Mb segments”</p>
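<p>For readers unfamiliar with the metric, here is a rough sketch of how a switch error rate can be computed when the true phase is known (for example from trios). The function name and the toy haplotypes are hypothetical; the last line only re-derives the relative decrease from the two rates quoted above.</p>

```python
def switch_error_rate(true_hap, inferred_hap, het_sites):
    """Fraction of consecutive heterozygous-site pairs whose relative phase is wrong.

    true_hap / inferred_hap : alleles (0/1) carried on the first haplotype
    het_sites               : indices of the heterozygous sites
    """
    # At a heterozygous site the inferred haplotype either matches the truth or is flipped.
    flipped = [true_hap[i] != inferred_hap[i] for i in het_sites]
    # A switch error occurs whenever the flip status changes between neighbouring het sites.
    switches = sum(a != b for a, b in zip(flipped, flipped[1:]))
    opportunities = max(len(flipped) - 1, 0)
    return switches / opportunities if opportunities else 0.0


# Toy example: 6 heterozygous sites, phase switches after site 1 and after site 4
example_rate = switch_error_rate(
    true_hap=[0, 1, 0, 1, 1, 0],
    inferred_hap=[0, 1, 1, 0, 0, 0],
    het_sites=range(6),
)  # 2 switches over 5 intervals = 0.4

# Relative decrease implied by the quoted rates (0.90% for SHAPEIT2, 0.30% for Eagle)
relative_decrease = 1 - 0.30 / 0.90  # about 0.67, matching the quoted 67%
```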
<p><strong><a href="http://www.nature.com/ng/journal/v48/n7/full/ng.3583.html">Haplotype estimation for biobank-scale data sets, O’Connell et al. Nat Genet</a></strong></p>
<p>“SHAPEIT3 enhances SHAPEIT2 in two ways that enable it to deal with very large data sets, such as the UKB study. The first advance is based on the intuition that larger sample sizes are likely to result in increased local similarity between groups of haplotypes due to the higher probability of more recent shared ancestry … The second advance involves changes to the Markov chain Monte Carlo (MCMC) sampling routine that result in additional gains in speed. As sample size grows, it becomes more likely that two individuals will have a long stretch of sequence in common within a particular window … We removed the trio parents from the data set and phased the whole of chromosome 20 (16,265 genotyped sites) for the remaining 152,112 individuals. This run resulted in a median switch error rate of 0.4% and took 38.5 h using ten threads.”</p>
<h2 id="popgen">Popgen</h2>
<p><strong><a href="http://www.nature.com/nature/journal/v534/n7606/full/nature17993.html">The genetic history of Ice Age Europe, Fu et al. Nature</a></strong></p>
<p>“Here we analyse genome-wide data from 51 Eurasians from ~45,000–7,000 years ago. Over this time, the proportion of Neanderthal DNA decreased from 3–6% to around 2%, consistent with natural selection against Neanderthal variants in modern humans. Whereas there is no evidence of the earliest modern humans in Europe contributing to the genetic composition of present-day Europeans, all individuals between ~37,000 and ~14,000 years ago descended from a single founder population which forms part of the ancestry of present-day Europeans.”</p>
</description>
<pubDate>Mon, 04 Jul 2016 00:00:00 -0400</pubDate>
<link>http://sashagusev.github.io/2016-07/papers.html</link>
<guid isPermaLink="true">http://sashagusev.github.io/2016-07/papers.html</guid>
<category>papers</category>
</item>
<item>
<title>Hot takes: interesting papers from Apr/May</title>
<description><p><em>Better late than never</em></p>
<h2 id="gwascausal-mechanisms">GWAS/causal mechanisms</h2>
<p><strong><a href="http://www.nature.com/nature/journal/v533/n7604/full/nature17671.html">Genome-wide association study identifies 74 loci associated with educational attainment, Okbay et al. Nature</a></strong></p>
<p><em>The Supplemental Materials to this paper contain the most thorough and clear description of cutting-edge GWAS analyses I have read of late</em></p>
<p><strong><a href="http://www.nature.com/nature/journal/v533/n7601/full/nature17939.html">Parkinson-associated risk variant in distal enhancer of α-synuclein modulates target gene expression, Soldner et al. Nature</a></strong></p>
<p>“Here we describe an alternative experimental approach to identify functional risk variants based on three recent innovations in genetics and molecular biology: (i) the prioritization of GWAS-identified risk variants in regulatory elements such as distal enhancers annotated based on genome-scale epigenetic data; (ii) the generation of genetically controlled isogenic pluripotent stem cell lines in which specific disease-associated genetic variants are the sole modified experimental variable using efficient gene-editing technologies such as the CRISPR/Cas9 system; and (iii) the analysis of cis-acting effects of candidate variants on allele-specific gene expression through deletion or exchange of disease-associated regulatory elements.”</p>
<p><strong><a href="http://www.nature.com/ng/journal/vaop/ncurrent/full/ng.3570.html">Detection and interpretation of shared genetic influences on 42 human traits, Pickrell et al. Nat Genet</a></strong></p>
<p>“It is also striking to note how many genetic variants influence multiple traits but without a consistent correlation in effect sizes … Another possibility is that a given genetic variant often influences the function of multiple cell types through separate molecular pathways or that the effects of a variant on two related phenotypes vary according to an individual’s environmental exposures.”</p>
<h2 id="regulatory-mechanisms">Regulatory mechanisms</h2>
<p><strong><a href="http://www.nature.com/nmeth/journal/v13/n4/abs/nmeth.3799.html">Tissue-specific regulatory circuits reveal variable modular perturbations across complex diseases, Marbach et al. Nat Methods</a></strong></p>
<p>“For most traits, evidence of increased connectivity between perturbed genes extended to variants that did not pass the genome-wide significance threshold, indicating that regulatory network information will be useful for prioritizing candidate variants.”</p>
<p><strong><a href="http://www.cell.com/cell/abstract/S0092-8674(16)30339-7">Pooled ChIP-Seq Links Variation in Transcription Factor Binding to Complex Disease Risk, Tehranchi et al. Cell</a></strong></p>
<p>“Altogether, 3,601 of our bQTLs have been previously implicated by GWAS, either directly or indirectly via LD (r<sup>2</sup> &gt; 0.8 in YRI). These represent 2,282 different disease-associated variants, 995 (44%) of which are associated with bQTLs for multiple factors. Interestingly, this is 8.0-fold higher than the overall fraction of bQTLs associated with multiple factors, suggesting that variants affecting multiple TFs are more likely to impact phenotypes … intersecting GWAS loci with all binding sites of a TF may yield misleading results because these overlaps are dominated by SNPs with no effect on TF binding.”</p>
<p><strong><a href="http://www.pnas.org/content/113/23/6508.abstract">Syntax compensates for poor binding sites to encode tissue specificity of developmental enhancers, Farley et al. PNAS</a></strong></p>
<p>“Surprisingly, enhancers with low-affinity binding sites can mediate robust tissue specific patterns of gene expression when they are organized with optimal syntax. Such enhancers may be a vastly underappreciated feature of the regulatory genome.”</p>
<p><strong><a href="http://genome.cshlp.org/content/early/2016/05/03/gr.200535.115.abstract">Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks, Kelley et al. Gen Res</a></strong></p>
<p>“We introduce an open source package Basset to apply CNNs [deep convolutional neural networks] to learn the functional activity of DNA sequences from genomics data. We trained Basset on a compendium of accessible genomic sites mapped in 164 cell types by DNase-seq and demonstrate greater predictive accuracy than previous methods. Basset predictions for the change in accessibility between variant alleles were far greater for GWAS SNPs that are likely to be causal relative to nearby SNPs in linkage disequilibrium with them.”</p>
<p><strong><a href="http://www.nature.com/ng/journal/v48/n5/abs/ng.3539.html">Enhancer–promoter interactions are encoded by complex genomic signatures on looping chromatin, Whalen et al. Nat Genet</a></strong></p>
<p>“The resulting models accurately predict individual enhancer–promoter interactions across multiple cell lines with a false discovery rate up to 15 times smaller than that obtained using the closest gene … Most of this signature is not proximal to the enhancers and promoters but instead decorates the looping DNA.”</p>
<h2 id="gene-expression">Gene expression</h2>
<p><strong><a href="http://science.sciencemag.org/content/352/6285/600.long">RNA splicing is a primary link between genetic variation and disease, Li et al. Science</a></strong></p>
<p>“~65% of expression quantitative trait loci (eQTLs) have primary effects on chromatin, whereas the remaining eQTLs are enriched in transcribed regions … splicing QTLs are major contributors to complex traits, roughly on a par with variants that affect gene expression levels.”</p>
<p><strong><a href="http://www.cell.com/ajhg/abstract/S0002-9297(16)00071-9">Imputing Gene Expression in Uncollected Tissues Within and Beyond GTEx, Wang et al. AJHG</a></strong></p>
<p>“By analyzing data from nine selected tissue types in the GTEx pilot project, we demonstrated that harnessing expression quantitative trait loci (eQTLs) and tissue-tissue expression-level correlations can aid imputation of transcriptome data from uncollected GTEx tissues. More importantly, we showed that by using GTEx data as a reference, one can impute expression levels in inaccessible tissues in non-GTEx expression studies.”</p>
</description>
<pubDate>Fri, 17 Jun 2016 00:00:00 -0400</pubDate>
<link>http://sashagusev.github.io/2016-06/papers.html</link>
<guid isPermaLink="true">http://sashagusev.github.io/2016-06/papers.html</guid>
<category>papers</category>
</item>
<item>
<title>Hot takes: interesting papers from March</title>
<description><p><em>Intriguing papers that were published in the previous month, with highlights.</em></p>
<h2 id="gene-expression">Gene Expression</h2>
<p><strong><a href="http://www.nature.com/ng/journal/vaop/ncurrent/full/ng.3538.html">Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets, Zhu et al. Nat Genet</a></strong></p>
<p>“We propose a method (called SMR) that integrates summary-level data from GWAS with data from expression quantitative trait locus (eQTL) studies to identify genes whose expression levels are associated with a complex trait because of pleiotropy … In the SMR analysis of five complex traits, we initially identified associations for 289 genes by the SMR test. However, 185 of the 289 genes did not pass the subsequent HEIDI test (P<sub>HEIDI</sub> &lt; 0.05), suggesting that the majority of the associations identified by the SMR test could be explained by linkage due to the large number of cis-eQTLs widely spread across the genome … We observed from the analysis of five complex human traits that about two-thirds of the genes identified by SMR were not the genes nearest the top GWAS SNPs.”</p>
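<p>A small sketch of the SMR statistic as I understand it from the paper: the effect of expression on the trait is estimated as the ratio of the SNP-trait and SNP-expression effects, and the test statistic combines the two z-scores into an approximate chi-square with one degree of freedom. The function name and the input values below are made up for illustration, and the HEIDI heterogeneity test is not covered.</p>

```python
import math


def smr_test(b_zx, se_zx, b_zy, se_zy):
    """SMR-style test at a single top cis-eQTL.

    b_zx, se_zx : SNP effect on expression (eQTL study) and its standard error
    b_zy, se_zy : SNP effect on the trait (GWAS) and its standard error
    """
    z_zx = b_zx / se_zx  # eQTL z-score
    z_zy = b_zy / se_zy  # GWAS z-score
    b_xy = b_zy / b_zx   # ratio estimate of the expression-to-trait effect
    # Approximate chi-square (1 df) statistic combining the two z-scores
    t_smr = (z_zy ** 2 * z_zx ** 2) / (z_zy ** 2 + z_zx ** 2)
    # Upper tail of a chi-square with 1 df, written with the complementary error function
    p_smr = math.erfc(math.sqrt(t_smr / 2))
    return b_xy, t_smr, p_smr


# Toy inputs: a strong eQTL (z = 10) and a moderately strong GWAS signal (z = 6)
b_xy, t_smr, p_smr = smr_test(b_zx=0.5, se_zx=0.05, b_zy=0.03, se_zy=0.005)
```

<p>Note that the statistic can never exceed the smaller of the two squared z-scores, so a weak eQTL limits power no matter how strong the GWAS signal is.</p>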
<p><strong><a href="http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1005908">Insight into Genotype-Phenotype Associations through eQTL Mapping in Multiple Cell Types in Health and Immune-Mediated Disease, Peters et al. PLoS Gen</a></strong></p>
<p>“We performed eQTL mapping in five primary immune cell types from patients with active inflammatory bowel disease (n = 91), anti-neutrophil cytoplasmic antibody-associated vasculitis (n = 46) and healthy controls (n = 43), revealing eQTLs present only in the context of active inflammatory disease. Moreover, we show that following treatment a proportion of these eQTLs disappear. Through joint analysis of expression data from multiple cell types, we reveal that previous estimates of eQTL immune cell-type specificity are likely to have been exaggerated”</p>
<p><strong><a href="http://www.nature.com/ng/journal/v48/n4/full/ng.3513.html">A multiple-phenotype imputation method for genetic studies, Dahl et al. Nat Genet</a></strong></p>
<p>“Here we have proposed a general method to impute missing phenotypes in samples with arbitrary levels of relatedness and population structure and missingness patterns … We are extending the model to test a SNP for association with multiple phenotypes, using a spike-and-slab mixture prior on effect sizes to allow for only a subset of phenotypes to be associated. Incorporating significant SNPs into our model would likely increase imputation accuracy … Higher-dimensional data sets, such as ‘three-dimensional’ gene expression experiments across multiple samples, genes and tissues, also have missing ‘phenotypes’ that may be reliably imputed to boost signal in downstream analyses.”</p>
<p><strong><a href="http://www.cell.com/ajhg/fulltext/S0002-9297(16)00071-9">Imputing Gene Expression in Uncollected Tissues Within and Beyond GTEx, Wang et al. AJHG</a></strong></p>
<p>“In this work, we developed multi-tissue imputation methods to impute gene expression in uncollected or inaccessible tissues … we propose a mixed-model-based random-forest approach … our proposed methods impute multi-tissue expression levels on the basis of eQTLs, tissue-tissue expression-level correlations, and tissue-specific PCs of expression data and harness genetic factors, major developmental biological factors, and environmental factors. Additionally, our MixRF approach captures the dominant and recessive eQTL effects [(such that 58%, 38%, or 4% of the eQTL expression pairs better fit an additive, dominant, or recessive eQTL model, respectively)], as well as the interactions among eQTLs, tissue types, and other factors.”</p>
<h2 id="popgen">Popgen</h2>
<p><strong><a href="http://www.nature.com/nature/journal/v531/n7593/full/nature17143.html">Sex speeds adaptation by altering the dynamics of molecular evolution, McDonald et al. Nature</a></strong></p>
<p>“Together, our results show that sex increases the rate of adaptation both by combining beneficial mutations into the same background and by separating deleterious mutations from advantageous backgrounds that would otherwise drive them to fixation. In other words, sex makes natural selection more efficient at sorting beneficial from deleterious mutations.”</p>
<p><strong><a href="http://www.cell.com/current-biology/abstract/S0960-9822(16)30247-0">The Combined Landscape of Denisovan and Neanderthal Ancestry in Present-Day Humans, Sankararaman et al. Current Bio</a></strong></p>
<p>“In Oceanians, the average size of Denisovan fragments is larger than Neanderthal fragments, implying a more recent average date of Denisovan admixture in the history of these populations. We document more Denisovan ancestry in South Asia than is expected based on existing models of history, reflecting a previously undocumented mixture related to archaic humans. Denisovan ancestry, just like Neanderthal ancestry, has been deleterious on a modern human genetic background, as reflected by its depletion near genes. Finally, the reduction of both archaic ancestries is especially pronounced on chromosome X and near genes more highly expressed in testes than other tissues. This suggests that reduced male fertility may be a general feature of mixtures of human populations diverged by &gt;500,000 years.”</p>
<h2 id="gwas-causal-mechanisms">GWAS causal mechanisms</h2>
<p><strong><a href="http://www.nature.com/ng/journal/vaop/ncurrent/full/ng.3529.html">Genetic risk for autism spectrum disorders and neuropsychiatric variation in the general population, Robinson et al. Nat Genet</a></strong></p>
<p>“Using several large ASD consortium and population-based resources (total n &gt; 38,000), we find genome-wide genetic links between ASDs and typical variation in social behavior and adaptive functioning … These results suggest that familiality should be studied in a manner beyond a count of categorically affected family members and that trait variation in controls can provide insight into the underlying etiology of severe neurodevelopmental and psychiatric disorders”</p>
<p><strong><a href="http://science.sciencemag.org/content/352/6281/91">A long noncoding RNA associated with susceptibility to celiac disease, Castellanos-Rubio et al. Science</a></strong></p>
<p>“The studies presented here identify lnc13 as a previously unrecognized lncRNA that harbors CeD-associated SNPs; demonstrate that lnc13 is degraded by Dcp2 after NF-kB activation; and, most importantly, show that lnc13 is able to regulate the expression of a subset of CeD-associated inflammatory genes through interaction with chromatin and the multifunctional protein hnRNPD. We believe that lnc13 plays a role in the maintenance of intestinal mucosal immune homeostasis and that dysregulation of lnc13 expression and function—as a result of decapping and genetic polymorphisms, respectively—contributes to inflammation in autoimmune disorders such as CeD”</p>
<p><strong><a href="http://science.sciencemag.org/content/351/6278/1166">Rare variant in scavenger receptor BI raises HDL cholesterol and increases risk of coronary heart disease, Zanoni et al. Science</a></strong></p>
<p>“Through targeted sequencing of coding regions of lipid-modifying genes in 328 individuals with extremely high plasma HDL-C levels, we identified a homozygote for a loss-of-function variant, in which leucine replaces proline 376 (P376L), in SCARB1, the gene encoding SR-BI. The P376L variant impairs post-translational processing of SR-BI and abrogates selective HDL cholesterol uptake in transfected cells … Our results are consistent with a growing theme in HDL biology that steady-state concentrations of HDL-C are not causally protective against CHD and that HDL function and cholesterol flux may be more important than absolute levels.”</p>
</description>
<pubDate>Tue, 05 Apr 2016 00:00:00 -0400</pubDate>
<link>http://sashagusev.github.io/2016-04/papers.html</link>
<guid isPermaLink="true">http://sashagusev.github.io/2016-04/papers.html</guid>
<category>papers</category>
</item>
<item>
<title>A causal mechanisms cookbook</title>
<description><p>A few years back I saw a talk by <a href="http://www.cotsapaslab.info/">Chris Cotsapas</a> drawing a comparison between performing genome-wide association studies (GWAS) and putting together a parts list before figuring out how to fix a broken car. This analogy contextualizes the role of GWAS as necessary and crucial, but also highlights where the real payoff is - using the parts list to identify specific mechanisms. Since then, GWAS has identified thousands of parts, and in the past year we have started to see how those parts fit into the larger complex disease machine. Two recent studies - one from last month looking at schizophrenia and one from last year looking at obesity - start with GWAS loci and end with specific causal mechanisms. Both studies are clearly the culmination of much hard work and offer uniquely valuable insights. Here, I wanted to focus on the aspects that are generalizable.</p>
<p>Ideally, we would design an algorithm that takes all available biological features and spits out very high posteriors on causality at these loci. Perhaps the algorithm would even discard those features that are irrelevant. Is such an algorithm feasible? Is there a recipe for inferring causality? To start answering these questions, I’ve summarized the structure of these two studies in the same way one would write a recipe for a cookbook. I’ve made some notes at the end on things that struck me, but I would be very interested in hearing alternative theories as well as examples from other diseases.</p>
<h2 id="obesity--ftoirx">Obesity : FTO/IRX</h2>
<p><em>Smemo et al. “Obesity-associated variants within FTO form long-range functional connections with IRX3” 2014 Nature</em></p>
<p>This paper used the 3D chromatin structure to identify a functional connection between the well-studied FTO locus for obesity and distal gene IRX3; identified enhancers that modulate IRX3; and showed that Irx3-deficient mice have significant body-weight differences.</p>
<p><em>Claussnitzer et al. “FTO Obesity Variant Circuitry and Adipocyte Browning in Humans” 2015 NEJM</em></p>
<p>This paper builds on the previous model to further disentangle the mechanism: a specific causal variant disrupts a repressor motif, the loss of binding by the corresponding transcription factor de-represses an enhancer, this leads to a doubling of IRX3/IRX5 expression, and the result is a developmental shift in adipocytes. Multiple biological assays were used to validate each step of this mechanism, including knockdown and CRISPR editing in human cells (this validation is the meat of the paper but I won’t focus on it here).</p>
<p><img src="http://sashagusev.github.io/images/figure_claussnitzer.png" alt="locus from Claussnitzer et al" /></p>
<ol>
<li>Perform a large-scale GWAS. The strongest genome-wide association for BMI lies in introns 1 and 2 of the FTO gene. Focus on this locus.</li>
<li>Use epigenetic annotations from 127 cell types (the Roadmap Epigenomics project) to predict the cell type most likely to be causal. Identify an unusually long enhancer in adipocyte progenitors. Measure the association between the risk haplotype and enhancer activity (2.4x higher) to confirm tissue specificity.</li>
<li>Use Hi-C data to identify a Topologically Associating Domain (TAD) around the lead GWAS SNP and find chromatin interactions with nearby genes, restricting the potential causal gene set to those within the TAD.</li>
<li>Identify genes with genotype-associated expression (IRX3/IRX5) from the causal set.</li>
<li>Measure genome-wide expression in the relevant tissue from risk/non-risk allele carriers and identify differentially expressed genes. Infer the targeted cellular processes from the pathways these genes belong to.</li>
<li>Use the PMCA method to predict the causal variant. In short, using weight matrices from known motif families and cross-species analysis, count the number of (a) conserved motifs; (b) multiple consecutively conserved motifs (called “modules”). Measure the enrichment of motifs, motifs in modules, and modules relative to local shuffling, and identify the highest-scoring SNP (and the motifs it disrupts).</li>
<li>Multiple motifs were implicated, but <em>ARID5B</em> had the highest expression in the relevant (adipose) tissue, and expression of the <em>ARID5B</em> gene was correlated with expression of IRX3/IRX5 in controls.</li>
</ol>
<h2 id="schizophrenia--mhcc4">Schizophrenia : MHC/C4</h2>
<p><em>Sekar et al. “Schizophrenia risk from complex variation of complement component 4” 2016 Nature</em></p>
<p>This paper looks at GWAS for schizophrenia in the complicated MHC region, identifies a relationship between copy number and expression of the C4A and C4B genes, and shows that the subsequent expression of the gene is associated with schizophrenia. Multiple biological assays are used to localize the specific neuronal tissues and mechanisms related to these genes and implicate synaptic pruning.</p>
<p><img src="http://sashagusev.github.io/images/figure_sekar.png" alt="locus from Sekar et al" /></p>
<ol>
<li>Perform a large-scale GWAS. The strongest GWAS hit for schizophrenia is in the MHC, with the most significant SNPs lying near the C4 gene. There is also a known association at CSMD1 (on chr8), which codes for a regulator of C4. Focus on C4.</li>
<li>Measure the copy number of each C4 gene using molecular methods in 162 HapMap CEU samples. Identify four common structural haplotypes.</li>
<li>Identify SNP haplotypes that correlate with the C4 structural haplotypes. Multiple SNP haplotypes correlated with individual structural haplotypes (but not vice versa).</li>
<li>Measure expression in five brain regions from 674 post-mortem brains and quantify the relationship between C4 copy number and expression. RNA expression was correlated with the copy number of C4 and its isotypes.</li>
<li>Construct genetic/structural predictors of C4A/C4B expression levels in the brain. Fit <script type="math/tex">E \sim \sum_j{\beta_j d_j }</script> where <script type="math/tex">E</script> is expression and <script type="math/tex">d_j</script> is the number of structural elements of type <script type="math/tex">j</script>. These explain 71% and 42% of the variance in expression, more than any single cis-eQTL.</li>
<li>Construct SNP predictors of C4 alleles. Use standard haplotype imputation software (BEAGLE) to impute them into large individual-level GWAS data (N=65,000). Associate the imputed expression, as well as local SNPs, with schizophrenia.</li>
</ol>
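<p>Steps 5 and 6 can be sketched in a few lines: fit the no-intercept linear model of expression on structural-element counts by least squares, then apply the fitted weights to structural alleles imputed into a larger cohort. This is only an illustration under simplifying assumptions; the element types, counts and expression values below are toy numbers rather than data from the paper, and the function names are hypothetical.</p>

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_structural_predictor(dosages, expression):
    """Least-squares fit of E ~ sum_j beta_j * d_j (no intercept) via the normal equations."""
    k = len(dosages[0])
    xtx = [[sum(row[i] * row[j] for row in dosages) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * e for row, e in zip(dosages, expression)) for i in range(k)]
    return solve(xtx, xty)


def predict(dosages, betas):
    """Genetically predicted expression for each individual."""
    return [sum(b * d for b, d in zip(betas, row)) for row in dosages]


def variance_explained(expression, predicted):
    """Proportion of expression variance captured by the prediction (R^2)."""
    mean = sum(expression) / len(expression)
    ss_total = sum((e - mean) ** 2 for e in expression)
    ss_resid = sum((e - p) ** 2 for e, p in zip(expression, predicted))
    return 1 - ss_resid / ss_total


# Toy reference panel: counts of two hypothetical structural-element types per individual
panel_dosages = [[2, 0], [1, 1], [0, 2], [2, 1], [1, 2], [3, 1]]
panel_expression = [2.0, 1.5, 1.0, 2.5, 2.0, 3.5]  # generated as 1.0 * d_1 + 0.5 * d_2

betas = fit_structural_predictor(panel_dosages, panel_expression)
r2 = variance_explained(panel_expression, predict(panel_dosages, betas))

# Step 6: apply the fitted weights to structural alleles imputed into a GWAS cohort
gwas_dosages = [[2, 2], [1, 0]]
gwas_predicted_expression = predict(gwas_dosages, betas)
```

<p>The predicted expression values would then be tested for association with case/control status in the GWAS cohort, which is where the large sample size pays off.</p>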
<h2 id="thoughts">Thoughts</h2>
<p>My first thought is that this takes a <em>lot</em> of work. And that’s partially due to the fact that <strong>both GWAS SNPs were actually tagging unobserved biological mediators between genetics and expression</strong> (copy number for C4A/B and an enhancer for IRX3/5). The optimistic model where the lead GWAS SNP directly disrupts the nearest gene did not apply. Indeed, one of the key findings of the Smemo et al. paper is that the target gene can lie &gt;1 Mb away from the causal SNP and still interact in 3D chromatin space. Sekar et al. demonstrated how genetic prediction of expression from a small targeted study into a big GWAS helped confirm the impact of C4A/B in a way that could not have been done directly. This is an approach that my group and others have been thinking a lot about, and I believe it’s a very powerful framework for integrating such mediators across many studies. In both instances, <strong>one link of the causal mechanism was made by looking at genes that code for relevant regulators</strong>. Sekar et al. focused on the C4 genes because they were regulated by a known schizophrenia-associated gene. Claussnitzer et al. focused on the ARID5B motif because the corresponding gene was co-expressed with IRX3/5 and associated with adipogenesis.</p>
<p><img src="http://sashagusev.github.io/images/figure_c4a_gtex.png" alt="expression of C4A" /></p>
<p>The relationship to tissue-specificity is also complicated. Using the GTEx resource (which has generously made data and analysis tools publicly available), we can see above that neither gene is blatantly specific to a relevant tissue: adipose ranks 17th for expression of <a href="http://gtexportal.org/home/gene/IRX3">IRX3</a>, and brain ranks 29th for expression of <a href="http://gtexportal.org/home/gene/C4A">C4A</a>. However, both studies validated their loci using expression in the relevant tissue, and used the absence of expression in other tissues to short-list the possible causal pathways. <strong>Simple cross-tissue comparisons did not point to a single, clearly relevant tissue</strong>. That said, finding the right tissue was crucial to understanding the underlying mechanism. In the case of C4, the regions of the brain where these effects were most active implicated synaptic pruning. In the case of IRX3, by contrast, a genetic effect on expression was observed only in preadipocytes, not in whole adipose tissue. Background cross-tissue variability appears to be high enough to mask true underlying tissue-specific effects.</p>
<p>As more loci are characterized in this rigorous way, it will be interesting to see if these observations continue to hold up or if other patterns emerge.</p>
</description>
<pubDate>Sun, 20 Mar 2016 00:00:00 -0400</pubDate>
<link>http://sashagusev.github.io/2016-03/causal.html</link>
<guid isPermaLink="true">http://sashagusev.github.io/2016-03/causal.html</guid>
<category>gwas</category>
</item>
<item>
<title>Hot takes: interesting papers from Feb</title>
<description><p><em>Intriguing papers that were published in the previous month, with highlights.</em></p>
<h2 id="functional-annotation">Functional annotation</h2>
<p><strong><a href="http://journals.plos.org/plosgenetics/article?id=10.1371%2Fjournal.pgen.1005875">Which Genetics Variants in DNase-Seq Footprints Are More Likely to Alter Binding?, Moyerbrailean et al. PLOS Genet</a></strong></p>
<p>“Here, we integrated DNaseI footprinting data with sequence-based transcription factor (TF) motif models to predict the impact of a genetic variant on TF binding across 153 tissues and 1,372 TF motifs … As an example, the enrichment for LDL level-associated SNPs is 9.1-fold higher among SNPs predicted to affect HNF4 binding sites than in a background model already including tissue-specific annotation.”</p>
<p><strong><a href="http://www.nature.com/ng/journal/v48/n3/full/ng.3507.html">Weighting sequence variants based on their annotation increases power of whole-genome association studies, Sveinbjornsson et al. Nat Genet</a></strong></p>
<p>“Functional annotations can be used to improve power to detect associations. To increase power by incorporating the information that these annotations hold, we suggest a weighting scheme based on the observed enrichment using a weighted Bonferroni correction similar to that suggested by Roeder et al. for linkage analysis … we note that the number of significant signals using the enrichments as weights (n = 166) was not far from being optimal (n = 185), and this approach had more power than the standard Bonferroni method (n = 146) to detect associations … The commonly accepted P = 5 × 10^−8 threshold is outdated and will not be applicable in future GWAS.”</p>
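<p>A minimal sketch of a weighted Bonferroni correction in its generic form: each variant gets a weight, the weights are rescaled to average one, and variant i is declared significant when its p-value is at most alpha times its weight divided by the number of tests. How the weights are derived from annotation enrichments in the paper is not reproduced here; the weights and p-values below are toy numbers and the function name is hypothetical.</p>

```python
def weighted_bonferroni(p_values, weights, alpha=0.05):
    """Return one True/False per test: True when p_i is at most alpha * w_i / m."""
    m = len(p_values)
    mean_weight = sum(weights) / m
    # Rescale so the weights average 1, which keeps the family-wise error rate at alpha
    normalized = [w / mean_weight for w in weights]
    return [alpha * w / m >= p for p, w in zip(p_values, normalized)]


# Toy example: the first variant sits in an enriched annotation and gets extra weight
p_values = [0.02, 0.02, 0.001, 0.5]
weights = [3.0, 0.5, 0.25, 0.25]

weighted = weighted_bonferroni(p_values, weights)               # [True, False, True, False]
unweighted = weighted_bonferroni(p_values, [1.0, 1.0, 1.0, 1.0])  # [False, False, True, False]
```

<p>With equal weights this reduces to the standard Bonferroni threshold of alpha divided by the number of tests; up-weighting one class of variants necessarily makes the threshold stricter for the others.</p>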
<p><strong><a href="http://www.nature.com/ng/journal/v48/n2/full/ng.3477.html">A spectral approach integrating functional genomic annotations for coding and noncoding variants, Ionita-Laza et al. Nat Genet</a></strong></p>
<p>“Here we develop an unsupervised approach to integrate these different annotations into one measure of functional importance that, unlike most existing methods, is not based on any labeled training data. We show that the resulting meta-score has better discriminatory ability using disease-associated and putatively benign variants from published studies (in both coding and noncoding regions) than the recently proposed CADD score.”</p>
<h2 id="gene-expression">Gene expression</h2>
<p><strong><a href="http://www.cell.com/ajhg/abstract/S0002-9297(16)00004-5">A Burden of Rare Variants Associated with Extremes of Gene Expression in Human Peripheral Blood, Zhao et al. AJHG</a></strong></p>
<p>“After sequencing 2-kb promoter regions of 472 genes in 410 healthy adults, we performed a quadratic regression of rare variant count on bins of peripheral blood transcript abundance from microarrays. The overall burden test results are consistent with rare and private regulatory variants driving high or low transcription at specific loci, potentially contributing to disease.”</p>
<h2 id="gwas">GWAS</h2>
<p><strong><a href="http://journals.plos.org/plosgenetics/article?id=10.1371%2Fjournal.pgen.1005765">G = E: What GWAS Can Tell Us about the Environment, Gage et al. PLOS Genet</a></strong></p>
<p>“as large, richly phenotyped cohort studies (e.g., UK Biobank) emerge, it will become possible to identify modifiable exposures from genetic data and to dissect those pathways within the same cohort … A failure to appreciate this point will hamper our ability to translate the results of GWAS into health benefits, by focusing attention on possible biological pathways when, in fact, the target for intervention could be a modifiable environmental or behavioural exposure.”</p>
<p><strong><a href="http://www.nature.com/nature/journal/v530/n7589/full/nature16549.html">Schizophrenia risk from complex variation of complement component 4, Sekar et al. Nature</a></strong></p>
<p>“Schizophrenia’s strongest genetic association at a population level involves variation in the major histocompatibility complex (MHC) locus, but the genes and molecular mechanisms accounting for this have been challenging to identify. Here we show that this association arises in part from many structurally diverse alleles of the complement component 4 (C4) genes … These results implicate excessive complement activity in the development of schizophrenia and may help explain the reduced numbers of synapses in the brains of individuals with schizophrenia.”</p>
<p><strong><a href="http://www.cell.com/ajhg/fulltext/S0002-9297(15)00514-5">A Robust Example of Collider Bias in a Genetic Association Study, Day et al. AJHG</a></strong></p>
<p>“In summary, we have demonstrated that adjusting for causally associated covariates can create apparently highly robust, but actually biologically spurious, associations. The extent of this collider bias is almost perfectly inversely related to the strength of the exposure-collider association. Consideration of causal inference modeling and unadjusted test statistics is therefore of great importance in the design and interpretation of genetic (and non-genetic) association studies.”</p>
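<p>A toy simulation of the general phenomenon (not of the paper's data): a genotype-like variable G and an outcome Y are generated independently, a covariate C is caused by both, and adjusting for C induces an association between G and Y that is not there in the unadjusted analysis. All variables are standard normal for simplicity, which is an assumption of this sketch.</p>

```python
import random


def cov(x, y):
    """Sample covariance of two equal-length lists."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def slope(x, y):
    """Coefficient of x in the simple regression y ~ x."""
    return cov(x, y) / cov(x, x)


def adjusted_slope(x, y, c):
    """Coefficient of x in the two-predictor least-squares fit y ~ x + c."""
    sxx, scc, sxc = cov(x, x), cov(c, c), cov(x, c)
    sxy, scy = cov(x, y), cov(c, y)
    return (sxy * scc - sxc * scy) / (sxx * scc - sxc ** 2)


rng = random.Random(1)  # fixed seed so the run is reproducible
n = 20000
g = [rng.gauss(0, 1) for _ in range(n)]  # exposure (stand-in for a genotype)
y = [rng.gauss(0, 1) for _ in range(n)]  # outcome, generated independently of g
c = [gi + yi + rng.gauss(0, 1) for gi, yi in zip(g, y)]  # collider: caused by both

unadjusted = slope(g, y)            # close to 0: no true association
adjusted = adjusted_slope(g, y, c)  # close to -0.5: spurious, induced by adjusting for c
```

<p>The expected adjusted coefficient of -0.5 follows from the population covariances in this setup (Var(G) = 1, Var(C) = 3, Cov(G, C) = Cov(Y, C) = 1, Cov(G, Y) = 0).</p>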
<h2 id="popgen">Popgen</h2>
<p><strong><a href="http://www.nature.com/nature/journal/v530/n7591/full/nature16544.html">Ancient gene flow from early modern humans into Eastern Neanderthals, Kuhlwilm et al. Nature</a></strong></p>
<p>“We conclude that in addition to later interbreeding events, the ancestors of Neanderthals from the Altai Mountains and early modern humans met and interbred, possibly in the Near East, many thousands of years earlier than previously thought.”</p>
<p><strong><a href="http://science.sciencemag.org/content/351/6275/aaf3945">Erratum for the Report “Ancient Ethiopian genome reveals extensive Eurasian admixture in Eastern Africa”, Gallego et al. Science</a></strong></p>
<p>“A script necessary to convert the input produced by samtools v0.1.19 to be compatible with PLINK was not run when merging the ancient genome, Mota, with the contemporary populations SNP panel, leading to homozygote positions to the human reference genome being dropped as missing data (the analysis of admixture with Neandertals and Denisovans was not affected) … the geographic extent of the genetic impact of this migration was overestimated: The Western Eurasian backflow mostly affected East Africa and only a few Sub-Saharan populations; the Yoruba and Mbuti do not show higher levels of Western Eurasian ancestry compared to Mota.”</p>
<p><strong><a href="http://www.cell.com/ajhg/fulltext/S0002-9297(16)00011-2">The Kalash Genetic Isolate? The Evidence for Recent Admixture, Hellenthal et al. AJHG</a></strong></p>
<p>“These observations indicate that, contrary to the claim of Ayub et al. that the ancestors of the Kalash have been isolated from the ancestors of other extant populations for over 8,000 years, there is in fact strong evidence that they have not been isolated over this time frame.”</p>
<p><strong><a href="http://genome.cshlp.org/content/26/3/279.full">Whole-genome sequence analyses of Western Central African Pygmy hunter-gatherers reveal a complex demographic history and identify candidate genes under positive natural selection, Hsieh et al. Gen Res</a></strong></p>
<p>“we sequenced the genomes of four Biaka Pygmies … we fit models using the joint allele frequency spectrum … Our two best-fit models both suggest ancient divergence between the ancestors of the farmers and Pygmies, 90,000 or 150,000 yr ago.”</p>
<p><strong><a href="http://genome.cshlp.org/content/26/3/291.full">Model-based analyses of whole-genome data reveal a complex evolutionary history involving archaic introgression in Central African Pygmies, Hsieh et al. Gen Res</a></strong></p>
<p>“Our inference method rejects the hypothesis that the ancestors of [anatomically modern humans] were genetically isolated in Africa, thus providing the first whole genome-level evidence of African archaic admixture. Our inferences also suggest a complex human evolutionary history in Africa, which involves at least a single admixture event from an unknown archaic population into the ancestors of AMH, likely within the last 30,000 yr.”</p>
</description>
<pubDate>Tue, 01 Mar 2016 00:00:00 -0500</pubDate>
<link>http://sashagusev.github.io/2016-03/papers.html</link>
<guid isPermaLink="true">http://sashagusev.github.io/2016-03/papers.html</guid>
<category>papers</category>
</item>
<item>
<title>Even more on 'Limitations of GCTA'</title>
<description><p>The discussion of “Limitations of GCTA…” has now grown to three articles and I encourage readers interested in the mechanics of this model to read each of them:</p>
<ul>
<li><a href="http://biorxiv.org/content/early/2016/01/20/036574">Yang et al, “Commentary on ‘Limitations of GCTA’”</a>, rebutting all empirical claims in the original manuscript.</li>
<li><a href="http://biorxiv.org/content/early/2016/02/17/039594">Krishna-Kumar et al, “Response to Commentary”</a>, reiterating all empirical claims in the original manuscript and looking at additional datasets.</li>
<li><a href="http://biorxiv.org/content/early/2016/02/18/040055">Gamazon and Park, “SNP-based heritability estimation”</a>, rebutting the fundamental mathematical claim in the original manuscript.</li>
</ul>
<p>I greatly appreciate the passion that SK have shown in thinking and writing about this issue critically and from multiple different perspectives. However, my previous skepticism stands: there is no conclusive theoretical result showing that GCTA is biased when model assumptions are met, and the empirical results are confounded by relatedness. On the first point, I fear turning this into a pile-on so I’ll leave my thoughts below the fold for those interested in the boring details. On the second point, the authors have confirmed that they indeed included genetically related individuals and moved on to a new dataset (which is of mixed ancestry, introducing a host of additional complexities).</p>
<p>This exchange also highlights the ups and downs of non-peer-reviewed/immediate pre-prints. On the one hand, it’s awesome that this discussion is going on in a rapid, unfiltered way so geeks like myself can dig into the details as they emerge. On the other hand, having an un-answered commentary immediately go on-line places a tremendous amount of pressure on the other party to respond, which, coupled with lack of formal peer-review means the ideas are not always fully-formed. (<em>and yes I see the irony of complaining about not fully-formed ideas in a personal blog</em>). One possible advantage that pre-prints offer over the traditional commentary format is that the authors can spend several rounds hammering out differences internally and then release a more coherent point/counter-point. I think that’s something that would help resolve the ongoing tension and open questions here.</p>
<hr />
<p><strong>PS: Bias in REML estimates when assumptions are met</strong></p>
<p>The main disagreement stems from lack of clarity on what it means for REML assumptions to be met. Strictly speaking, <strong>REML makes only one assumption</strong>: that the phenotype is drawn from a multivariate-normal distribution with mean equal to zero and variance equal to the weighted sum of genetic relatedness matrix (GRM) and a residual. REML is then used to fit the <script type="math/tex">h^2_g</script> parameter that maximizes the likelihood in this model. Note that no additional assumptions about the disease architecture, effect-sizes, LD, or relatedness have been made. Indeed, it is the <em>interpretation</em> of this parameter that requires additional assumptions.</p>
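<p>Written out, this single assumption is the familiar variance-components model. The symbols here are my own shorthand rather than anything taken from the manuscripts under discussion: <script type="math/tex">A</script> is the GRM, <script type="math/tex">\sigma^2_g</script> and <script type="math/tex">\sigma^2_e</script> are the genetic and residual variance components, and <script type="math/tex">I</script> is the identity matrix.</p>
<script type="math/tex; mode=display">y \sim N(0, \sigma^2_g A + \sigma^2_e I), \qquad h^2_g = \frac{\sigma^2_g}{\sigma^2_g + \sigma^2_e}</script>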
<p>Under the standard assumption that causal effect-sizes (<script type="math/tex">\beta</script>) are i.i.d (with Gaussian residuals) and all variables were standardized, this parameter corresponds to <script type="math/tex">E[h^2_g] = \sum_j{\beta_j^2}</script> across SNPs <script type="math/tex">j</script> typed in the GRM. Again, no assumptions are made about LD, but the disease architecture is now constrained to have uncorrelated effects. [<em>In the Gaussian Process formalism, this is equivalent to having a weak prior that all SNPs are causal and then letting the data shrink the causal effects/functions.</em>] Recall that relatedness violates this assumption by inducing correlated effect sizes.</p>
<p>Under the very strict (and rarely applied) assumption that the GRM from typed SNPs is an unbiased estimator of the relatedness at all SNPs, this parameter further corresponds to the total heritability (<script type="math/tex">E[h^2_g] = h^2</script>). The SK response treats this last assumption as a requirement for the model, but it is in fact only a requirement for the <em>interpretation</em> of the quantity being estimated. According to SK, <script type="math/tex">h^2_g</script> must correspond to <script type="math/tex">h^2</script>, and they trivially show that this can be violated (for example by estimating from different chromosomes).</p>
<p>However, GCTA never makes this assumption, and SK haven’t demonstrated that the alternative definition <script type="math/tex">E[h^2_g] = \sum{\beta^2}</script> is invalid. In my opinion, defining <script type="math/tex">h^2_g</script> such that it is trivially biased and ignoring an alternative, unbiased definition is not very illuminating. Moreover, I find the GCTA interpretation highly useful precisely <em>because</em> it differs from the pedigree-based <script type="math/tex">h^2</script> quantity and estimates a meaningful property of a specific set of variants.</p>
</description>
<pubDate>Tue, 23 Feb 2016 00:00:00 -0500</pubDate>
<link>http://sashagusev.github.io/2016-02/SK-response.html</link>
<guid isPermaLink="true">http://sashagusev.github.io/2016-02/SK-response.html</guid>
<category>heritability</category>
</item>
<item>
<title>Our paper on transcriptome-wide association</title>
<description><p>Our paper on “<a href="http://www.nature.com/ng/journal/vaop/ncurrent/full/ng.3506.html">Integrative approaches for large-scale transcriptome-wide association studies</a>” is now out in Nature Genetics and I’m very proud of all the work that went into this by many people involved. Though the paper focuses on gene expression, I believe it’s addressing a broad challenge of integrating complex data with disease GWAS. Our goal here was to understand and quantify the transcriptional component of disease: which genes harbor mutations that affect expression, which in turn affects phenotype. Ideally we would do this using a large cohort which had measured genetics/SNPs, and phenotype (ex: BMI), and rich expression data (ex: gene expression painstakingly collected from relevant adipose tissue). With these pieces in hand, we could associate expression with BMI to find potential causal genes, or look at fancier models involving genetic correlation/mediation to isolate the shared genetics. Unfortunately what we typically have is SNPs and expression measured in a small study, and - separately - SNPs and phenotype measured in a large GWAS for which only the summary data is available. Our paper proposes a solution based on two key insights:</p>
<ol>
<li>We can leverage the fact that SNPs are observed in both studies and predict expression from one study into the other. In practice, if we restrict to the cis locus this prediction is very accurate, and we can essentially work with the predicted expression as if we had measured it directly. This is in strong contrast to genome-wide disease prediction where hundreds of thousands of samples are often still not enough to get traction.</li>
<li>In the special case where this is a linear predictor (that is, a sum of SNPs multiplied by weights) we can additionally use the fact that the relationship between SNPs (LD) is well-estimated in reference panels and infer what the predicted expression-phenotype association <em>would</em> be by only using the separately estimated relationships between (a) expression-SNP, (b) SNP-SNP/LD, and (c) SNP-phenotype. This sounds intuitive but it’s a powerful concept that allows one to do many very useful things with just summary data and LD (including estimate heritability, perform conditional analysis, and impute untyped variants).</li>
</ol>
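<p>To make the second insight a bit more concrete, here is a minimal sketch in R. It assumes the statistic takes the usual z-score form: a weighted sum of GWAS z-scores divided by the standard deviation that sum would have under the null, which is determined by LD. The function name <code>twas_z</code> and all of the numbers are hypothetical and only illustrate the arithmetic; this is not the released software.</p>
<pre><code class="language-R"># w  : expression weights for the SNPs in the cis locus (from the small expression study)
# z  : GWAS z-scores for the same SNPs (from the large summary-level study)
# ld : SNP-SNP correlation matrix (from a reference panel)
twas_z = function( w , z , ld ) {
  # numerator: predicted-expression / phenotype association as a weighted sum of z-scores
  # denominator: sd of that sum under the null, where z ~ N(0, ld)
  sum( w * z ) / sqrt( as.numeric( t(w) %*% ld %*% w ) )
}

# Toy locus with three SNPs (hypothetical values)
w = c( 0.5 , 0.3 , 0 )
z = c( 3.0 , 2.5 , 0.2 )
ld = matrix( c( 1.0 , 0.6 , 0.1 ,
                0.6 , 1.0 , 0.2 ,
                0.1 , 0.2 , 1.0 ) , 3 , 3 )
z.twas = twas_z( w , z , ld )
print( z.twas )
</code></pre>
<p>Note that with a single non-zero weight the statistic reduces to the GWAS z-score of that SNP, so the approach only differs from a single-SNP test when several correlated SNPs contribute to the prediction.</p>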
<p>We call this a TWAS (transcriptome-wide association study), emphasizing the fact that it identifies expression-disease associations at the scale of the largest GWAS study. This approach is conceptually very appealing because the model always aggregates the cis effects into a single unit corresponding to the gene. So whether we are fine-mapping known regions associated with disease or looking for novel associations we always identify likely genes. This is quite different from a GWAS, which picks out a set of SNPs (where the mechanism is often ambiguous); or eQTL-based analyses, which require ad hoc decisions on which eQTLs to select and how to overlap them with the disease. Of course, we also show that the method is substantially more powerful than other approaches when the model assumptions are met so there’s a practical relevance here: you find new associations, and those new associations have biological meaning. There’s a lot more going on in the paper that I hope readers find interesting, but the big takeaway is that we can use computational tools to get at the biological information we wish we had even when it’s scattered across different, seemingly isolated datasets.</p>
</description>
<pubDate>Sat, 13 Feb 2016 00:00:00 -0500</pubDate>
<link>http://sashagusev.github.io/2016-02/TWAS.html</link>
<guid isPermaLink="true">http://sashagusev.github.io/2016-02/TWAS.html</guid>
<category>self-promotion</category>
</item>
<item>
<title>Hot takes: interesting papers from January</title>
<description><p><em>Intriguing papers that were published in the previous month, with highlights.</em></p>
<h2 id="heritability">Heritability</h2>
<p><strong><a href="http://www.nature.com/ng/journal/v48/n1/full/ng.3446.html">The contribution of rare variation to prostate cancer heritability, Mancuso et al. Nat Genet</a></strong></p>
<p>“Our finding that 42% (95% confidence interval = 21–63%) of the genetic risk for prostate cancer is due to variants in the MAF range of 0.1–1% is striking, given that only a couple percent of neutral variation is due to SNPs in this frequency range.”</p>
<p><strong><a href="http://www.nature.com/ng/journal/v48/n1/abs/ng.3461.html">Abundant contribution of short tandem repeats to gene expression variation in humans, Gymrek et al. Nat Genet</a></strong></p>
<p>“We used variance partitioning to disentangle the contribution of eSTRs from that of linked SNPs and indels and found that eSTRs contribute 10–15% of the cis heritability [of expression] mediated by all common variants.”</p>
<p>“We hypothesize that there are more eSTRs to find in the genome…”</p>
<h2 id="population-genetics">Population genetics</h2>
<p><strong><a href="http://www.cell.com/ajhg/abstract/S0002-9297(15)00485-1">Genomic Signatures of Selective Pressures and Introgression from Archaic Hominids at Human Innate Immunity Genes, Deschamps et al. AJHG</a></strong></p>
<p>“Using full-genome sequence variation from the 1000 Genomes Project, we first show that innate immunity genes have globally evolved under stronger purifying selection than the remainder of protein-coding genes … Finally, we show that innate immunity genes present higher Neandertal introgression than the remainder of the coding genome.”</p>
<p><strong><a href="http://www.nature.com/ng/journal/v48/n1/full/ng.3464.html">Visualizing spatial population structure with estimated effective migration surfaces, Petkova et al. Nat Genet</a></strong></p>
<p>“EEMS is a new method for analyzing population structure from geo-referenced genetic samples. EEMS produces an intuitive visual representation of spatial patterns in genetic variation and highlights regions of higher-than-average and lower-than-average historical gene flow.”</p>
<p>“Distance matrices based on rare SNPs could also provide insights into more recent dispersal history…”</p>
<p><strong><a href="http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1005703">A Spatial Framework for Understanding Population Structure and Admixture, Bradburd et al. PLOS Gen</a></strong></p>
<p>“We use genome-wide polymorphism data to build “geogenetic maps,” which, when applied to stationary populations, produces a map of the geographic positions of the populations, but with distances distorted to reflect historical rates of gene flow.”</p>
<p>“Additionally, although we have focused on the covariance among alleles at the same locus, linkage disequilibrium (covariance of alleles among loci) holds rich information about the timing and source of admixture events as well as information about isolation by distance.”</p>
<p>“The inclusion of ancient DNA samples in the analyzed sample offers a way to get better representation of the ancestral populations from which the ancestors of modern samples received their admixture.”</p>
<h2 id="gene-expression">Gene Expression</h2>
<p><strong><a href="http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1005793">Genetic Variation, Not Cell Type of Origin, Underlies the Majority of Identifiable Regulatory Differences in iPSCs, Burrows et al. PLOS Gen</a></strong></p>
<p>“We show that the cell type of origin only minimally affects gene expression levels and DNA methylation in iPSCs (induced pluripotent stem cells), and that genetic variation is the main driver of regulatory differences between iPSCs of different donors. Our findings suggest that studies using iPSCs should focus on additional individuals rather than clones from the same individual.”</p>
<h2 id="gwas">GWAS</h2>
<p><strong><a href="http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1005803">Leveraging Genomic Annotations and Pleiotropic Enrichment for Improved Replication Rates in Schizophrenia GWAS, Wang et al. PLOS Gen</a></strong></p>
<p>“We have presented a novel algorithm, called CM3, which provides more accurate estimates of predicted replication probabilities for each SNP in a GWAS. Sorting SNPs based on predicted finite-sample replication probabilities incorporating auxiliary information, rather than by nominal p-values, yields a larger number of SNPs for a given replication threshold.”</p>
<p>“An important utility of the CM3 method may be selection of a greater proportion of relevant SNPs for gene set enrichment and biological pathway analyses…”</p>
<h2 id="big-data">Big Data</h2>
<p><strong><a href="http://www.cell.com/ajhg/references/S0002-9297(15)00491-7">Genotype Imputation with Millions of Reference Samples, Browning &amp; Browning AJHG</a></strong></p>
<p>“We demonstrate that Beagle v.4.1 scales to much larger reference panels [than IMPUTE or Minimac] by performing imputation from a simulated reference panel having 5 million samples”</p>
<p>“When there are millions of reference samples, use of a binary reference file can reduce wall clock computation time by &gt;80%.”</p>
<p>“With a reference panel containing 200,000 simulated European individuals, we find that markers with at least nine copies of the minor allele in the reference panel can be imputed with high accuracy (r2 &gt; 0.8) in target samples that have been genotyped with a 1M SNP array.”</p>
</description>
<pubDate>Thu, 04 Feb 2016 00:00:00 -0500</pubDate>
<link>http://sashagusev.github.io/2016-02/papers.html</link>
<guid isPermaLink="true">http://sashagusev.github.io/2016-02/papers.html</guid>
<category>papers</category>
</item>
<item>
<title>Gaussian process regression</title>
<description><p>Gaussian process (GP) regression is an interesting and powerful way of thinking about the old regression problem. Given the standard linear model:</p>
<script type="math/tex; mode=display">y = X\beta + \epsilon</script>
<p>where we wish to predict values of y in unlabeled test data, a typical solution is to use labeled training data to learn the <script type="math/tex">\beta</script>s (for example, by finding <script type="math/tex">\beta</script>s that minimize normally distributed residuals) and then apply them to test data to make point predictions. The GP instead describes the <script type="math/tex">y</script>s as arising from functions that have a joint multivariate-Gaussian distribution and learns the underlying mean and variance parameters. The models can be equally descriptive: just as N observations can be perfectly described by N <script type="math/tex">\beta</script>s, they can be perfectly described by N Gaussians (by centering a Gaussian at each observation and fitting a variance). But the GP offers some interesting additional flexibility, with particular relevance to genetics:</p>
<ul>
<li>By working in the space of functions instead of <script type="math/tex">\beta</script>s, the GP can express relationships between data and outcome that are intensive or impossible to describe in terms of weights (i.e. infinite functions). The GP defines a prior on the functions (using the kernel) and then allows the available training data to place restrictions on that prior before making predictions. [<em>Aside: this process of restricting the prior and making posterior predictions can also be done iteratively - previous posterior becomes current prior - as more data becomes available.</em>]</li>
<li>By working in the space of observations of <script type="math/tex">y</script>, the GP can efficiently express models where the number of predictors is greater than the number of observations (i.e. more SNPs than samples), which would be intractable for the standard linear model. Of course, there are LM solutions involving penalized regression and these have a close relationship to the GP.</li>
<li>Making predictions from a distribution instead of a linear combination of point estimates is conceptually appealing and fits well with the Bayesian formalism. The virtues of being Bayesian, like any matters of religion and politics, are probably best not discussed in a blog, so I’ll only add here that it also coincides with very attractive figures (see below).</li>
</ul>
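<p>The close relationship between penalized regression and the GP mentioned above can be checked numerically. The sketch below is my own toy example, with simulated data and a hypothetical penalty <code>lambda</code>. It uses more predictors than samples and compares the ridge regression prediction (which works in the space of weights) to the GP mean prediction with a linear kernel <script type="math/tex">K = XX^T</script> and noise variance equal to the penalty (which works in the space of observations).</p>
<pre><code class="language-R">set.seed(1)
n = 5    # samples
m = 20   # predictors (more predictors than samples)
X = matrix( rnorm( n * m ) , n , m )
y = rnorm( n )
X.new = matrix( rnorm( 2 * m ) , 2 , m )
lambda = 0.5   # ridge penalty, playing the role of the GP noise variance

# Weight space: solve an m x m system for the betas, then predict
beta.ridge = solve( t(X) %*% X + lambda * diag( m ) , t(X) %*% y )
pred.ridge = X.new %*% beta.ridge

# Function space: linear kernel, solve an n x n system over observations
K = X %*% t(X)
pred.gp = X.new %*% t(X) %*% solve( K + lambda * diag( n ) , y )

# The two predictions agree up to numerical error
print( max( abs( pred.ridge - pred.gp ) ) )
</code></pre>
<p>The practical difference is the size of the system being solved: m x m for the weights versus n x n for the observations, which is why the observation-space view is attractive when there are many more SNPs than samples.</p>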
<p>I’m going through the <a href="http://www.gaussianprocess.org/gpml/">Rasmussen and Williams 2006</a> book on Gaussian Processes for Machine Learning and wanted to outline here some of the basics of working with this model.</p>
<p><strong>The mathematical model.</strong> At the outset I’ll say that most of this will be paraphrasing Rasmussen and Williams (Chapter 2) with some modifications in notation to match what I’m used to. The code in R is minimal (see details at the end), but borrows greatly from <a href="http://www.jameskeirstead.ca/blog/gaussian-process-regression-with-r/">this implementation</a>.</p>
<p>Continuing from the linear model, we model <script type="math/tex">V = cov(y) = K(X,X) + \sigma_e^2 I</script>, where <script type="math/tex">K</script> is the kernel function describing the covariance (similarity) between observations (for simplicity, assume it’s just some N x N covariance over observations in <script type="math/tex">X</script>) and <script type="math/tex">\sigma_e^2</script> is the variance on the noise. Since our goal is prediction, unlabeled observations, <script type="math/tex">X_*</script>, can be described using the following joint distribution:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{bmatrix} y \\ y_* \end{bmatrix} \sim N(0, \begin{bmatrix} V & K(X,X_*) \\ K(X_*,X) & K(X_*,X_*) \end{bmatrix} ) %]]></script>
<p>The notation is getting heavy but the intuition is that <script type="math/tex">y</script> continues to be modeled by <script type="math/tex">V</script>, and any relationships between <script type="math/tex">y</script> and <script type="math/tex">y_*</script> are modeled through variations on <script type="math/tex">K(X,X_*)</script>. This leads to the following predictive functions given the training data:</p>
<script type="math/tex; mode=display">E[y_* | X,y,X_* ] = K(X_*,X) V^{-1} y</script>
<script type="math/tex; mode=display">cov( y_* ) = K(X_*,X_*) - K(X_*,X) V^{-1} K(X,X_*)</script>
<p>Conceptually, the first equation (i.e. the prediction given the data) takes the testing data, pulls it through the relationship to the training data, and then through the relationship between <script type="math/tex">V^{-1}</script> and <script type="math/tex">y</script>, the observed labels. The variance/covariance of the prediction is not as intuitive to me, but an important point is that it only depends on the similarity of the <script type="math/tex">X</script>s and not at all on the observations (although <script type="math/tex">\sigma^2_e</script>, which may depend on the observations, is still incorporated). These two simple manipulations of the kernel and variance fully capture prediction from the GP, and are all that is necessary for a working GP prediction (all code described at the end):</p>
<pre><code class="language-R">gp_solve = function( x.train , y.train , x.pred , kernel , sigma2e = 0 ) {
solution = list()
# compute necessary covariances
k.xx = kernel(x.train,x.train)
k.x_xp = kernel(x.train,x.pred)
k.xp_x = kernel(x.pred,x.train)
k.xp_xp = kernel(x.pred,x.pred)
# Invert the covariance matrix with noise
Vinv = solve(k.xx + sigma2e * diag(1, ncol(k.xx)))
# Compute the prediction
solution[["mu"]] = k.xp_x %*% Vinv %*% y.train
solution[["var"]] = k.xp_xp - k.xp_x %*% Vinv %*% k.x_xp
return( solution )
}
</code></pre>
<p>Though I won’t transcribe it here, the fact that everything has been defined in terms of Gaussian distributions means the GP also has an explicit marginal likelihood which empowers us to do formal model comparison and testing. Of note, this is the same likelihood that is used to identify the MLE heritability parameter using GREML.</p>
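<p>For reference, the log marginal likelihood of a zero-mean GP has the standard closed form below (see Rasmussen and Williams, Chapter 2), where N is the number of training observations:</p>
<script type="math/tex; mode=display">\log p(y | X) = -\frac{1}{2} y^T V^{-1} y - \frac{1}{2} \log |V| - \frac{N}{2} \log 2\pi</script>
<p>A minimal sketch of computing it in R follows. The function name <code>gp_loglik</code> and the toy inputs are hypothetical, and the Cholesky factorization is simply one common, numerically stable way to get both the quadratic form and the determinant.</p>
<pre><code class="language-R">gp_loglik = function( y , V ) {
  n = length( y )
  # Cholesky factor: R is upper triangular with t(R) %*% R = V
  R = chol( V )
  # alpha = V^{-1} y via two triangular solves
  alpha = backsolve( R , forwardsolve( t(R) , y ) )
  # log|V| = 2 * sum( log( diag(R) ) ), so the middle term is -sum( log( diag(R) ) )
  -0.5 * sum( y * alpha ) - sum( log( diag( R ) ) ) - 0.5 * n * log( 2 * pi )
}

# Toy check with hypothetical values
y = c( 1 , 2 )
V = matrix( c( 2.0 , 0.5 ,
               0.5 , 1.0 ) , 2 , 2 )
print( gp_loglik( y , V ) )
</code></pre>
<p>Evaluating this function over a grid of kernel or noise parameters (rebuilding V each time) is the simplest way to compare models.</p>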
<p><strong>The Gaussian process in practice.</strong> All of this makes much more sense when looking at how the model handles data. Here, I’ve generated five points from the function <script type="math/tex">y = sin(x) + \epsilon</script> and fit a prediction to these points using the GP. I’ll underscore again that in contrast to the standard (weight-space) regression, I never need to define a linear relationship between the observations <script type="math/tex">y</script> and some combination of <script type="math/tex">x</script>s (i.e. I’m not defining <script type="math/tex">y=sin(x) \beta + \epsilon</script> and looking for a <script type="math/tex">\beta</script>). Instead, I’ve selected a kernel which places a prior on the way observations are related in space (their covariance), allowed the data to constrain the prior, and sampled from the posterior. With more data I get better estimates of the multivariate-Gaussian distribution describing these points, but I don’t explicitly learn the underlying generative function. Below I’ve plotted the mean and confidence interval (2 times the sd) of the GP distribution learned from this data in blue line and blue shading:</p>
<p><img src="http://sashagusev.github.io/images/plot_gp.svg" alt="Gaussian Process Regression" /></p>
<p><em>[Aside: One subtle point that confused me is that the blue fit is not actually being plotted from the generative function with noise the way it is in a standard linear model. Rather, these lines represent draws of predicted values and their corresponding precision from the GP. This underscores the fact that the GP can model functions with infinite weights that are impossible to infer directly.]</em></p>
<p>This figure illustrates an attractive practical consequence of the GP estimating the posterior distribution: flexible confidence intervals. We don’t get just a uniform band around the mean prediction, but intervals that vary with the data. This is specifically demonstrated by the three points to the right of origin, which place a much stronger constraint on the posterior than the two points to the left. An alternative way of seeing this is that there are many more consistent functions that lie between the points on the left. This also leads to a natural way for an online algorithm to sample future points from the parts of the space that are most uncertain (for example, in the case where generating observations is very expensive).</p>
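<p>As a rough sketch of that last idea, the snippet below picks the next point to label by finding the candidate with the largest posterior variance. The training locations, candidate grid, noise variance, and Gaussian kernel are all hypothetical choices for illustration. Note that the observed y values never enter, since the posterior variance depends only on the x values.</p>
<pre><code class="language-R"># Gaussian kernel between two vectors of x values
k_gauss = function( a , b ) exp( -0.5 * outer( a , b , "-" )^2 )

x.train = c( -3 , -1.5 , 0.5 , 1 , 1.5 )   # hypothetical locations already labelled
x.cand = seq( -4 , 4 , by = 0.5 )          # hypothetical candidate locations
sigma2e = 0.01                             # assumed noise variance

Vinv = solve( k_gauss( x.train , x.train ) + sigma2e * diag( length( x.train ) ) )
k.c_t = k_gauss( x.cand , x.train )

# Diagonal of K(c,c) - K(c,t) V^{-1} K(t,c); K(c,c) has ones on the diagonal for this kernel
post.var = 1 - rowSums( ( k.c_t %*% Vinv ) * k.c_t )

# Candidate with the most uncertain prediction
x.next = x.cand[ which.max( post.var ) ]
print( x.next )
</code></pre>
<p>With a distance-based kernel like this one, the rule tends to favor candidates that are far from anything already labelled.</p>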
<p><strong>Kernels</strong>. Up until this point I’ve avoided talking about the <script type="math/tex">K</script>s that are used in the GP. These functions define how the <script type="math/tex">X</script>s are related to one another and have to follow certain rules so that they always produce a valid covariance: they must be symmetric and positive semi-definite, and for distance-based kernels such as the exponential and Gaussian, identical <script type="math/tex">X1,X2</script> values have the maximum covariance (<script type="math/tex">K(X1,X2) = 1</script> for the versions used here), which decays as the points move apart; see more rules in <a href="http://gpss.cc/gpss15/talks/KernelDesign.pdf">this talk</a>. The kernels describe how nearby points contribute to the covariance of the outcomes, and they allow the same underlying GP machinery to flexibly use different data priors. Again, the best way to understand this is to see the output, so I’ve defined a few kernels:</p>
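<pre><code class="language-R"># Each kernel maps two vectors of x values to a matrix of pairwise covariances
# (vectorized with outer(); see the code at the end for the element-wise iteration)
# Exponential kernel
kernel_exp = function( x1 , x2 ) exp( -1 * abs( outer( x1 , x2 , "-" ) ) )
# Brownian kernel
kernel_brownian = function( x1 , x2 ) outer( x1 , x2 , pmin )
# Gaussian kernel
kernel_gaussian = function( x1 , x2 ) exp( -0.5 * outer( x1 , x2 , "-" )^2 )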
</code></pre>
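<p>A quick sanity check for any kernel is that the matrix it produces is a valid covariance: symmetric, with no negative eigenvalues. The sketch below runs that check on a hypothetical grid of positive x values. I use a positive grid because the Brownian kernel is only a valid covariance for non-negative inputs.</p>
<pre><code class="language-R">x = seq( 0.5 , 4 , by = 0.5 )   # hypothetical grid of positive inputs

# Kernel matrices over all pairs of grid points
kernels = list(
  exponential = exp( -1 * abs( outer( x , x , "-" ) ) ) ,
  brownian    = outer( x , x , pmin ) ,
  gaussian    = exp( -0.5 * outer( x , x , "-" )^2 )
)

# For each kernel: is it symmetric (1 = yes) and what is its smallest eigenvalue?
check = sapply( kernels , function( K ) {
  c( symmetric = isSymmetric( K ) ,
     min.eigen = min( eigen( K , symmetric = TRUE , only.values = TRUE )$values ) )
} )
print( check )
</code></pre>
<p>A clearly negative smallest eigenvalue would mean the function is not a valid kernel for those inputs and the GP equations would no longer describe a proper Gaussian distribution.</p>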
<p>Note that each kernel is a very simple function that relates pairs of points. Below, I’ve fit these kernels to the previous points and plotted the results:</p>
<p><img src="http://sashagusev.github.io/images/plot_gp_multi_sin.svg" alt="GP fit with multiple kernels" /></p>
<p>There are a few interesting observations here. First, the trusty univariate regression can be captured in the GP using a linear kernel.</p>
<p>[<em>Aside: One may assume that the linear kernel will still have the pretty non-linear confidence intervals, but the shape of the variance follows the shape of the kernel. I imagine there can be situations where the generative model is linear, but points are drawn non-uniformly from X and the Gaussian kernel is preferred for more flexible confidence intervals.</em>]</p>
<p>Second, different kernels can yield very different distributions but tend to follow the principle that more data == more precision. Third, the kernel that’s closest to the generative model - in this case <script type="math/tex">sin(x)</script> - fits the data best.</p>
<p>Let’s see how each kernel deals with data that is instead sampled from a linear function <script type="math/tex">y = x + \epsilon</script>:</p>
<p><img src="http://sashagusev.github.io/images/plot_gp_multi_linear.svg" alt="GP fit of linear data with multiple kernels" /></p>
<p>Somewhat surprisingly, the complex kernels still do a pretty good job of fitting the linear function in parts of the space that have labelled observations. On the other hand, where data has not been observed (X &lt; -4 or X &gt; 4) the confidence intervals quickly expand and the prediction reverts to the prior implied by the kernel. This is even mildly true for the linear kernel, which raises an important point that the GP is not a magic bullet. If we knew that the data was really coming from a simple linear function of X, then the boring OLS - with its frumpy uniform confidence intervals - is exactly what we would want. That would give us great, confident predictions at x = 1,000 where the linear GP has never seen labelled data and would be highly uncertain. This is probably obvious, but model flexibility is only valuable if the underlying data necessitates it.</p>
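<p>To see the reversion to the prior concretely, here is a small sketch comparing OLS to a GP with a Gaussian kernel far outside the observed range. The simulated data, noise level, and kernel are hypothetical choices and this is not the code behind the figures.</p>
<pre><code class="language-R">set.seed(1)
x = seq( -4 , 4 , by = 1 )
y = x + rnorm( length( x ) , sd = 0.1 )   # linear generative model with a little noise
x.far = 1000                              # far outside the labelled range
sigma2e = 0.01                            # assumed noise variance for the GP

# OLS extrapolates along the fitted line
fit = lm( y ~ x )
pred.ols = predict( fit , newdata = data.frame( x = x.far ) )

# GP with a Gaussian kernel: covariance with the training points vanishes at x.far
k = function( a , b ) exp( -0.5 * outer( a , b , "-" )^2 )
Vinv = solve( k( x , x ) + sigma2e * diag( length( x ) ) )
mu.gp = k( x.far , x ) %*% Vinv %*% y
var.gp = k( x.far , x.far ) - k( x.far , x ) %*% Vinv %*% k( x , x.far )

print( c( ols = as.numeric( pred.ols ) , gp.mean = as.numeric( mu.gp ) , gp.var = as.numeric( var.gp ) ) )
</code></pre>
<p>The OLS prediction follows the line out to x = 1,000, while the GP mean falls back to the prior mean of zero with the full prior variance, because nothing in the training data is similar to the new point under this kernel.</p>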
<p><strong>Closing thoughts</strong>. I’m still learning and this is all just scratching the surface of what can be done with the Gaussian process framework. The underlying model can be extended to have multiple outcomes with specific covariance structure (e.g. time series or correlated phenotypes); multiple GPs placing individual priors on partitions of the feature space; approximations to non-Gaussian outputs (e.g. Poisson processes or binary classification); and apparently an entire language of kernel functions. In a future post, I’ll attempt to draw explicit connections between this model and work in genetics and consider relevant extensions.</p>
<p>I’ve been negligent about citations here, but much good reading on this topic is openly available on the <a href="http://www.gaussianprocess.org">gaussianprocess.org web-site</a>, the <a href="http://www.gaussianprocess.org/gpml/chapters/">GPML book</a>, and the Gaussian Process Summer Schools <a href="http://gpss.cc/gpss15/">lectures</a>. Useful applied examples are also available with the Python <a href="http://nbviewer.jupyter.org/github/SheffieldML/notebook/blob/master/GPy/index.ipynb">GPy</a> and <a href="http://scikit-learn.org/stable/modules/gaussian_process.html">scikit-learn</a> libraries.</p>
<h2 id="code">Code</h2>
<p>The full code to solve a GP regression and generate all figures in this post is available in this Gist:</p>
<p><a href="https://gist.github.com/sashagusev/c9287f488cd65c3ede9e">https://gist.github.com/sashagusev/c9287f488cd65c3ede9e</a></p>
</description>
<pubDate>Sun, 24 Jan 2016 00:00:00 -0500</pubDate>
<link>http://sashagusev.github.io/2016-01/GP.html</link>
<guid isPermaLink="true">http://sashagusev.github.io/2016-01/GP.html</guid>
<category>regression</category>
</item>
</channel>
</rss>