---
webr:
show-startup-message: false
packages: ['tibble', 'dplyr', 'feather']
format:
revealjs:
include-in-header:
- file: js_scripts.html
theme: [default, theme.scss]
scrollable: true
css: styles.css
transition: slide
highlight-style: github
slide-number: true
sc-sb-title: true
logo: images/quarto.png
footer-logo-link: "https://quarto.org"
chalkboard: true
resources:
- British_National_Corpus_Vector_size_300_skipgram/model.bin
bibliography: references.bib
csl: diabetologia.csl
filters:
- webr
- reveal-header
slide-level: 4
number-sections: false
engine: knitr
---
# [An Exploration of Information Loss in Transformer Embedding Spaces for Enhancing Predictive AI in Genomics]{
style="color:#FFFFFF;
font-size: 55px;
position: relative;
bottom: 60px;
text-shadow: -1px -1px 0 #000,
1px -1px 0 #000,
-1px 1px 0 #000,
1px 1px 0 #000;"
}{background-image="images/DALL_E_DNA_Image_edited.png" background-size="cover" background-color="#4f6952"}
<h1 style="color:#FFFFFF;
font-size: 28px;
text-align: center;
position: absolute;
top: 325px;
width: 100%;
text-shadow: -0.75px -0.75px 0 #000,
0.75px -0.75px 0 #000,
-0.75px 0.75px 0 #000,
                  0.75px 0.75px 0 #000;">Daniel Hintz <br> 2024-06-06</h1>
# Thesis Question
. . .
::: {style="text-align:center; font-size: 0.85em;"}
How easily can information be extracted from GenSLM embeddings while maintaining its original integrity, and what is the quality of the produced vector embeddings?
:::
::: {.notes}
- The jargon won't have meaning now, but it will very soon, i.e.:
- What is GenSLM
- What is an embedding
- Why do we care about the quality of an embedding
- What are real world applications
:::
# [Outline]{style="font-size: 55px;"}
::: {style="font-size: 0.65em;"}
- Thesis Question
- Background
- DNA
- Motivation
- GenSLM Embedding Algorithm
- Data and Processing
- Source and Cleaning
- Methodology
- Intrinsic and Extrinsic Evaluation
- Results
- Conclusion
:::
# [Background]{style="color:white;"}{background-color="#8FA9C2"}
## DNA
::: columns
::: {.column width="80%"}
::: incremental
::: {style="font-size: 0.6em;"}
- Deoxyribonucleic acid (DNA) is a molecule that contains the genetic code that provides the instructions for building and maintaining life.
- The structure of DNA can be thought of as rungs on a ladder (known as base pairs), involving the pairing of four nucleotides: Adenine (A), Cytosine (C), Guanine (G), and Thymine (T).
:::
:::
:::
::: {.column width="20%"}
::: {style="text-align:center; font-size: 0.5em;"}
![Figure 1: DNA Base Pairs; [@NHGRI2024]](images/DNA_diagram.png)
:::
:::
:::
## DNA Sequencing Technology
::: columns
::: {.column width="50%"}
::: incremental
::: {style="font-size: 0.6em;"}
- Someone gets tested for Covid; PCR is run to detect whether the assay is indeed positive.
- Positive tests are sequenced using a machine that is most likely either Illumina or Oxford Nanopore.
- The sequencer takes in the Covid sample and produces a digital copy of the DNA sequences.
:::
:::
:::
::: {.column width="50%"}
::: {style="text-align:center; font-size: 0.5em;"}
![Figure 2: Simplified Schema of Sequencing Process for SARS-CoV-2](images/DNA_sequencing.png)
:::
:::
:::
## SARS-CoV-2 and Proteins
::: columns
::: {.column width="40%"}
::: incremental
::: {style="font-size: 0.6em;"}
- SARS-CoV-2 has 29 proteins.
- Different proteins have different functions.
- Proteins are encoded from different sites of DNA.
- In studying embeddings, tasks have been either **whole genome oriented** or **protein specific**.
<!--
- In the context of SARS-CoV-2, proteins are responsible for the virus's structure and infectivity.
- Nonstructural Proteins primarily function in the replication and manipulation of the host cell's environment to favor viral replication.
-->
:::
:::
:::
::: {.column width="60%"}
::: {style="text-align:center; font-size: 0.5em;"}
![Figure 3: SARS-CoV-2 Proteins; Adapted from [@kandwal2023genetic], pg. 99](images/nsp1_str_non_str.png)
:::
:::
:::
## Mutations
::: columns
::: {.column width="40%"}
::: incremental
::: {style="font-size: 0.6em;"}
- A mutation in the context of DNA refers to a change in the nucleotide sequence of the genetic material of an organism.
- Mutations can be random or induced by external factors.
- For context, SARS-CoV-2 can mutate to become more infectious, increasing its chances of survival.
- There are many different types of mutations, but only substitutions are illustrated in this talk.
:::
:::
:::
::: {.column width="60%"}
::: {style="text-align:center; font-size: 0.5em;"}
![Figure 4: Substitution Mutations](images/mutation_example.png){width="45%"}
:::
:::
:::
:::{.notes}
- Spontaneous mutations: These arise naturally and randomly without any external influence, often due to errors in DNA replication. Errors might include mispairing of nucleotides or slippage during replication.
- Induced mutations: These are caused by external factors, known as mutagens, which can include physical agents like ultraviolet light and X-rays, or chemical agents such as certain drugs and pollutants.
**Why Mutations Occur**
Random errors: Many mutations result simply from random errors during DNA replication. DNA polymerase, the enzyme responsible for replicating DNA, is not perfect and occasionally makes mistakes.
Environmental stress: Exposure to environmental stressors like radiation, chemicals, and even biological agents can damage DNA and lead to mutations.
Evolutionary advantage: From an evolutionary perspective, mutations are a key mechanism of genetic variability and thus, evolution. While many mutations are neutral or harmful, some can confer advantages that lead to positive selection in a population.
:::
## Motivation
::: incremental
::: {style="font-size: 0.6em;"}
- This project came about from connecting with Nick Chia, a biophysicist at Argonne National Laboratory.
- Nick and his collaborators in 2020 (then at the Mayo Clinic) made an algorithm that predicted the trajectory of colorectal cancer mutations.
- However, this algorithm was slow and expensive to train.
- **The embeddings used in this algorithm were one of the biggest choke points**.
- New and improved ways are needed to embed large genomes into lower dimensions.
- For example, a one-hot encoding of the human genome has dimensions $n \times 3 \text{ billion} \times 4$, where $n$ is the number of patients, 3 billion is the number of rows (one per nucleotide), and the 4 columns represent the four possible bases (A, T, G, C); a minimal sketch follows this list.
- This is the **Big P, little n problem**; hence, dimension reduction via an embedding is extremely advantageous!
- Further work needs to be done in studying the quality of embeddings to help bring about more efficient real-world implementations, like precision medicine for colorectal cancer.
:::
:::
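::: {style="font-size: 0.6em;"}
As a minimal sketch of that one-hot representation (the helper `one_hot_dna()` is illustrative, not thesis code), each nucleotide becomes a row with a single 1 in the column of its base:
:::
```{r, echo=TRUE, eval=FALSE}
#| class-source: small-code
#| classes: small-code
# Hypothetical helper: one-hot encode a single short DNA string
one_hot_dna <- function(x) {
  bases <- c("A", "C", "G", "T")
  chars <- strsplit(x, "")[[1]]
  m <- t(vapply(chars, function(b) as.integer(bases == b), integer(4)))
  colnames(m) <- bases
  m
}
one_hot_dna("ACGT")  # 4 rows (positions) x 4 columns (bases)
```
::: {style="font-size: 0.6em;"}
Stacking one such matrix per patient yields the $n \times \text{length} \times 4$ array described above.
:::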
## Dimension Reduction
::: {style="font-size: 0.6em;"}
In the UW Statistics program what methods do we learn to apply dimension reduction?
:::
::: incremental
::: {style="font-size: 0.6em;"}
- Principal component analysis (PCA)!
- PCA, like an embedding, transforms the coordinate system of our data (see the sketch below).
- The difference in this project is that the dimension reduction is being applied to DNA.
:::
:::
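::: {style="font-size: 0.6em;"}
A minimal sketch of that coordinate transformation with `prcomp()` on simulated data (the matrix `X` is a stand-in, not project data):
:::
```{r, echo=TRUE, eval=FALSE}
#| class-source: small-code
#| classes: small-code
set.seed(1)
X <- matrix(rnorm(100 * 10), nrow = 100)   # 100 samples, 10 features
pca <- prcomp(X, center = TRUE, scale. = TRUE)
head(pca$x[, 1:2])                         # data re-expressed in the first 2 PCs
summary(pca)$importance[3, 1:2]            # cumulative proportion of variance
```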
## Embeddings
::: incremental
::: {style="font-size: 0.55em;"}
- An *embedding* is the name for the dense numeric vectors produced by *embedding algorithms*, such as GenSLM.
- "Embedding" generally refers to the embedding vector.
- Embeddings are representations of data in a lower-dimensional space that preserve some aspects of the original structure of the data.
- But why would you ever embed something?
- Embedding vectors are generally more computationally efficient.
:::
:::
::: {.notes}
- As an introduction, it will be easier to first walk through an example of an Embedding for natural language.
:::
## Sequence Embeddings
::: incremental
::: {style="font-size: 0.6em;"}
- Neural network embedding algorithms for genomic data include **GenSLM (the focus of this presentation)**, DNABERT2, and HyenaDNA.
- These embeddings are lossy encodings, meaning that some information is lost in the transformation.
- They can also distort structural relationships represented in the data.
:::
:::
::: {.notes}
- Just like we can represent words as vectors we can also represent genomic sequences as vectors.
:::
<!--
## Why Data Representation Matters
::: incremental
::: {style="font-size: 0.6em;"}
- An embedding is one way of representing data.
:::
:::
. . .
::: {style="text-align:center; font-size: 0.5em;"}
![](images/pred_rep_obj.png){width="65%"}
:::
::: {style="margin-top: 20px;"}
:::
::: incremental
::: {style="font-size: 0.6em;"}
- An effective representation (embedding) of a genomic sequence is crucial for all downstream tasks.
- i.e., clustering, classification, regression, protein function identification, structural analysis, and predicting genetic disorders [@angermueller2016deep].
- Given the increasing use of embeddings, embedding evaluation has also become more important.
:::
:::
-->
## GenSLM
::: panel-tabset
### Overview
::: columns
::: {.column width="40%"}
::: incremental
::: {style="font-size: 0.6em;"}
- Overall, the GenSLM algorithm first **tokenizes**, i.e., breaks up each sequence into chunks of three nucleotides (a tokenization sketch follows below).
- Then, the input sequences are **vectorized**, creating a $1 \times 512$ vector for each input sequence.
:::
:::
:::
::: {.column width="60%"}
::: {style="text-align:center; font-size: 0.5em;"}
![Figure 5: Generic Embedding Workflow](images/Tokenize_vectorize.png){width="50%"}
:::
:::
:::
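::: {style="font-size: 0.6em;"}
A minimal sketch of the tokenization step (the helper `tokenize_codons()` is illustrative, not GenSLM's actual tokenizer):
:::
```{r, echo=TRUE, eval=FALSE}
#| class-source: small-code
#| classes: small-code
# Split a sequence into non-overlapping 3-nucleotide (codon) tokens
tokenize_codons <- function(x) {
  starts <- seq(1, nchar(x) - 2, by = 3)
  vapply(starts, function(i) substr(x, i, i + 2), character(1))
}
tokenize_codons("ATGGCGTTTACA")  # "ATG" "GCG" "TTT" "ACA"
```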
### Transformer
::: columns
::: {.column width="40%"}
::: incremental
::: {style="font-size: 0.6em;"}
- First, the tokenized input sequence is passed to the transformer encoder.
- The transformer encoder converts 1,024 bp slices into numeric vectors (a slicing sketch follows below).
- These vectors are run recursively through a diffusion model to learn a condensed distribution of the whole sequence.
- The transformers in GenSLM are used to capture local interactions within a genomic sequence.
:::
:::
:::
::: {.column width="60%"}
::: {style="text-align:center; font-size: 0.5em;"}
![Figure 6: GenSLM Transformer Architecture; [@genslms2023]](images/GenSLM.png)
:::
:::
:::
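::: {style="font-size: 0.6em;"}
A minimal sketch of slicing a genome into 1,024 bp windows (simulated genome; not GenSLM's internal code):
:::
```{r, echo=TRUE, eval=FALSE}
#| class-source: small-code
#| classes: small-code
set.seed(1)
genome <- paste(sample(c("A", "C", "G", "T"), 30000, replace = TRUE), collapse = "")
starts <- seq(1, nchar(genome), by = 1024)
slices <- vapply(starts, function(i) substr(genome, i, min(i + 1023, nchar(genome))),
                 character(1))
length(slices)  # number of 1,024 bp slices passed to the transformer encoder
```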
### Diffusion + Transformer
::: incremental
::: {style="font-size: 0.6em;"}
- Diffusion and Transformer models work in tandem in GenSLM.
- Transformers capture local interactions, while the diffusion model learns a representation over the transformer vectors to represent the whole genome.
:::
:::
:::
# [Downstream]{style="color:white;"}{background-color="#8FA9C2"}
## Downstream
::: incremental
::: {style="font-size: 0.6em;"}
- Consider our colorectal cancer example: in the context of precision medicine, you may want to classify aggressive and non-aggressive cancers.
- Or in the context of SARS-CoV-2, you may want to be able to monitor changes in variants and how they differ from previous variants.
- Assuming embeddings in both examples, the tasks are considered **downstream** from the process of creating the embeddings from the original data.
:::
:::
## Raw DNA Versus Embedded Data
```{r}
library(DT)
library(dplyr) # provides the %>% pipe used below
variant_data_subset <- read.csv("https://raw.githubusercontent.com/DHintz137/Embedding_Presentation/main/raw-data/variant_data_subset.csv")
font.size <- "10pt"
variant_data_subset %>%
  DT::datatable(
    options = list(
      initComplete = htmlwidgets::JS(
        "function(settings, json) {",
        paste0("$(this.api().table().container()).css({'font-size': '", font.size, "'});"),
        "}")
    )
  )
```
## Project Workflow
::: {style="text-align:center; font-size: 0.5em;"}
![Figure 7: Project Workflow](images/Overall_workflow.png){width="30%"}
:::
# [Data and Processing]{style="color:white;"}{background-color="#8FA9C2"}
## Data Description
::: incremental
::: {style="font-size: 0.6em;"}
- All Covid sequences were downloaded with permission from the Global Initiative on Sharing All Influenza Data (GISAID).
- No geographical or temporal restrictions were placed on the sequences extracted.
:::
:::
## Data Cleaning
::: incremental
::: {style="font-size: 0.6em;"}
- GISAID's exclusion criteria were used to remove sequences of poor quality.
- Additional exclusion criteria were implemented to remove all sequences with ambiguous nucleotides (the `NA`s of the genomic world) and large gap sizes post-alignment.
:::
:::
## Multiple Sequence Alignment
::: incremental
::: {style="font-size: 0.6em;"}
- A multiple sequence alignment was performed so that the exact locations of proteins could be sliced out across different sequences.
:::
:::
. . .
::: {style="text-align:center; font-size: 0.5em;"}
![Figure 18: Un-aligned Sequences](images/Unaligned_example.png){width="30%"}
:::
. . .
::: {style="text-align:center; font-size: 0.5em;"}
![Figure 19: Aligned Sequences](images/aligned_example.png){width="30%"}
:::
# [Exploratory Data Analysis]{style="color:white;"}{background-color="#8FA9C2"}
<!--
## [What are the Patterns and Structures Oberved from the Sequence Data?]{style="font-size: 52px;"}
::: incremental
::: {style="font-size: 0.6em;"}
- For Exploratory Data Analysis, the goal was the characterize the structure of the data in its sequence representation.
- This laid the groundwork for subsequent comparisons with the embedding structure.
:::
:::
. . .
::: {style="text-align:center; font-size: 0.5em;"}
![Figure 8: Mutation Boxplot](images/Whole_Genome_Mutation_Count_Hamming_Distances.png){width="50%"}
:::
-->
##
::: {style="text-align:center; font-size: 0.5em;"}
![Figure 11: Hamming Distance of Aligned Proteins to Wuhan Reference Sequence](images/hamming_prot_heatmap_581_all_variants.png){width="100%"}
:::
# [Methodology]{style="color:white;"}{background-color="#8FA9C2"}
## Evaluating Embeddings
::: incremental
::: {style="font-size: 0.6em;"}
- Broadly speaking, the quality of an embedding is assessed based on its information richness and its degree of non-redundancy.
- There are two main methodologies for evaluating embeddings:
- When downstream learning tasks are evaluated, it’s referred to as **extrinsic evaluation** [@lavravc2021representation].
- When the qualities of the embedding matrix itself are assessed it is referred to as **intrinsic evaluation** [@lavravc2021representation].
:::
:::
## Intrinsic Evaluation
::: incremental
::: {style="font-size: 0.6em;"}
- There are three main qualities to assess when studying the quality of an embedding: **Redundancy**, **Separability**, and **Preservation of Semantic Distance**.
- Redundancy reveals the efficiency of an embedding's encoding; more efficient encodings tend to perform better and use fewer computational resources.
- Separability gives practical insight into whether or not the embedding output can be separated into meaningful genomic groups (i.e., variants).
- Preservation of semantic distance is important for determining if the structural representation of information remains valid for subsequent tasks.
- Sub-methods used for intrinsic evaluation include Singular Value Decomposition (SVD), distance matrices, radial dendrograms, and Principal Component Analysis (a redundancy sketch follows this list).
:::
:::
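::: {style="font-size: 0.6em;"}
A minimal sketch of the redundancy check via SVD (the matrix `emb` is a simulated stand-in for the $n \times 512$ GenSLM embedding matrix):
:::
```{r, echo=TRUE, eval=FALSE}
#| class-source: small-code
#| classes: small-code
set.seed(1)
emb <- matrix(rnorm(100 * 512), nrow = 100)
s <- svd(emb)$d                  # singular values
cpe <- cumsum(s^2) / sum(s^2)    # cumulative proportion of energy (CPE)
which(cpe >= 0.95)[1]            # dimensions needed to capture 95% of the energy
```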
## Extrinsic Evaluation
::: incremental
::: {style="font-size: 0.6em;"}
- Extrinsic evaluation uses practical benchmarks to assess the performance of an embedding algorithm, via its associated embedding matrix, across varying levels of task complexity.
- For GenSLM, the subset of possible tasks chosen were variant and protein classification.
- A Classification and Regression Tree (CART) was used to perform the classification (a minimal sketch follows this list).
:::
:::
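::: {style="font-size: 0.6em;"}
A minimal sketch of the classification setup with `rpart` (simulated stand-ins: `emb` for the embedding matrix, `variant` for the labels):
:::
```{r, echo=TRUE, eval=FALSE}
#| class-source: small-code
#| classes: small-code
library(rpart)
set.seed(1)
emb <- matrix(rnorm(200 * 512), nrow = 200)
variant <- factor(sample(c("alpha", "delta", "omicron"), 200, replace = TRUE))
dat <- data.frame(variant, emb)
fit <- rpart(variant ~ ., data = dat, method = "class")  # CART learner
pred <- predict(fit, dat, type = "class")
mean(pred == dat$variant)  # in-sample classification rate
```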
# [Results]{style="color:white;"}{background-color="#8FA9C2"}
## [Intrinsic Evaluation Results]{style="color:#6FB1D1;"}
### [Redundancy]{style="color:#A9D0E3;"}
#### [Singular Value Decomposition]{style="font-size: 0.8em;"}
::: {style="text-align:center; font-size: 0.5em;"}
![Figure 13: SVD CPE Plot](images/SVD_582_cpe_2.png){width="92%"}
:::
### [Preservation of Semantic Distance]{style="color:#A9D0E3;"}
#### Distance Matrices
<!--
![](/images/variant_labels/alpha.png){.absolute top=675 left=110 width="20.55" height="16.91"}
-->
<!--X axis labels-->
<!--
![](images/variant_labels/beta.png){.absolute top=675 left=228 width="18.17" height="29.48"}
![](images/variant_labels/delta.png){.absolute top=675 left=340 width="14.64" height="24.91"}
![](images/variant_labels/epsilon.png){.absolute top=675 left=455 width="15.74" height="21.83"}
![](images/variant_labels/gamma.png){.absolute top=675 left=575 width="20.07" height="25.14"}
![](images/variant_labels/mu.png){.absolute top=675 left=692 width="20.71" height="25.14"}
![](images/variant_labels/omicron.png){.absolute top=675 left=815 width="21.45" height="22.91"}
-->
<!--Y axis labels-->
<!--
![](images/variant_labels/alpha_90degrees.png){.absolute top=630 left=37.5 width="16.91" height="20.55"}
![](images/variant_labels/beta_90degrees.png){.absolute top=580 left=32 width="29.48" height="18.17"}
![](images/variant_labels/delta_90degrees.png){.absolute top=538 left=32 width="24.91" height="14.64"}
![](images/variant_labels/epsilon_90degrees.png){.absolute top=490 left=32 width="21.83" height="15.74"}
![](images/variant_labels/gamma_90degrees.png){.absolute top=440 left=32 width="25.14" height="20.07"}
![](images/variant_labels/mu_90degrees.png){.absolute top=390 left=32 width="25.14" height="20.71"}
![](images/variant_labels/omicron_90degrees.png){.absolute top=340 left=32 width="22.91" height="21.45"}
-->
::: panel-tabset
##### Sequence
::: {style="margin-top: 0px;"}
:::
::: {style="font-size: 0.6em;"}
- There is a lot of heterogeneity in the differences among sequences at fine scales.
- Notice the L-shaped striations in Alpha sequences compared to other variants.
:::
<iframe class="plotlyFrame" src="./seq_dist_matrix.html" style="border: none;"></iframe>
##### Embeddings
::: {style="margin-top: 5.5px;"}
:::
::: {style="font-size: 0.6em;"}
- Comparing the sequence distance matrix to the embedding distance matrix, we see that finer differences among sequences are lost in the embedding transformation.
:::
<iframe class="plotlyFrame" src="./emb_dist_matrix.html" style="border: none;"></iframe>
##### Abs. Diff.
::: {style="margin-top: 5px;"}
:::
::: {style="font-size: 0.6em;"}
- The absolute difference matrix is the absolute difference between the sequence distance matrix and the embedding distance matrix (a sketch of these computations follows the tabs).
:::
<iframe class="plotlyFrame" src="./abs_diff_dist_matrix.html" style="border: none;"></iframe>
##### Reg. Diff.
::: {style="margin-top: 18px;"}
:::
::: {style="font-size: 0.6em;"}
- The difference distance matrix is the sequence distance matrix minus the embedding distance matrix.
:::
::: {style="margin-top: 18px;"}
:::
<iframe class="plotlyFrame" src="./diff_dist_matrix.html" style="border: none;"></iframe>
:::
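::: {style="font-size: 0.6em;"}
A minimal sketch of how these four matrices relate, on simulated stand-ins (`seqs` for aligned sequences, `emb` for their embeddings); the thesis' scaling may differ:
:::
```{r, echo=TRUE, eval=FALSE}
#| class-source: small-code
#| classes: small-code
set.seed(1)
seqs <- replicate(5, paste(sample(c("A", "C", "G", "T"), 100, replace = TRUE), collapse = ""))
emb  <- matrix(rnorm(5 * 512), nrow = 5)
hamming <- function(a, b) sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
n <- length(seqs)
D_seq <- outer(seq_len(n), seq_len(n),
               Vectorize(function(i, j) hamming(seqs[i], seqs[j])))  # sequence distances
D_emb <- as.matrix(dist(emb))                             # Euclidean embedding distances
abs_diff <- abs(D_seq / max(D_seq) - D_emb / max(D_emb))  # absolute difference (scaled)
reg_diff <- D_seq / max(D_seq) - D_emb / max(D_emb)       # regular difference (scaled)
```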
#### Radial Dendrograms (1)
::: incremental
::: {style="font-size: 0.6em;"}
- Radial dendrograms show how readily sequences cluster by their variant.
- Figures 12 and 13 in the next two slides show:
  - Radial dendrograms using the sequence data do not perfectly cluster sequences by variant.
  - Radial dendrograms using the embedded data perfectly cluster sequences by variant.
:::
:::
#### Radial Dendrograms (2)
![](images/Radial_dendrograms.png)
### [Separability]{style="color:#A9D0E3;"}
#### Principal Component Analysis
::: {style="margin-top: 70px;"}
:::
<iframe class="plotlyFrame" src="./v_pc_plotly.html" style="border: none;"></iframe>
## [Extrinsic Evaluation Results]{style="color:#6FB1D1;"}
### [Variant Classification]{style="color:#A9D0E3;"}
#### [Variant Embedding Classification (1)]{style="font-size: 0.8em;"}
::: incremental
::: {style="font-size: 0.5em;"}
- Variant classification with a CART learner achieved a 97.71% classification rate on aligned sequence embeddings.
:::
:::
. . .
![Figure 16: Variant CART Tree](images/Variant_CART_tree.png){width="67%"}
#### Variant Embedding Classification (2)
::: {style="text-align:center; font-size: 0.5em;"}
![Figure 17: Variant Misclassifications](images/Variant_missclassifcations.png){width="40%"}
:::
. . .
::: {style="font-size: 0.6em;"}
- Imagine if we could classify aggressive and non-aggressive cancer with this kind of accuracy.
:::
#### Variant One-Hot-Encoding Classification
. . .
::: incremental
::: {style="font-size: 0.6em;"}
- Variant classification with a CART learner achieved only a 10.28% classification rate on One-Hot-Encoded sequences.
:::
:::
. . .
{{< pdf images/CART_variant_binary_cropped.pdf width=900 height=400 >}}
:::{.notes}
- one-hot encoding Covid sequences (assuming all sequences are of length 30,000) creates a three-dimensional array of $n \times 30,000 \times 4$, with the 4 columns representing the 4 different bases:
- A: \([1, 0, 0, 0]\)
- C: \([0, 1, 0, 0]\)
- G: \([0, 0, 1, 0]\)
- T: \([0, 0, 0, 1]\)
where there are 30,000 rows to represent all 30,000 nucleotides in a Covid sequence; thus, the $n$ matrices of $30,000 \times 4$ comprise the multidimensional array, one for each patient (i.e., each whole-genome sequence).
:::
<!--
### Protein Classification
. . .
::: panel-tabset
#### Embedding
::: {style="font-size: 0.5em;"}
- 100% classification rate using the all 512 dimensions from the GenSLM embedding.
:::
{{< pdf images/protein_CART.gv.pdf width=900 height=400 >}}
#### PCA Reduced Embedding
::: {style="font-size: 0.5em;"}
- 100% classification rate using the PCA reduced feature matrix with 28 dimensions.
:::
{{< pdf images/protein_CART_29_PCA_components.gv.pdf width=900 height=400 >}}
:::
-->
## Thesis Question
::: {style="text-align:center; font-size: 0.85em;"}
How easily can information be extracted from GenSLM embeddings while maintaining its original integrity, and what is the quality of the produced vector embeddings?
:::
## Conclusion
::: incremental
::: {style="font-size: 0.6em;"}
- The study assessed GenSLM's embeddings for quality using intrinsic and extrinsic evaluations, focusing on redundancy, separability, and information preservation.
- While Covid DNA can be highly related, with as little as 0.067% of the genome differentiating some variants, the GenSLM embedding still indicated strong performance.
- GenSLM demonstrated strong performance in classification tasks, with a 97.71% success rate for variant classification and 100% for proteins using the CART learner, significantly outperforming One-Hot-Encoded matrices.
- GenSLM also excelled in separability.
- However, GenSLM also displayed suboptimal features:
  - GenSLM was not effective in preserving fine-scale genetic differences and distorted the genetic distances between certain viral variants.
- Overall, GenSLM has room to improve, considering its dimensional redundancy and only moderate success in preserving semantic distances.
- Future research should compare GenSLM with other neural network embedding algorithms and explore its utility in more diverse genomic analysis tasks to establish broader applicability.
:::
:::
# `r fontawesome::fa("hand-holding-heart", "white")` [Acknowledgements]{style="color:white; font-size: 70px;"} {background-color="#8FA9C2"}
::: incremental
::: {style="color:white; font-size: 0.6em;"}
- A big thank you to my Committee Tim Robinson, Shaun Wulff, Sasha Skiba and Nick Chia.
- An extra special thank you to Nick for sticking by me!
- Robert Petit, for introducing me to bioinformatics!
- Liudmila Mainzer, for all your support, training, and guidance!
- My cohort: Allie Midkiff, Oisín O'Gailin, Austin Watson, Sandra Biller, Daiven Francis, and Joe Crane.
- My partner Hana for all your support and patience!
- And another big thank you to Tim!
:::
:::
# [Thank you!!]{.center style="color:white;"}{background-color="#8FA9C2"}
# [Appendix]{style="color:white;"}{background-color="#8FA9C2"}
##
{{< pdf DHintz_PlanB_Final.pdf width=1000 height=650 >}}
## Word Embeddings
::: incremental
::: {style="font-size: 0.6em;"}
- Before we introduce GenSLM, let's look at a simpler application in natural language.
- If we consider English text, **How can we measure the similarity between words?**
- For example, what is the semantic difference between the words "King" and "Woman"?
- Let's demonstrate the semantic relationships of “King”, “Queen”, “Man”, “Woman”.
:::
:::
::: {.notes}
- Embedding algorithms need data (a corpus of text) to construct numeric vectors relating words and their meaning.
:::
## Word2vec Example
. . .
```{webr-r}
#| context: setup
options(max.print = 15)
# Download a dataset
url <- "https://raw.githubusercontent.com/DHintz137/Embedding_Presentation/main/raw-data/King_Man_Woman_Queen_pred.csv"
download.file(url, "King_Man_Woman_Queen_pred.csv")
embedding_raw <- read.csv("King_Man_Woman_Queen_pred.csv")
embedding <- t(data.frame(embedding_raw[, -1], row.names = embedding_raw[, 1]))
colnames(embedding) <- c("King", "Man", "Woman", "Queen")
# embedding <- as_tibble(embedding)
print_tidy <- function(x, ...) {
if (is.matrix(x) || length(dim(x)) > 1) {
# If x is a matrix or array, convert each column into a tibble column
out <- as_tibble(as.data.frame(x))
} else if (is.vector(x)) {
# If x is a vector, convert it into a single column tibble
out <- tibble(value = x)
} else {
# Fallback for other data types
out <- as_tibble(x)
}
print(out, ...)
}
```
```{r, echo=TRUE, eval=FALSE}
#| class-source: small-code
#| classes: small-code
## Pre-Run Code ##
library(word2vec)
# using British National Corpus http://vectors.nlpl.eu/repository/
model <- read.word2vec("British_National_Corpus_Vector_size_300_skipgram/model.bin", normalize = TRUE)
embedding <- predict(model, newdata = c("king_NOUN", "man_NOUN", "woman_NOUN", "queen_NOUN"), type = "embedding")
```
::: {style="margin-top: 20px;"}
:::
. . .
```{webr-r}
#| class-source: small-code
#| classes: small-code
print_tidy(embedding)  # the 300-dimensional vector for each word
print_tidy(embedding[, "King"] - embedding[, "Man"] + embedding[, "Woman"])  # analogy vector, ~ "Queen"
print_tidy(t(embedding[, "Man"]) %*% embedding[, "Woman"])  # inner-product similarity: Man vs. Woman
print_tidy(t(embedding[, "Man"]) %*% embedding[, "King"])   # inner-product similarity: Man vs. King
```
## Plotting Word Embeddings
. . .
::: {style="font-size: 0.6em;"}
- We can plot the vectors shown in the previous slide to demonstrate how word2vec produces embeddings that measure semantic similarity between words.
:::
. . .
::: {style="text-align:center; font-size: 0.5em;"}
![Figure 4: Word Embedding Vectors ](images/King_Man_Woman_Queen.png){width="70%"}
:::
#### Kullback–Leibler Divergence (KLD)
::: {style="text-align:center; font-size: 0.5em;"}
![Figure 15: CKL vs DKL](images/CKL_vs_DKL_ver2.png){width="85%"}
:::
# [References]{.center style="color:#464D58;"}{background-color="#8FA9C2"}