---
webr:
show-startup-message: false
packages: ['tibble', 'dplyr', 'feather']
format:
revealjs:
include-in-header:
- file: js_scripts.html
theme: [default, theme.scss]
scrollable: true
css: styles.css
transition: slide
highlight-style: github
slide-number: true
sc-sb-title: true
logo: images/quarto.png
footer-logo-link: "https://quarto.org"
chalkboard: true
resources:
- British_National_Corpus_Vector_size_300_skipgram/model.bin
bibliography: references.bib
csl: diabetologia.csl
filters:
- webr
- reveal-header
slide-level: 4
number-sections: false
engine: knitr
---
# [An Exploration of Information Loss in Transformer Embedding Spaces for Enhancing Predictive AI in Genomics]{
style="color:#FFFFFF;
font-size: 55px;
position: relative;
bottom: 60px;
text-shadow: -1px -1px 0 #000,
1px -1px 0 #000,
-1px 1px 0 #000,
1px 1px 0 #000;"
}{background-image="images/DALL_E_DNA_Image_edited.png" background-size="cover" background-color="#4f6952"}
<h1 style="color:#FFFFFF;
font-size: 28px;
text-align: center;
position: absolute;
top: 325px;
width: 100%;
text-shadow: -0.75px -0.75px 0 #000,
0.75px -0.75px 0 #000,
-0.75px 0.75px 0 #000,
                  0.75px 0.75px 0 #000;">Daniel Hintz <br> 2024-06-06</h1>
# Thesis Question
. . .
::: {style="text-align:center; font-size: 0.85em;"}
How easily can information be extracted from GenSLM embeddings while maintaining its original integrity, and what is the quality of the produced vector embeddings?
:::
::: {.notes}
- The jargon won't have meaning now, but it will very soon, i.e.:
- What is GenSLM
- What is an embedding
- Why do we care about the quality of an embedding
- What are real world applications
:::
# [Outline]{style="font-size: 55px;"}
::: {style="font-size: 0.65em;"}
- Thesis Question
- Background
- DNA
- Motivation
- GenSLM Embedding Algorithm
- Data and Processing
- Source and Cleaning
- Methodology
- Intrinsic and Extrinsic Evaluation
- Results
- Conclusion
:::
# [Background]{style="color:white;"}{background-color="#8FA9C2"}
## DNA
::: columns
::: {.column width="80%"}
::: incremental
::: {style="font-size: 0.6em;"}
- Deoxyribonucleic acid (DNA) is a molecule that contains the genetic code that provides the instructions for building and maintaining life.
- The structure of DNA can be thought of as rungs on a ladder (known as base pairs), involving the pairing of four nucleotides: Adenine (A), Cytosine (C), Guanine (G), and Thymine (T).
:::
:::
:::
::: {.column width="20%"}
::: {style="text-align:center; font-size: 0.5em;"}
![Figure 1: DNA Base Pairs; [@NHGRI2024]](images/DNA_diagram.png)
:::
:::
:::
## DNA Sequencing Technology
::: columns
::: {.column width="50%"}
::: incremental
::: {style="font-size: 0.6em;"}
- Someone gets tested for Covid; PCR is run to detect whether the assay is indeed positive.
- Positive tests are sequenced using a machine that is most likely either Illumina or Oxford Nanopore.
- The sequencer takes in the Covid sample and produces a digital copy of the DNA sequences.
:::
:::
:::
::: {.column width="50%"}
::: {style="text-align:center; font-size: 0.5em;"}
![Figure 2: Simplified Schema of Sequencing Process for SARS-CoV-2](images/DNA_sequencing.png)
:::
:::
:::
## SARS-CoV-2 and Proteins
::: columns
::: {.column width="40%"}
::: incremental
::: {style="font-size: 0.6em;"}
- SARS-CoV-2 has 29 proteins.
- Different proteins have different functions.
- Proteins are encoded from different sites of DNA.
- In studying embeddings, tasks have been either **whole genome oriented** or **protein specific**.
<!--
- In the context of SARS-CoV-2, proteins are responsible for the virus's structure and infectivity.
- Nonstructural Proteins primarily function in the replication and manipulation of the host cell's environment to favor viral replication.
-->
:::
:::
:::
::: {.column width="60%"}
::: {style="text-align:center; font-size: 0.5em;"}
![Figure 3: SARS-CoV-2 Proteins; Adapted from [@kandwal2023genetic], pg. 99](images/nsp1_str_non_str.png)
:::
:::
:::
## Mutations
::: columns
::: {.column width="40%"}
::: incremental
::: {style="font-size: 0.6em;"}
- A mutation in the context of DNA refers to a change in the nucleotide sequence of the genetic material of an organism.
- Mutations can be random or induced by external factors.
- For context, SARS-CoV-2 can mutate to become more infectious, increasing its chances of survival.
- There are many different types of mutations, but only substitutions are illustrated in this talk.
:::
:::
:::
::: {.column width="60%"}
::: {style="text-align:center; font-size: 0.5em;"}
![Figure 4: Substitution Mutations](images/mutation_example.png){width="45%"}
:::
:::
:::
:::{.notes}
- Spontaneous mutations: These arise naturally and randomly without any external influence, often due to errors in DNA replication. Errors might include mispairing of nucleotides or slippage during replication.
- Induced mutations: These are caused by external factors, known as mutagens, which can include physical agents like ultraviolet light and X-rays, or chemical agents such as certain drugs and pollutants.
**Why Mutations Occur**
Random errors: Many mutations result simply from random errors during DNA replication. DNA polymerase, the enzyme responsible for replicating DNA, is not perfect and occasionally makes mistakes.
Environmental stress: Exposure to environmental stressors like radiation, chemicals, and even biological agents can damage DNA and lead to mutations.
Evolutionary advantage: From an evolutionary perspective, mutations are a key mechanism of genetic variability and thus, evolution. While many mutations are neutral or harmful, some can confer advantages that lead to positive selection in a population.
:::
## Motivation
::: incremental
::: {style="font-size: 0.6em;"}
- This project came about from connecting with Nick Chia, a biophysicist at Argonne National Laboratory.
- Nick and his collaborators in 2020 (then at the Mayo Clinic) made an algorithm that predicted the trajectory of colorectal cancer mutations.
- However, this algorithm was slow and expensive to train.
- **The embeddings used in this algorithm were one of the biggest choke points**.
- New and improved ways are needed to embed large genomes into lower dimensions.
- For example, a one-hot encoding of the human genome has dimensions $n \times 3 \text{ billion} \times 4$, where $n$ is the number of patients, 3 billion is the number of rows (one per nucleotide), and the 4 columns represent the four possible bases (A, T, G, C); a minimal sketch follows this list.
- This is the **Big P, little n problem**; hence, dimension reduction via an embedding is extremely advantageous!
- Further work needs to be done in studying the quality of embeddings to help bring about more efficient real-world implementations, like precision medicine for colorectal cancer.
:::
:::
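::: {style="font-size: 0.6em;"}
As a minimal sketch of that one-hot representation (the helper `one_hot_dna()` is illustrative, not thesis code), each nucleotide becomes a row with a single 1 in the column of its base:
:::
```{r, echo=TRUE, eval=FALSE}
#| class-source: small-code
#| classes: small-code
# Hypothetical helper: one-hot encode a single short DNA string
one_hot_dna <- function(x) {
  bases <- c("A", "C", "G", "T")
  chars <- strsplit(x, "")[[1]]
  m <- t(vapply(chars, function(b) as.integer(bases == b), integer(4)))
  colnames(m) <- bases
  m
}
one_hot_dna("ACGT")  # 4 rows (positions) x 4 columns (bases)
```
::: {style="font-size: 0.6em;"}
Stacking one such matrix per patient yields the $n \times \text{length} \times 4$ array described above.
:::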
## Dimension Reduction
::: {style="font-size: 0.6em;"}
In the UW Statistics program what methods do we learn to apply dimension reduction?
:::
::: incremental
::: {style="font-size: 0.6em;"}
- Principal component analysis (PCA)!
- PCA, like an embedding, transforms the coordinate system of our data (see the sketch below).
- The difference in this project is that the dimension reduction is being applied to DNA.
:::
:::
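::: {style="font-size: 0.6em;"}
A minimal sketch of that coordinate transformation with `prcomp()` on simulated data (the matrix `X` is a stand-in, not project data):
:::
```{r, echo=TRUE, eval=FALSE}
#| class-source: small-code
#| classes: small-code
set.seed(1)
X <- matrix(rnorm(100 * 10), nrow = 100)   # 100 samples, 10 features
pca <- prcomp(X, center = TRUE, scale. = TRUE)
head(pca$x[, 1:2])                         # data re-expressed in the first 2 PCs
summary(pca)$importance[3, 1:2]            # cumulative proportion of variance
```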
## Embeddings
::: incremental
::: {style="font-size: 0.55em;"}
- An *embedding* is the name for the dense numeric vectors produced by *embedding algorithms*, such as GenSLM.
- "Embedding" generally refers to the embedding vector.
- Embeddings are representations of data in a lower-dimensional space that preserve some aspects of the original structure of the data.
- But why would you ever embed something?
- Embedding vectors are generally more computationally efficient.
:::
:::
::: {.notes}
- As an introduction, it will be easier to first walk through an example of an Embedding for natural language.
:::
## Sequence Embeddings
::: incremental
::: {style="font-size: 0.6em;"}
- Neural network embedding algorithms for genomic data include **GenSLM (the focus of this presentation)**, DNABERT2, and HyenaDNA.
- These embeddings are lossy encodings, meaning that some information is lost in the transformation.
- They can also distort structural relationships represented in the data.
:::
:::
::: {.notes}
- Just like we can represent words as vectors we can also represent genomic sequences as vectors.
:::
<!--
## Why Data Representation Matters
::: incremental
::: {style="font-size: 0.6em;"}
- An embedding is one way of representing data.
:::
:::
. . .
::: {style="text-align:center; font-size: 0.5em;"}
![](images/pred_rep_obj.png){width="65%"}
:::
::: {style="margin-top: 20px;"}
:::
::: incremental
::: {style="font-size: 0.6em;"}
- An effective representation (embedding) of a genomic sequence is crucial for all downstream tasks.
- i.e., clustering, classification, regression, protein function identification, structural analysis, and predicting genetic disorders [@angermueller2016deep].
- Given the increasing use of embeddings, embedding evaluation has also become more important.
:::
:::
-->
## GenSLM
::: panel-tabset
### Overview
::: columns
::: {.column width="40%"}
::: incremental
::: {style="font-size: 0.6em;"}
- Overall, the GenSLM algorithm first **tokenizes**, i.e., breaks up each sequence into chunks of three nucleotides (a tokenization sketch follows below).
- Then, the input sequences are **vectorized**, creating a $1 \times 512$ vector for each input sequence.
:::
:::
:::
::: {.column width="60%"}
::: {style="text-align:center; font-size: 0.5em;"}
![Figure 5: Generic Embedding Workflow](images/Tokenize_vectorize.png){width="50%"}
:::
:::
:::
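::: {style="font-size: 0.6em;"}
A minimal sketch of the tokenization step (the helper `tokenize_codons()` is illustrative, not GenSLM's actual tokenizer):
:::
```{r, echo=TRUE, eval=FALSE}
#| class-source: small-code
#| classes: small-code
# Split a sequence into non-overlapping 3-nucleotide (codon) tokens
tokenize_codons <- function(x) {
  starts <- seq(1, nchar(x) - 2, by = 3)
  vapply(starts, function(i) substr(x, i, i + 2), character(1))
}
tokenize_codons("ATGGCGTTTACA")  # "ATG" "GCG" "TTT" "ACA"
```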
### Transformer
::: columns
::: {.column width="40%"}
::: incremental
::: {style="font-size: 0.6em;"}
- First, the tokenized input sequence is passed to the transformer encoder.
- The transformer encoder converts 1,024 bp slices into numeric vectors (a slicing sketch follows below).
- These vectors are run recursively through a diffusion model to learn a condensed distribution of the whole sequence.
- The transformers in GenSLM are used to capture local interactions within a genomic sequence.
:::
:::
:::
::: {.column width="60%"}
::: {style="text-align:center; font-size: 0.5em;"}
![Figure 6: GenSLM Transformer Architecture; [@genslms2023]](images/GenSLM.png)
:::
:::
:::
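::: {style="font-size: 0.6em;"}
A minimal sketch of slicing a genome into 1,024 bp windows (simulated genome; not GenSLM's internal code):
:::
```{r, echo=TRUE, eval=FALSE}
#| class-source: small-code
#| classes: small-code
set.seed(1)
genome <- paste(sample(c("A", "C", "G", "T"), 30000, replace = TRUE), collapse = "")
starts <- seq(1, nchar(genome), by = 1024)
slices <- vapply(starts, function(i) substr(genome, i, min(i + 1023, nchar(genome))),
                 character(1))
length(slices)  # number of 1,024 bp slices passed to the transformer encoder
```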
### Diffusion + Transformer
::: incremental
::: {style="font-size: 0.6em;"}
- Diffusion and Transformer models work in tandem in GenSLM.
- Transformers capture local interactions, while the diffusion model learns a representation over the transformer vectors to represent the whole genome.
:::
:::
:::
# [Downstream]{style="color:white;"}{background-color="#8FA9C2"}
## Downstream
::: incremental
::: {style="font-size: 0.6em;"}
- Consider our colorectal cancer example: in the context of precision medicine, you may want to classify aggressive and non-aggressive cancers.
- Or in the context of SARS-CoV-2, you may want to be able to monitor changes in variants and how they differ from previous variants.
- Assuming embeddings in both examples, the tasks are considered **downstream** from the process of creating the embeddings from the original data.
:::
:::
## Raw DNA Versus Embedded Data
```{r}
library(DT)
library(dplyr) # provides the %>% pipe used below
variant_data_subset <- read.csv("https://raw.githubusercontent.com/DHintz137/Embedding_Presentation/main/raw-data/variant_data_subset.csv")
font.size <- "10pt"
variant_data_subset %>%
  DT::datatable(
    options = list(
      initComplete = htmlwidgets::JS(
        "function(settings, json) {",
        paste0("$(this.api().table().container()).css({'font-size': '", font.size, "'});"),
        "}")
    )
  )
```
## Project Workflow
::: {style="text-align:center; font-size: 0.5em;"}
![Figure 7: Project Workflow](images/Overall_workflow.png){width="30%"}
:::
# [Data and Processing]{style="color:white;"}{background-color="#8FA9C2"}
## Data Description
::: incremental
::: {style="font-size: 0.6em;"}
- All Covid sequences were downloaded with permission from the Global Initiative on Sharing All Influenza Data (GISAID).
- No geographical or temporal restrictions were placed on the sequences extracted.
:::
:::
## Data Cleaning
::: incremental
::: {style="font-size: 0.6em;"}
- GISAID's exclusion criteria were used to remove sequences of poor quality.
- Additional exclusion criteria were implemented to remove all sequences with ambiguous nucleotides (the `NA`s of the genomic world) and large gap sizes post-alignment.
:::
:::
## Multiple Sequence Alignment
::: incremental
::: {style="font-size: 0.6em;"}
- A multiple sequence alignment was performed so that the exact locations of proteins could be sliced out across different sequences.
:::
:::
. . .
::: {style="text-align:center; font-size: 0.5em;"}
![Figure 18: Un-aligned Sequences](images/Unaligned_example.png){width="30%"}
:::
. . .
::: {style="text-align:center; font-size: 0.5em;"}
![Figure 19: Aligned Sequences](images/aligned_example.png){width="30%"}
:::
# [Exploratory Data Analysis]{style="color:white;"}{background-color="#8FA9C2"}
<!--
## [What are the Patterns and Structures Oberved from the Sequence Data?]{style="font-size: 52px;"}
::: incremental
::: {style="font-size: 0.6em;"}
- For Exploratory Data Analysis, the goal was the characterize the structure of the data in its sequence representation.
- This laid the groundwork for subsequent comparisons with the embedding structure.
:::
:::
. . .
::: {style="text-align:center; font-size: 0.5em;"}
![Figure 8: Mutation Boxplot](images/Whole_Genome_Mutation_Count_Hamming_Distances.png){width="50%"}
:::
-->
##
::: {style="text-align:center; font-size: 0.5em;"}
![Figure 11: Hamming Distance of Aligned Proteins to Wuhan Reference Sequence](images/hamming_prot_heatmap_581_all_variants.png){width="100%"}
:::
# [Methodology]{style="color:white;"}{background-color="#8FA9C2"}
## Evaluating Embeddings
::: incremental
::: {style="font-size: 0.6em;"}
- Broadly speaking, the quality of an embedding is assessed based on its information richness and its degree of non-redundancy.
- There are two main methodologies for evaluating embeddings:
- When downstream learning tasks are evaluated, it’s referred to as **extrinsic evaluation** [@lavravc2021representation].
- When the qualities of the embedding matrix itself are assessed it is referred to as **intrinsic evaluation** [@lavravc2021representation].
:::
:::
## Intrinsic Evaluation
::: incremental
::: {style="font-size: 0.6em;"}
- There are three main qualities to assess when studying the quality of an embedding: **Redundancy**, **Separability**, and **Preservation of Semantic Distance**.
- Redundancy reveals the efficiency of an embedding's encoding; more efficient encodings tend to perform better and use fewer computational resources.
- Separability gives practical insight into whether or not the embedding output can be separated into meaningful genomic groups (i.e., variants).
- Preservation of semantic distance is important for determining if the structural representation of information remains valid for subsequent tasks.
- Sub-methods used for intrinsic evaluation include Singular Value Decomposition (SVD), distance matrices, radial dendrograms, and Principal Component Analysis (a redundancy sketch follows this list).
:::
:::
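::: {style="font-size: 0.6em;"}
A minimal sketch of the redundancy check via SVD (the matrix `emb` is a simulated stand-in for the $n \times 512$ GenSLM embedding matrix):
:::
```{r, echo=TRUE, eval=FALSE}
#| class-source: small-code
#| classes: small-code
set.seed(1)
emb <- matrix(rnorm(100 * 512), nrow = 100)
s <- svd(emb)$d                  # singular values
cpe <- cumsum(s^2) / sum(s^2)    # cumulative proportion of energy (CPE)
which(cpe >= 0.95)[1]            # dimensions needed to capture 95% of the energy
```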
## Extrinsic Evaluation
::: incremental
::: {style="font-size: 0.6em;"}
- Extrinsic evaluation uses practical benchmarks to assess the performance of an embedding algorithm, via its associated embedding matrix, across varying levels of task complexity.
- For GenSLM, the subset of possible tasks chosen were variant and protein classification.
- A Classification and Regression Tree (CART) was used to perform the classification (a minimal sketch follows this list).
:::
:::
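::: {style="font-size: 0.6em;"}
A minimal sketch of the classification setup with `rpart` (simulated stand-ins: `emb` for the embedding matrix, `variant` for the labels):
:::
```{r, echo=TRUE, eval=FALSE}
#| class-source: small-code
#| classes: small-code
library(rpart)
set.seed(1)
emb <- matrix(rnorm(200 * 512), nrow = 200)
variant <- factor(sample(c("alpha", "delta", "omicron"), 200, replace = TRUE))
dat <- data.frame(variant, emb)
fit <- rpart(variant ~ ., data = dat, method = "class")  # CART learner
pred <- predict(fit, dat, type = "class")
mean(pred == dat$variant)  # in-sample classification rate
```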
# [Results]{style="color:white;"}{background-color="#8FA9C2"}
## [Intrinsic Evaluation Results]{style="color:#6FB1D1;"}
### [Redundancy]{style="color:#A9D0E3;"}
#### [Singular Value Decomposition]{style="font-size: 0.8em;"}
::: {style="text-align:center; font-size: 0.5em;"}
![Figure 13: SVD CPE Plot](images/SVD_582_cpe_2.png){width="92%"}
:::
### [Preservation of Semantic Distance]{style="color:#A9D0E3;"}
#### Distance Matrices
<!--
![](/images/variant_labels/alpha.png){.absolute top=675 left=110 width="20.55" height="16.91"}
-->
<!--X axis labels-->
<!--
![](images/variant_labels/beta.png){.absolute top=675 left=228 width="18.17" height="29.48"}
![](images/variant_labels/delta.png){.absolute top=675 left=340 width="14.64" height="24.91"}
![](images/variant_labels/epsilon.png){.absolute top=675 left=455 width="15.74" height="21.83"}
![](images/variant_labels/gamma.png){.absolute top=675 left=575 width="20.07" height="25.14"}
![](images/variant_labels/mu.png){.absolute top=675 left=692 width="20.71" height="25.14"}
![](images/variant_labels/omicron.png){.absolute top=675 left=815 width="21.45" height="22.91"}
-->
<!--Y axis labels-->
<!--
![](images/variant_labels/alpha_90degrees.png){.absolute top=630 left=37.5 width="16.91" height="20.55"}
![](images/variant_labels/beta_90degrees.png){.absolute top=580 left=32 width="29.48" height="18.17"}
![](images/variant_labels/delta_90degrees.png){.absolute top=538 left=32 width="24.91" height="14.64"}
![](images/variant_labels/epsilon_90degrees.png){.absolute top=490 left=32 width="21.83" height="15.74"}
![](images/variant_labels/gamma_90degrees.png){.absolute top=440 left=32 width="25.14" height="20.07"}
![](images/variant_labels/mu_90degrees.png){.absolute top=390 left=32 width="25.14" height="20.71"}
![](images/variant_labels/omicron_90degrees.png){.absolute top=340 left=32 width="22.91" height="21.45"}
-->
::: panel-tabset
##### Sequence
::: {style="margin-top: 0px;"}
:::
::: {style="font-size: 0.6em;"}
- There is a lot of heterogeneity in the differences among sequences at fine scales.
- Notice the L-shaped striations in Alpha sequences compared to other variants.
:::
<iframe class="plotlyFrame" src="./seq_dist_matrix.html" style="border: none;"></iframe>
##### Embeddings
::: {style="margin-top: 5.5px;"}
:::
::: {style="font-size: 0.6em;"}
- Comparing the sequence distance matrix to the embedding distance matrix, we see that finer differences among sequences are lost in the embedding transformation.
:::
<iframe class="plotlyFrame" src="./emb_dist_matrix.html" style="border: none;"></iframe>
##### Abs. Diff.
::: {style="margin-top: 5px;"}
:::
::: {style="font-size: 0.6em;"}
- The absolute difference matrix is the absolute difference between the sequence distance matrix and the embedding distance matrix (a sketch of these computations follows the tabs).
:::
<iframe class="plotlyFrame" src="./abs_diff_dist_matrix.html" style="border: none;"></iframe>
##### Reg. Diff.
::: {style="margin-top: 18px;"}
:::
::: {style="font-size: 0.6em;"}
- The difference distance matrix is the sequence distance matrix minus the embedding distance matrix.
:::
::: {style="margin-top: 18px;"}
:::
<iframe class="plotlyFrame" src="./diff_dist_matrix.html" style="border: none;"></iframe>
:::
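::: {style="font-size: 0.6em;"}
A minimal sketch of how these four matrices relate, on simulated stand-ins (`seqs` for aligned sequences, `emb` for their embeddings); the thesis' scaling may differ:
:::
```{r, echo=TRUE, eval=FALSE}
#| class-source: small-code
#| classes: small-code
set.seed(1)
seqs <- replicate(5, paste(sample(c("A", "C", "G", "T"), 100, replace = TRUE), collapse = ""))
emb  <- matrix(rnorm(5 * 512), nrow = 5)
hamming <- function(a, b) sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
n <- length(seqs)
D_seq <- outer(seq_len(n), seq_len(n),
               Vectorize(function(i, j) hamming(seqs[i], seqs[j])))  # sequence distances
D_emb <- as.matrix(dist(emb))                             # Euclidean embedding distances
abs_diff <- abs(D_seq / max(D_seq) - D_emb / max(D_emb))  # absolute difference (scaled)
reg_diff <- D_seq / max(D_seq) - D_emb / max(D_emb)       # regular difference (scaled)
```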
#### Radial Dendrograms (1)
::: incremental
::: {style="font-size: 0.6em;"}
- Radial dendrograms show how readily sequences cluster by their variant.
- Figures 12 and 13 in the next two slides show:
  - Radial dendrograms using the sequence data do not perfectly cluster sequences by variant.
  - Radial dendrograms using the embedded data perfectly cluster sequences by variant.
:::
:::
#### Radial Dendrograms (2)
![](images/Radial_dendrograms.png)
### [Separability]{style="color:#A9D0E3;"}
#### Principal Component Analysis
::: {style="margin-top: 70px;"}
:::
<iframe class="plotlyFrame" src="./v_pc_plotly.html" style="border: none;"></iframe>
## [Extrinsic Evaluation Results]{style="color:#6FB1D1;"}
### [Variant Classification]{style="color:#A9D0E3;"}
#### [Variant Embedding Classification (1)]{style="font-size: 0.8em;"}
::: incremental
::: {style="font-size: 0.5em;"}
- Variant classification with a CART learner achieved a 97.71% classification rate on aligned sequence embeddings.
:::
:::
. . .
![Figure 16: Variant CART Tree](images/Variant_CART_tree.png){width="67%"}
#### Variant Embedding Classification (2)
::: {style="text-align:center; font-size: 0.5em;"}
![Figure 17: Variant Misclassifications](images/Variant_missclassifcations.png){width="40%"}
:::
. . .
::: {style="font-size: 0.6em;"}
- Imagine if we could classify aggressive and non-aggressive cancer with this kind of accuracy.
:::
#### Variant One-Hot-Encoding Classification
. . .
::: incremental
::: {style="font-size: 0.6em;"}
- Variant classification with a CART learner achieved only a 10.28% classification rate on One-Hot-Encoded sequences.
:::
:::
. . .
{{< pdf images/CART_variant_binary_cropped.pdf width=900 height=400 >}}
:::{.notes}
- one-hot encoding Covid sequences (assuming all sequences are of length 30,000) creates a three-dimensional array of $n \times 30,000 \times 4$, with the 4 columns representing the 4 different bases:
- A: \([1, 0, 0, 0]\)
- C: \([0, 1, 0, 0]\)
- G: \([0, 0, 1, 0]\)
- T: \([0, 0, 0, 1]\)
where there are 30,000 rows to represent all 30,000 nucleotides in a Covid sequence; thus, the $n$ matrices of $30,000 \times 4$ comprise the multidimensional array, one for each patient (i.e., each whole-genome sequence).
:::
<!--
### Protein Classification
. . .
::: panel-tabset
#### Embedding
::: {style="font-size: 0.5em;"}
- 100% classification rate using the all 512 dimensions from the GenSLM embedding.
:::
{{< pdf images/protein_CART.gv.pdf width=900 height=400 >}}
#### PCA Reduced Embedding
::: {style="font-size: 0.5em;"}
- 100% classification rate using the PCA reduced feature matrix with 28 dimensions.
:::
{{< pdf images/protein_CART_29_PCA_components.gv.pdf width=900 height=400 >}}
:::
-->
## Thesis Question
::: {style="text-align:center; font-size: 0.85em;"}
How easily can information be extracted from GenSLM embeddings while maintaining its original integrity, and what is the quality of the produced vector embeddings?
:::
## Conclusion
::: incremental
::: {style="font-size: 0.6em;"}
- The study assessed GenSLM's embeddings for quality using intrinsic and extrinsic evaluations, focusing on redundancy, separability, and information preservation.
- While Covid DNA can be highly related, with as little as 0.067% of the genome differentiating some variants, the GenSLM embedding still indicated strong performance.
- GenSLM demonstrated strong performance in classification tasks, with a 97.71% success rate for variant classification and 100% for proteins using the CART learner, significantly outperforming One-Hot-Encoded matrices.
- GenSLM also excelled in separability.
- However, GenSLM also displayed suboptimal features:
  - GenSLM was not effective in preserving fine-scale genetic differences and distorted the genetic distances between certain viral variants.
- Overall, GenSLM has room to improve, considering its dimensional redundancy and only moderate success in preserving semantic distances.
- Future research should compare GenSLM with other neural network embedding algorithms and explore its utility in more diverse genomic analysis tasks to establish broader applicability.
:::
:::
# `r fontawesome::fa("hand-holding-heart", "white")` [Acknowledgements]{style="color:white; font-size: 70px;"} {background-color="#8FA9C2"}
::: incremental
::: {style="color:white; font-size: 0.6em;"}
- A big thank you to my Committee Tim Robinson, Shaun Wulff, Sasha Skiba and Nick Chia.
- An extra special thank you to Nick for sticking by me!
- Robert Petit, for introducing me to bioinformatics!
- Liudmila Mainzer, for all your support, training, and guidance!
- My cohort: Allie Midkiff, Oisín O'Gailin, Austin Watson, Sandra Biller, Daiven Francis, and Joe Crane.
- My partner Hana for all your support and patience!
- And another big thank you to Tim!
:::
:::
# [Thank you!!]{.center style="color:white;"}{background-color="#8FA9C2"}
# [Appendix]{style="color:white;"}{background-color="#8FA9C2"}
##
{{< pdf DHintz_PlanB_Final.pdf width=1000 height=650 >}}
## Word Embeddings
::: incremental
::: {style="font-size: 0.6em;"}
- Before we introduce GenSLM, let's look at a simpler application in natural language.
- If we consider English text, **How can we measure the similarity between words?**
- For example, what is the semantic difference between the words "King" and "Woman"?
- Let's demonstrate the semantic relationships of “King”, “Queen”, “Man”, “Woman”.
:::
:::
::: {.notes}
- Embedding algorithms need data (a corpus of text) to construct numeric vectors relating words and their meaning.
:::
## Word2vec Example
. . .
```{webr-r}
#| context: setup
options(max.print = 15)
# Download a dataset
url <- "https://raw.githubusercontent.com/DHintz137/Embedding_Presentation/main/raw-data/King_Man_Woman_Queen_pred.csv"
download.file(url, "King_Man_Woman_Queen_pred.csv")
embedding_raw <- read.csv("King_Man_Woman_Queen_pred.csv")
embedding <- t(data.frame(embedding_raw[, -1], row.names = embedding_raw[, 1]))
colnames(embedding) <- c("King", "Man", "Woman", "Queen")
# embedding <- as_tibble(embedding)
print_tidy <- function(x, ...) {
if (is.matrix(x) || length(dim(x)) > 1) {
# If x is a matrix or array, convert each column into a tibble column
out <- as_tibble(as.data.frame(x))
} else if (is.vector(x)) {
# If x is a vector, convert it into a single column tibble
out <- tibble(value = x)
} else {
# Fallback for other data types
out <- as_tibble(x)
}
print(out, ...)
}
```
```{r, echo=TRUE, eval=FALSE}
#| class-source: small-code
#| classes: small-code
## Pre-Run Code ##
library(word2vec)
# using British National Corpus http://vectors.nlpl.eu/repository/
model <- read.word2vec("British_National_Corpus_Vector_size_300_skipgram/model.bin", normalize = TRUE)
embedding <- predict(model, newdata = c("king_NOUN", "man_NOUN", "woman_NOUN", "queen_NOUN"), type = "embedding")
```
::: {style="margin-top: 20px;"}
:::
. . .
```{webr-r}
#| class-source: small-code
#| classes: small-code
print_tidy(embedding)  # the 300-dimensional vector for each word
print_tidy(embedding[, "King"] - embedding[, "Man"] + embedding[, "Woman"])  # analogy vector, ~ "Queen"
print_tidy(t(embedding[, "Man"]) %*% embedding[, "Woman"])  # inner-product similarity: Man vs. Woman
print_tidy(t(embedding[, "Man"]) %*% embedding[, "King"])   # inner-product similarity: Man vs. King
```
## Plotting Word Embeddings
. . .
::: {style="font-size: 0.6em;"}
- We can plot the vectors shown in the previous slide to demonstrate how word2vec produces embeddings that measure semantic similarity between words.
:::
. . .
::: {style="text-align:center; font-size: 0.5em;"}
![Figure 4: Word Embedding Vectors ](images/King_Man_Woman_Queen.png){width="70%"}
:::
#### Kullback–Leibler Divergence (KLD)
::: {style="text-align:center; font-size: 0.5em;"}
![Figure 15: CKL vs DKL](images/CKL_vs_DKL_ver2.png){width="85%"}
:::
# [References]{.center style="color:#464D58;"}{background-color="#8FA9C2"}