
\documentclass[12pt,english]{article}
\usepackage{graphicx} % Required for inserting images
\usepackage{amsmath}  % Required for mathematical features
\usepackage{mathtools}
\usepackage{hyperref} % Recommended for including hyperlinks
\usepackage{booktabs} % For prettier tables
\usepackage{amssymb} % For using blackboard bold (like \mathbb{R})
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{ae,aecompl}
\renewcommand{\familydefault}{\rmdefault}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{geometry}
\geometry{verbose,tmargin=1in,bmargin=1in,lmargin=1in,rmargin=1in}


\title{Thesis}
\author{Max Weinberg}
\date{May 2024}

\begin{document}

\maketitle

\section{Methods}

\subsection{Animal Care}

\subsection{Viral Injections}

\subsection{Implantation}

\subsection{Fiber-Photometry}

\subsection{Tracking}


After each video was collected, it was immediately uploaded to Microsoft OneDrive to preserve the original recording. SLEAP (Pereira et al., 2019) was used to train a convolutional neural network to identify 22 key points on both mice in every frame. Training was iterative: the network was trained on a set of labeled frames, predictions were generated, and the worst-performing frames were then identified and labeled.
% TODO: insert SLEAP reference
% put in image of skeleton with numbers next to each node and list each node


After 5000 frames had been labeled, performance was deemed sufficient and training was stopped. Predictions for the key points on every frame of every video were then extracted and manually inspected to confirm that no identity swaps had occurred.
%Include all model parameters in a nice little table? Organized with other model parameters used? 

%Include Some plot which is MAP on Y and on X is Num frames labeled

%Include final model diagnostics plot

\subsection{Autoencoder}

Despite these efforts to mitigate errors in the generation of the tracks, two major problems in the tracking remained. The first was missing points on frames where one mouse was occluded. The second was frame-to-frame jitter introduced by the use of confidence maps. To fix these problems, we employed a denoising temporal convolutional autoencoder.

An autoencoder consists of two feedforward neural networks: an encoder and a decoder. Let \( D \in \mathbb{R}^{N \times M} \) be a dataset, where \( N \) is the number of frames and \( M \) is the number of nodes. The encoder compresses \( D \) into a latent representation \( Z \), defined as:
\[ Z = f_\theta(D) \]
where \( f_\theta \) is a parametric function (the encoder) characterized by parameters \( \theta \). The decoder then attempts to reconstruct the original dataset from this latent representation:
\[ \hat{D} = g_\phi(Z) \]
where \( g_\phi \) is another parametric function (the decoder) characterized by parameters \( \phi \). We trained the autoencoder to minimize the difference between \( D \) and \( \hat{D} \), using the mean squared error:
\[ L(\theta, \phi) = \frac{1}{N} \sum_{i=1}^{N} \|D_i - \hat{D}_i\|^2 \]

To effectively handle the temporal nature of the data we employed convolutional layers within the encoder. Convolutional layers are well-suited for this task because they can capture local temporal dependencies by applying kernels that slide over the input data, thus learning patterns and features that are invariant to translation.
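
As an illustration, a single one-dimensional temporal convolution can be written as below. This is a sketch only: the kernel width \( K \), the weights \( W_k \), the bias \( \mathbf{b} \), the number of output channels \( C \), and the nonlinearity \( \sigma \) are notation introduced here, not parameters reported in this work.
\[
H_t = \sigma\left( \sum_{k=0}^{K-1} W_k D_{t+k} + \mathbf{b} \right)
\]
where \( D_t \in \mathbb{R}^{M} \) is the \( t \)-th row of \( D \) written as a column vector, \( W_k \in \mathbb{R}^{C \times M} \), and \( \mathbf{b} \in \mathbb{R}^{C} \). Because the same weights \( W_k \) are applied at every \( t \), a given temporal pattern produces the same response wherever it occurs in the recording.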

By compressing the point data into \( Z \), the network was forced to learn the relationships between the points, so that it could infer where a missing point should be. Further, training with added noise sampled from \( \mathcal{N}(0, \text{Var}(D_j)) \), where \( \text{Var}(D_j) \) is the variance of the \( j \)-th column of \( D \), reduced the frame-wise jitter. A grid search over the bottleneck size was performed to minimize the test loss. All parameters are listed in the appendix.
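
A minimal sketch of the denoising objective follows, assuming the usual formulation in which the noise is added to the encoder input only and the loss is computed against the uncorrupted data. The symbols \( \tilde{D} \) and \( \varepsilon \) are introduced here for illustration.
\[
\tilde{D}_{ij} = D_{ij} + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim \mathcal{N}\left(0, \text{Var}(D_j)\right)
\]
\[
L(\theta, \phi) = \frac{1}{N} \sum_{i=1}^{N} \left\| D_i - g_\phi\left(f_\theta(\tilde{D}_i)\right) \right\|^2
\]
Under this assumption the network cannot minimize the loss by copying its input, so it must reconstruct each point from the configuration of the others.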

To allow the autoencoder to learn the relationships between the points, the data were first normalized. The normalization consisted of five steps:

1. \textbf{Ego-centering}: Each mouse's position data was centered by subtracting the centroid of the reference frame and the centroid of the individual frame:
   \[
   \mathbf{x}_{\text{mouse}}' = \mathbf{x}_{\text{mouse}} - \mathbf{c}_{\text{frame}}
   \]
   where \(\mathbf{c}_{\text{frame}} = \left(\frac{1}{n}\sum_{i=1}^n x_i, \frac{1}{n}\sum_{i=1}^n y_i\right)\) is the centroid of the frame. If a point \(i\) was missing, the centroid and rotational center were computed excluding that point.

2. \textbf{Rotational-center}: The points were aligned by rotating them to match the reference frame:
   \[
   \mathbf{x}_{\text{rot}} = R \mathbf{x}_{\text{mouse}}'
   \]
   where \(R\) is the rotation matrix:
   \[
   R = \begin{bmatrix}
   \cos\theta & -\sin\theta \\
   \sin\theta & \cos\theta
   \end{bmatrix}
   \]
   The angle \(\theta\) was determined based on the reference frame excluding the missing point \(i\).

3. \textbf{Normalization by area}: Assuming a mouse is a 2D ellipse, the semi-major and semi-minor axes were used to approximate the area, and the coordinates were scaled:
   \[
   A = \pi a b
   \]
   \[
   \text{scale factor} = \frac{1}{\sqrt{\text{trimmean}(A, 20)}}
   \]
   where \(a\) and \(b\) are the semi-major and semi-minor axes.

4. \textbf{Identification of NaNs}: Frames resulting in NaNs were filtered by checking the number of valid points.

5. \textbf{Repeat}: The process was repeated to further reduce noise by re-running the normalization after detecting outlying frames.
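
The five steps above can be summarized in the following sketch. The symbols \( s \), \( n_{\min} \), and \( \mathbf{x}_{\text{norm}} \) are introduced here for illustration, taking the trimmed mean over frames is an assumption, and the sketch does not specify how \( \theta \), \( a \), and \( b \) are estimated.

\begin{algorithm}
\caption{Pose normalization (illustrative sketch)}
\begin{algorithmic}[1]
\For{each frame and each mouse}
    \State \textbf{Compute} $\mathbf{c}_{\text{frame}}$ from the valid (non-missing) points
    \State $\mathbf{x}_{\text{mouse}}' = \mathbf{x}_{\text{mouse}} - \mathbf{c}_{\text{frame}}$
    \State \textbf{Determine} $\theta$ from the valid points and set $\mathbf{x}_{\text{rot}} = R\, \mathbf{x}_{\text{mouse}}'$
    \State \textbf{Estimate} $a$, $b$ and compute $A = \pi a b$
\EndFor
\State $s = 1 / \sqrt{\text{trimmean}(A, 20)}$ \Comment{trimmed mean over frames (assumed)}
\State $\mathbf{x}_{\text{norm}} = s\, \mathbf{x}_{\text{rot}}$ for every frame
\State \textbf{Flag} frames with fewer than $n_{\min}$ valid points or containing NaNs
\State \textbf{Repeat} from line 1 with the flagged frames excluded
\end{algorithmic}
\end{algorithm}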


\subsection{Feature Extraction}
To extract meaningful social behaviors from the cleaned data, a shared feature space was defined. This space consists of both within-animal and between-animal measurements. Both distances and angles were used, and spectrograms of both were generated.


\subsection{Distance Calculation}
Distances were computed between specific points on the animals and certain reference points, such as the centroid or the center of the bounding box. The centroids were computed using a trimmed mean to reduce the influence of outliers. The center of the bounding box was determined by the percentiles of the centroid coordinates. For each feature specified in the feature table, data points were collected, and distances were calculated using the Euclidean distance. The computed distances were then processed to remove outliers, fill missing values, and apply smoothing filters as specified in the feature table. Derivatives of the distance features were also computed if required. The process for calculating distances is outlined below:

\begin{algorithm}
\caption{Calculate Distance Features}
\begin{algorithmic}[1]
\State \textbf{Initialize} distance\_features as an empty structure
\State \textbf{Compute} centroids of resident and intruder using trimmed mean
\State \textbf{Calculate} box\_center based on the centroids

\For{each feature in feature\_table}
    \If{feature is included}
        \State \textbf{Collect} data for point 1 based on feature table
        \State \textbf{Collect} data for point 2 based on feature table
        \State \textbf{Compute} distance as:
        \[
        d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}
        \]
        \State \textbf{Truncate} data based on specified truncation style
        \State \textbf{Interpolate} to fill missing values if necessary
        \State \textbf{Apply} running average filter if specified
        \State \textbf{Store} computed distance in distance\_features
        \If{first derivative is included}
            \State \textbf{Compute} first derivative and store
            \If{second derivative is included}
                \State \textbf{Compute} second derivative and store
            \EndIf
        \EndIf
    \EndIf
\EndFor
\State \textbf{Return} distance\_features
\end{algorithmic}
\end{algorithm}
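
The smoothing and derivative steps in the algorithm above can be written explicitly. The following is a sketch under stated assumptions: a centered running average with an odd window length \( w \), and central finite differences with frame interval \( \Delta t \). Neither the window length nor the differencing scheme is specified in the text, so both are illustrative choices.
\[
\bar{d}_t = \frac{1}{w} \sum_{k=-(w-1)/2}^{(w-1)/2} d_{t+k}
\]
\[
\dot{d}_t \approx \frac{\bar{d}_{t+1} - \bar{d}_{t-1}}{2\,\Delta t}, \qquad
\ddot{d}_t \approx \frac{\bar{d}_{t+1} - 2\bar{d}_t + \bar{d}_{t-1}}{\Delta t^2}
\]
Smoothing before differentiation matters because finite differences amplify frame-to-frame noise.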


\newpage
\subsection{Angle Calculation}
Angles were computed between specific points on the animals and certain reference points, such as centroids and anatomical points. The process for calculating these angles is outlined below:

\begin{algorithm}
\caption{Calculate Angle Features}
\begin{algorithmic}[1]
\State \textbf{Initialize} angle\_features as an empty structure

\For{each feature in feature\_table}
    \If{feature is included}
        \State \textbf{Collect} data for point 1 based on feature table
        \State \textbf{Collect} data for vertex based on feature table
        \State \textbf{Collect} data for point 2 based on feature table
        
        % Calculate vectors between points
        \State $ \mathbf{v}_1 = \mathbf{p}_1 - \mathbf{p}_{\text{vertex}} $
        \State $ \mathbf{v}_2 = \mathbf{p}_2 - \mathbf{p}_{\text{vertex}} $
        \State \textbf{Compute} cosine of angle as:
        \[
        \cos(\theta) = \frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{\|\mathbf{v}_1\| \|\mathbf{v}_2\|}
        \]
        \State \textbf{Truncate} data based on specified truncation style
        \State \textbf{Interpolate} to fill missing values if necessary
        \State \textbf{Apply} running average filter if specified
        \State \textbf{Store} computed angle in angle\_features
        \If{first derivative is included}
            \State \textbf{Compute} first derivative and store
            \If{second derivative is included}
                \State \textbf{Compute} second derivative and store
            \EndIf
        \EndIf
    \EndIf
\EndFor
\State \textbf{Return} angle\_features
\end{algorithmic}
\end{algorithm}

Angles were calculated between three points: two points on the animals and a vertex point. The cosine of the angle was used to simplify subsequent steps and interpretations.
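
As a worked example with hypothetical coordinates, let the first point be \( (1, 0) \), the vertex \( (0, 0) \), and the second point \( (1, 1) \). The two vectors from the vertex are then \( (1, 0) \) and \( (1, 1) \), and
\[
\cos(\theta) = \frac{1 \cdot 1 + 0 \cdot 1}{1 \cdot \sqrt{2}} = \frac{1}{\sqrt{2}} \approx 0.707,
\]
corresponding to \( \theta = 45^\circ \). The feature equals \( 1 \) when the two vectors are parallel, \( 0 \) when they are perpendicular, and \( -1 \) when they point in opposite directions.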


\subsection{Wavelet Transform}
To capture time-varying components of behavior, a continuous wavelet transform (CWT) was performed on both the distances and the angles. Because the CWT expands the dimensionality of the feature vector by a factor equal to the number of scales used, each feature set was first reduced with principal component analysis (PCA).

Specifically, for the angle matrix, \( A \in \mathbb{R}^{N \times M_{a}} \), where \( N \) is the number of frames and \( M_{a} \) is the number of features for the angles, PCA was applied to transform \( A \) into a lower-dimensional subspace. The transformation is given by:
\[
A_{\text{PCA}} = A W
\]
where \( W \) is the matrix whose columns are the principal component vectors. The resulting matrix \( A_{\text{PCA}} \) retains the \( R \) most significant components, i.e. those capturing the most variance in the data, thereby reducing the dimensionality while preserving essential information.

The same procedure was applied to the distance matrix, \( D \in \mathbb{R}^{N \times M_{d}} \), where \( M_{d} \) is the number of features for the distances. By performing PCA, we obtained \( D_{\text{PCA}} = D V \), where \( V \) is the matrix of principal component vectors for the distances. For both the distances and the angles, the number of components retained was the smallest number that explained 99 percent of the variance.
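
The 99 percent criterion can be stated explicitly. In the sketch below, \( \lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_{M_a} \) denote the eigenvalues of the covariance matrix of the mean-centered angle features; this notation is introduced here for illustration.
\[
R = \min \left\{ r \;:\; \frac{\sum_{k=1}^{r} \lambda_k}{\sum_{k=1}^{M_a} \lambda_k} \geq 0.99 \right\}
\]
The same expression applies to the distances with \( M_a \) replaced by \( M_d \).
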

The continuous wavelet transform is defined as:
\[
W_x(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t) \psi^*\left(\frac{t - b}{a}\right) dt
\]
where \( x(t) \) is the signal, \( \psi(t) \) is the mother wavelet, \( a \) is the scale parameter, \( b \) is the translation parameter, and \( \psi^* \) denotes the complex conjugate of the mother wavelet.

For the mother wavelet, we used a Morlet wavelet, defined as:
\[
\psi(t) = \pi^{-\frac{1}{4}} e^{i \omega_0 t} e^{-\frac{t^2}{2}}
\]
where \( \omega_0 \) is the central frequency. 

The scales were dyadically spaced over the range corresponding to frequencies between \( f_{\min} \) and \( f_{\max} \), where \( f_{\max} \) is the Nyquist frequency (15 Hz):
\[
a = 2^j, \quad j \in \left[\log_2\left(\frac{1}{f_{\max}}\right), \log_2\left(\frac{1}{f_{\min}}\right)\right]
\]
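
As a worked example of the scale range, take \( f_{\max} = 15 \) Hz and a hypothetical \( f_{\min} = 0.5 \) Hz chosen purely for illustration:
\[
j_{\min} = \log_2\left(\frac{1}{15}\right) \approx -3.91, \qquad j_{\max} = \log_2\left(\frac{1}{0.5}\right) = 1,
\]
so the scales run from \( 2^{j_{\min}} = 1/15 \approx 0.067 \) s to \( 2^{j_{\max}} = 2 \) s. With \( S \) scales (again a hypothetical count), evenly spaced exponents
\[
j_k = j_{\min} + k \, \frac{j_{\max} - j_{\min}}{S - 1}, \qquad k = 0, \dots, S - 1,
\]
give consecutive scales that differ by a constant ratio.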


Once the dimensionality was reduced, the CWT was applied to the principal components of both angles and distances, capturing the time-frequency characteristics of the behavioral data. This approach allowed us to analyze the temporal dynamics of the features with reduced computational complexity.


% include figure of PC 1 - N
% include figure of wavelet transformed feature

\subsection{Dimensionality Reduction}
The expanded dataset is \( D \in \mathbb{R}^{N \times (M + a \times R)} \), where \( N \) is the number of data points, \( M \) is the original number of features, \( a \) is the number of scales used in the wavelet transform, and \( R \) is the number of features per scale. This dataset should now contain a meaningful representation of behavior. However, due to the inherent redundancy in the wavelet transform and the chosen features, it likely lies on a much lower-dimensional manifold.

To reduce the complexity and uncover the underlying manifold structure, we employed t-distributed stochastic neighbor embedding (t-SNE), following Berman et al., 2014. t-SNE is a nonlinear dimensionality reduction technique well suited to modeling local similarity in the data.

First, a distance matrix is constructed between all data points in the high-dimensional space. The distance function, \(F_{d}\), is defined by the user. A Gaussian kernel is then used to transform the distance matrix into a matrix of pairwise affinities, which are interpreted as probabilities: similar points are assigned higher probabilities and dissimilar points lower probabilities. The algorithm proceeds in the following steps:

1. \textbf{Joint Probability Distribution in High-Dimensional Space}: For each pair of data points \( (i, j) \) in the high-dimensional space, t-SNE calculates the conditional probability \( p_{j|i} \) that point \( i \) would pick point \( j \) as its neighbor if neighbors were chosen in proportion to their probability density under a Gaussian centered at \( i \). The conditional probabilities are then symmetrized to give the joint probabilities \( p_{ij} \).

2. \textbf{Joint Probability Distribution in Low-Dimensional Space}: Similarly, t-SNE defines a joint probability distribution \( q_{ij} \) over the points in the low-dimensional map using a Student's t-distribution with one degree of freedom.

3. \textbf{Kullback-Leibler Divergence Minimization}: The optimal low-dimensional representation is found by minimizing the Kullback-Leibler divergence between the joint probability distributions \( P \) (in high-dimensional space) and \( Q \) (in low-dimensional space). This is done using gradient descent, adjusting the positions of points in the low-dimensional space to best preserve the pairwise similarities from the high-dimensional space.
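
For reference, the three steps correspond to the following expressions from the standard t-SNE formulation. Here \( d_{ij} = F_d(\mathbf{x}_i, \mathbf{x}_j) \) is the distance between points \( i \) and \( j \), \( \mathbf{y}_i \) is the low-dimensional position of point \( i \), and \( \sigma_i \) is a per-point bandwidth that t-SNE sets from a user-chosen perplexity; the symbols \( \mathbf{x}_i \), \( \mathbf{y}_i \), and \( \sigma_i \) are notation introduced here.
\[
p_{j|i} = \frac{\exp\left(-d_{ij}^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-d_{ik}^2 / 2\sigma_i^2\right)}, \qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}
\]
\[
q_{ij} = \frac{\left(1 + \|\mathbf{y}_i - \mathbf{y}_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|\mathbf{y}_k - \mathbf{y}_l\|^2\right)^{-1}}
\]
\[
\text{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
\]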

The first step, constructing the distance matrix, poses a problem in our case because the features have different units. Specifically, \( D \) contains three types of quantity: distances, angles, and wavelet amplitudes. To allow the variance of each type to contribute meaningfully, a separate distance matrix was computed for each. The Euclidean distance was used for the distances, the cosine distance for the angles, and, because they are non-negative, the Kullback-Leibler divergence for the wavelet amplitudes. The resulting distance matrices were averaged, and t-SNE was then performed on the average.
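
A sketch of the combined distance follows. The superscripted vectors \( \mathbf{x}^{\text{dist}}_i \), \( \mathbf{x}^{\text{ang}}_i \), and \( \mathbf{x}^{\text{wav}}_i \) denote the distance, angle, and wavelet-amplitude parts of row \( i \) of \( D \); this notation is introduced here. The sketch assumes that each wavelet-amplitude vector is normalized to sum to one before the divergence is computed, and it leaves open whether the divergence is symmetrized.
\[
\Delta^{\text{dist}}_{ij} = \left\| \mathbf{x}^{\text{dist}}_i - \mathbf{x}^{\text{dist}}_j \right\|_2, \qquad
\Delta^{\text{ang}}_{ij} = 1 - \frac{\mathbf{x}^{\text{ang}}_i \cdot \mathbf{x}^{\text{ang}}_j}{\|\mathbf{x}^{\text{ang}}_i\| \, \|\mathbf{x}^{\text{ang}}_j\|}, \qquad
\Delta^{\text{wav}}_{ij} = \sum_{k} x^{\text{wav}}_{ik} \log \frac{x^{\text{wav}}_{ik}}{x^{\text{wav}}_{jk}}
\]
\[
\Delta_{ij} = \frac{1}{3} \left( \Delta^{\text{dist}}_{ij} + \Delta^{\text{ang}}_{ij} + \Delta^{\text{wav}}_{ij} \right)
\]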

Because t-SNE is non-parametric and therefore provides no explicit mapping for new data points, a training set of uniformly sampled points was created and a multilayer perceptron was trained on it to learn a map to the low-dimensional space.

\subsection{Behavior Identification}
Given the embedded space, we wish to identify high-density regions, as these likely represent stereotyped or frequently displayed behaviors. To do this, we first smooth the space using a two-dimensional Gaussian convolution and then apply a watershed algorithm to segment the space into regions around the local maxima of the probability density function. To choose the standard deviation of the Gaussian kernel, a desired number of behaviors is selected and the algorithm is run with increasing standard deviations until that number of distinct regions is obtained.
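
The smoothed density and the search over the kernel width can be sketched as follows. The density estimate is the standard Gaussian kernel form; the starting width \( \sigma_0 \), the increment \( \delta \), and the target count \( K \) are hypothetical names introduced for this sketch. The loop stops at the first width giving at most \( K \) regions, since the number of regions can skip values as \( \sigma \) grows.
\[
\rho_\sigma(\mathbf{y}) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{2\pi\sigma^2} \exp\left( -\frac{\|\mathbf{y} - \mathbf{y}_i\|^2}{2\sigma^2} \right)
\]

\begin{algorithm}
\caption{Selecting the kernel width (illustrative sketch)}
\begin{algorithmic}[1]
\Require embedded points $\mathbf{y}_1, \dots, \mathbf{y}_N$, desired number of regions $K$, initial width $\sigma_0$, increment $\delta$
\State $\sigma \gets \sigma_0$
\State \textbf{Evaluate} $\rho_\sigma$ on a grid and \textbf{apply} the watershed transform to $-\rho_\sigma$
\State $k \gets$ number of resulting regions
\While{$k > K$}
    \State $\sigma \gets \sigma + \delta$
    \State \textbf{Evaluate} $\rho_\sigma$ on the grid and \textbf{apply} the watershed transform to $-\rho_\sigma$
    \State $k \gets$ number of resulting regions
\EndWhile
\State \textbf{Return} the regions and $\sigma$
\end{algorithmic}
\end{algorithm}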


the brain is weird. % @ai: formalize this


\section{Appendix}

\subsection{Model Parameters}
\subsubsection{SLEAP CNN}

\begin{table}[ht]
\centering
\begin{tabular}{@{}lll@{}}
\toprule
\textbf{Parameter} & \textbf{Value} & \textbf{Description} \\ \midrule
Learning Rate & 0.01 & Linear scheduler. \\
Epochs & 50 & Iterations of backpropagation. \\
Batch Size & 32 & Training examples in one iteration. \\
Momentum & 0.9 & Accelerates SGD. \\
Optimizer & Adam & First-order gradient-based optimization. \\
\bottomrule
\end{tabular}
\caption{Model parameters used in the convolutional neural network.}
\label{tab:model_params}
\end{table}

\subsubsection{Autoencoder}

\begin{table}[ht]
\centering
\begin{tabular}{@{}lll@{}}
\toprule
\textbf{Parameter} & \textbf{Value} & \textbf{Description} \\ \midrule
Learning Rate & 0.01 & Linear scheduler. \\
Epochs & 50 & Iterations of backpropagation. \\
Batch Size & 32 & Training examples in one iteration. \\
Momentum & 0.9 & Accelerates SGD. \\
Optimizer & Adam & First-order gradient-based optimization. \\
\bottomrule
\end{tabular}
\caption{Model parameters used in the autoencoder.}
\label{tab:autoencoder_params}
\end{table}

\end{document}

