---
title: "Seurat_Tutorial_PBM"
output: html_notebook
---
In this tutorial, we will be looking at a dataset of Peripheral Blood Mononuclear Cells (PBMC). PBMCs are like the army of immune cells that live in our blood and fight off any invaders like viruses or bacteria.
Our dataset contains 2,700 single PBMCs that were sequenced on a fancy Illumina machine called the NextSeq 500. We'll use a function called Read10X() to read in the output of the 10X Genomics pipeline. This function returns a count matrix that tells us how many molecules of each gene were detected in each of the 2,700 PBMCs.
We'll then use this matrix to create a Seurat object. Think of this as creating a "fellowship of cells", like in Lord of the Rings.
Here's the code we'll use to load in the dataset and create our Seurat object:
```{r}
library(dplyr)
library(Seurat)
library(patchwork)
pbmc.data <- Read10X(data.dir = "pbmc3k/filtered_gene_bc_matrices/hg19/")
pbmc.data # 32738 x 2700 sparse Matrix of class "dgCMatrix"
# dgCMatrix is the "standard" class for sparse numeric matrices in the Matrix
# package: compressed, sparse, and column-oriented, with the non-zero elements
# in each column sorted into increasing row order.
```
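To get a feel for the sparse format, we can peek at a few genes in the first thirty cells. This is just a quick sketch; the three gene names assume the hg19 annotation loaded above.
```{r}
# A "." marks a zero: sparse matrices don't store zeros, which saves a lot of memory
pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30]
```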
```{r}
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc
```
Next, we'll need to do some pre-processing on our data to make sure it's ready for analysis. This involves selecting and filtering the best quality cells based on various metrics, normalizing and scaling the data, and identifying highly variable genes that will be important for our analysis.
Think of this as preparing our fellowship for battle by making sure everyone is healthy and equipped with the right weapons.
**Standard pre-processing workflow**
A few QC metrics commonly used by the community include:
- The number of unique genes detected in each cell
  - Low-quality cells or empty droplets will often have very few genes
  - Cell doublets or multiplets may exhibit an aberrantly high gene count
- The total number of molecules detected within a cell (correlates strongly with the number of unique genes)
- The percentage of reads that map to the mitochondrial genome
  - Low-quality / dying cells often exhibit extensive mitochondrial contamination
  - We calculate mitochondrial QC metrics with the PercentageFeatureSet() function, which computes the percentage of counts originating from a given set of features
  - We use the set of all genes starting with MT- as the set of mitochondrial genes
```{r}
# The [[ operator can add columns to object metadata. This is a great place to stash QC stats
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
head(pbmc@meta.data, 5)
```
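To see which genes the "^MT-" pattern actually matched, here's a quick check (the MT- prefix assumes human gene naming, as in this dataset):
```{r}
# List the mitochondrial genes matched by the "^MT-" pattern
grep("^MT-", rownames(pbmc), value = TRUE)
```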
In the example below, we visualize these QC metrics and use them to filter cells:
- We filter cells that have unique feature counts over 2,500 or fewer than 200
- We filter cells that have >5% mitochondrial counts
```{r}
# Visualize QC metrics as a violin plot
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
```
```{r}
# FeatureScatter is typically used to visualize feature-feature relationships, but can be used for anything calculated by the object, i.e. columns in object metadata, PC scores etc.
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
```
The Pearson correlation between the two features is displayed above each plot.
*nFeature_RNA vs nCount_RNA*:
nCount_RNA is the total number of molecules captured for each cell, while nFeature_RNA is the number of unique genes detected in each cell. By plotting the relationship between these two variables, we can assess overall data quality: cells with low nCount_RNA and nFeature_RNA may indicate poor capture or low-quality data, and they can be excluded from downstream analyses to ensure robust results. Here we see no sign of poor-quality data; in general, we do want a high correlation between these two variables.
Outliers can arise from various technical or biological factors, such as cell doublets or experimental artifacts. The scatter plot helps identify such outliers by highlighting cells that deviate from the expected relationship between nCount_RNA and nFeature_RNA.
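Before subsetting, it can help to count how many cells each threshold flags. A minimal sketch using the cutoffs applied below:
```{r}
# How many cells fail each QC cutoff?
sum(pbmc$nFeature_RNA <= 200)   # too few genes detected
sum(pbmc$nFeature_RNA >= 2500)  # suspiciously many genes (possible doublets)
sum(pbmc$percent.mt >= 5)       # high mitochondrial fraction (dying cells)
```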
```{r}
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
```
Compare this violin plot with the earlier one to see which cells were filtered out.
```{r}
# SCTransform(object = pbmc)
# SCTransform() can be used as an alternative to the NormalizeData / FindVariableFeatures / ScaleData workflow.
# Results are saved in a new assay (named "SCT" by default): counts holds the (corrected) counts,
# data holds log1p(counts), and scale.data holds the Pearson residuals. Intermediate results from
# sctransform::vst are saved in the misc slot of the new assay.
```
**Normalizing the data**
Normalization is like making sure everyone in our fellowship is speaking the same language and using the same measuring system. This will help us compare gene expression levels between different cells.
We utilize a normalization technique called "LogNormalize" to standardize the feature expression measurements for each individual cell. This normalization method involves dividing the expression values of each feature by the total expression of that cell, followed by multiplication with a scaling factor (typically set to 10,000 by default). Finally, we apply a logarithmic transformation to the normalized values. This process helps ensure that the expression values are comparable across cells and enables us to effectively analyze and interpret the data.
```{r}
# Alternative: "RC" (relative counts) normalization skips the log transform; with scale.factor = 1e6 this gives counts per million (CPM)
#pbmc <- NormalizeData(pbmc, normalization.method = "RC", scale.factor = 1e6)
```
```{r}
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
```
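Concretely, LogNormalize computes log1p(count / total counts in the cell * scale.factor) for every gene. A toy sketch of that arithmetic for one hypothetical cell:
```{r}
# LogNormalize in miniature (hypothetical UMI counts for three genes)
counts <- c(geneA = 3, geneB = 0, geneC = 12)
log1p(counts / sum(counts) * 10000)
```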
```{r}
par(mfrow = c(1, 2))
# original expression distribution
raw_geneExp <- as.vector(pbmc[['RNA']]@counts) %>% sample(10000)
raw_geneExp <- raw_geneExp[raw_geneExp != 0]
hist(raw_geneExp)
# expression distribution after normalization
logNorm_geneExp <- as.vector(pbmc[['RNA']]@data) %>% sample(10000)
logNorm_geneExp <- logNorm_geneExp[logNorm_geneExp != 0]
hist(logNorm_geneExp)
```
**Identification of highly variable features (feature selection)**
Now that we've normalized our data, we need to identify which genes are most important for our analysis. We do this by identifying highly variable genes.
Highly variable genes are like the real heroes in our fellowship, the ones who stand out from the rest and make a big impact. These genes exhibit a lot of variation in expression levels between cells, which can be a good indication of biological significance.
The 'vst' method fits a line to the relationship between log(variance) and log(mean) using local polynomial regression (loess), then standardizes the feature values using the observed mean and the expected variance given by the fitted line. Feature variance is then calculated on the standardized values after clipping them to a maximum (see the clip.max parameter).
```{r}
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)
```
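To inspect the per-gene statistics that vst computed, and the ten genes we just pulled out, we can print them. (HVFInfo() returns the stored per-feature mean, variance, and standardized variance.)
```{r}
# Per-gene statistics behind the variable-feature selection
head(HVFInfo(pbmc))
top10
```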
By default, we return 2,000 features per dataset. You can adjust this number depending on the size and complexity of your dataset.
```{r}
# plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE, xnudge=0, ynudge=0)
plot2
```
Once we've identified our highly variable genes, we can use them to focus our downstream analysis and highlight the biological signal in our single-cell dataset.
Keep up the great work, and don't forget that even the smallest heroes can have a big impact!
**Scaling the data**
The next step is to apply a linear transformation called "scaling". This is a standard pre-processing step that we need to do before we can use techniques like PCA (principal component analysis) to analyze our data.
Scaling is like making sure that everyone's voice is at the same volume in our fellowship. We want to make sure that each gene is given equal importance in downstream analysis, so that highly-expressed genes don't dominate the results.
We'll use the "ScaleData" function in Seurat to apply scaling to our dataset. Here's the code we'll use:
```{r}
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes, do.center = TRUE)
```
In this code, we're using the "ScaleData" function to shift the expression of each gene so that the mean expression across cells is 0, and scale the expression of each gene so that the variance across cells is 1. This step is important because it helps us compare gene expression levels between cells that might have different total amounts of RNA and so that highly-expressed genes do not dominate.
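In other words, each gene is z-scored across cells. A toy sketch of that arithmetic, using hypothetical normalized values for one gene across five cells:
```{r}
# ScaleData's per-gene transformation in miniature: center to mean 0, scale to variance 1
expr <- c(0.5, 1.2, 0.0, 2.3, 0.9)
(expr - mean(expr)) / sd(expr)
```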
Now that we've scaled our data, we can move on to dimensional reduction techniques like PCA. Keep up the great work, and remember that in our fellowship, everyone's voice is important!
**Perform linear dimensional reduction**
PCA helps us identify patterns in our data by reducing the number of features in our dataset. This makes it easier to visualize and analyze our data. After running PCA, we can move on to clustering our cells and identifying different cell types based on gene expression patterns. Keep up the great work, and remember, with PCA, one does not simply analyze all the features in a dataset!
```{r}
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
```
```{r}
# Examine and visualize PCA results a few different ways
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
```
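The PC scores themselves can also be pulled out directly. A quick sketch using Seurat's Embeddings() accessor, which returns the cells-by-PCs score matrix:
```{r}
# Cell embeddings (PC scores) for the first 5 PCs
head(Embeddings(pbmc, reduction = "pca")[, 1:5])
```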
The VizDimLoadings function helps you understand which genes contribute the most to each dimension. It creates a plot where the genes are ranked based on their contribution to a particular dimension, with their loadings plotted along the axis.
```{r}
#Visualize top genes associated with reduction components
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
```
X-axis: The x-axis shows the loading of each gene on the given dimension. The loading indicates the extent to which a gene contributes to that dimension; a larger absolute loading value means the gene has a stronger influence on it.
Y-axis: The y-axis lists the genes, sorted by their contribution to the given dimension.
---------------------------------------------------------------------------------------------
The DimPlot function takes this reduced-dimensional representation and creates a scatter plot where each point represents a cell, and its position is determined by the cell's embeddings in the reduced space.
```{r}
# Dimensional reduction plot
DimPlot(pbmc, reduction = "pca")
# The axes in the scatter plot correspond to the dimensions (components) obtained from the dimensionality reduction technique
# Each point in the scatter plot represents a cell from your dataset
```
The position of the point in this dimplot reflects the cell's embeddings in the reduced space. Cells with similar gene expression profiles tend to cluster together in the plot.
-------------------------------------------------------------------------------------------
DimHeatmap() allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide which PCs to include for further downstream analyses.
It can be a valuable tool for exploring correlated feature sets.
```{r}
#Dimensional reduction heatmap
# Explore the primary sources of variation in our dataset using DimHeatmap
DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
#The DimHeatmap function produces a heatmap where the rows represent the genes and the columns represent the cells.
```
The "balanced" argument ensures that the plot has an equal number of genes with both + and - scores.
The color intensity of each cell in the heatmap reflects the expression level of the corresponding gene in that cell (darker color for higher expression).
```{r}
DimHeatmap(pbmc, dims = 1:9, cells = 500, balanced = TRUE)
```
The "position" of a gene along PC_n refers to the magnitude of its loading on this component. A gene with a higher positive loading has a strong positive association with PC_n, meaning its expression levels tend to increase along the direction of this component. Conversely, a gene with a higher negative loading has a strong negative association with PC_n, indicating its expression levels tend to decrease along the direction of this component.
By looking at the expression patterns of our top PCs, we can also start to get a sense of the different cell types present in our dataset.
**Determine the ‘dimensionality’ of the dataset**
When we analyze scRNA-seq data, we often want to identify groups of cells that behave similarly. Seurat can help us do this by clustering cells based on their similarities and differences. To do this, we first need to reduce the dimensionality of the data, which we do using PCA.
Each PC is a ‘metafeature’ that combines information across a correlated feature set. The top principal components therefore represent a robust compression of the dataset.
Choosing how many principal components to keep is like choosing the members of the "fellowship of the ring" based on their performance in a smaller task: we select the principal components that perform best at distinguishing between cells and keep them for further analysis.
To answer this question, we use a technique called the JackStraw procedure. We randomly shuffle a small portion of the data (usually 1%) and run PCA on this shuffled data multiple times. By doing this, we create a null distribution of feature scores and identify the significant principal components based on the enrichment of low p-value features.
```{r}
## NOTE: This process can take a long time for big datasets, comment out for expediency. More approximate techniques such as those implemented in ElbowPlot() can be used to reduce computation time
pbmc <- JackStraw(pbmc, num.replicate = 30) # num.replicate = 100 will take longer
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
```
```{r}
JackStrawPlot(pbmc, dims = 1:15)
```
This process can take a while for large datasets, but there are faster techniques like the ElbowPlot() function that can be used to reduce computation time. Once we have identified the significant principal components, we can use them to cluster cells based on their similarities and differences.
An alternative heuristic method generates an ‘Elbow plot’: a ranking of principal components based on the percentage of variance explained by each one.
```{r}
ElbowPlot(pbmc)
```
Just like Gandalf had to choose the right hobbits to join the quest, we have to select the relevant sources of heterogeneity in our dataset.
Deciding on the number of principal components to keep can be as tricky as deciding on the number of members for a fellowship.
1) The first approach is more supervised: we explore the principal components to identify relevant sources of heterogeneity, optionally in conjunction with tools like GSEA, just like using Legolas' keen eyes to examine each principal component and judge its relevance.
2) The second approach (JackStraw) is a statistical test based on a random null model, but it may not return a clear cutoff and is time-consuming for large datasets. This is very similar to Frodo's journey to Mount Doom: a lot of wandering, way too time-consuming, with no clear end in sight.
3) Finally, the third approach is a heuristic (the elbow plot) that is commonly used and can be calculated instantly, just like how Samwise used his intuition to make quick decisions.
In this study, all three approaches produced similar results, and we could have reasonably chosen a cutoff anywhere between PC 7-12.
**Cluster the cells**
Seurat v3 uses a graph-based clustering method to group cells with similar feature expression patterns.
First, a K-nearest neighbor (KNN) graph is constructed based on the Euclidean distance in PCA space, and edge weights between cells are refined using shared overlap in their local neighborhoods. This step is performed using the FindNeighbors() function, which takes as input the previously chosen dimensionality of the dataset (here, the first 8 PCs).
Next, modularity optimization techniques like the Louvain algorithm are applied to group cells together, with the goal of optimizing the standard modularity function. This is done using the FindClusters() function, with a resolution parameter that sets the ‘granularity’ of the downstream clustering. The Idents() function can be used to retrieve the cluster assignments.
Just like Frodo needed a trusted community to complete his journey, we need to cluster cells into meaningful groups to complete downstream analyses.
```{r}
pbmc <- FindNeighbors(pbmc, dims = 1:8)
pbmc <- FindClusters(pbmc, resolution = 0.7)
```
```{r}
# Look at cluster IDs of the first 5 cells
head(Idents(pbmc), 5)
```
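A quick look at how many cells landed in each cluster:
```{r}
# Number of cells per cluster
table(Idents(pbmc))
```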
**Run non-linear dimensional reduction (UMAP/tSNE)**
Imagine you have a vast kingdom with many cities and towns. Each city represents a group of cells with similar characteristics, and you want to explore them on a map. However, your map is limited to only two dimensions, which means you need a way to compress the information and place similar cities together.
That's where tSNE and UMAP algorithms come in. They are like magical spells that allow you to transform the data into a lower-dimensional space, where similar cells are placed closer together. This helps you visualize and explore the data more easily.
To make sure the cells in each group (or cluster) are located near each other on the map, we suggest using the same "metafeatures" (i.e., PCs) that were used for clustering analysis as input to tSNE and UMAP. This will help you see the differences and similarities between the clusters more clearly.
```{r}
# If you haven't installed UMAP, you can do so via reticulate::py_install(packages = 'umap-learn')
# Note: FindNeighbors() above used dims = 1:8; as suggested above, you may prefer to pass the same PCs here
pbmc <- RunUMAP(pbmc, dims = 1:10)
```
```{r}
# Setting `label = TRUE` (or using the LabelClusters() function) labels individual clusters on the plot
DimPlot(pbmc, reduction = "umap", label = TRUE)
```
**Finding differentially expressed features (cluster biomarkers)**
If Seurat were a wizard, finding markers that define clusters via differential expression would be its signature spell - "Expecto Markerum!" By default, Seurat identifies positive and negative markers of a single cluster (specified in ident.1), compared to all other cells. But if you want to go on a more adventurous quest, FindAllMarkers() automates this process for all clusters.
```{r}
# find all markers of cluster 2
cluster2.markers <- FindMarkers(pbmc, ident.1 = 2, min.pct = 0.25)
head(cluster2.markers, n = 5)
```
However, beware of the min.pct argument, which requires a feature to be detected at a minimum percentage in either of the two groups of cells. It's like requiring a hobbit to have at least 50% hairy feet before letting them join the fellowship. And don't forget the thresh.test argument, which requires a feature to be differentially expressed (on average) by some amount between the two groups. It's like requiring Legolas to be at least 1 foot taller than Frodo before he can join the fellowship.
If you're feeling brave, you can set both of these arguments to 0, but be prepared for a long and treacherous journey through a large number of features that are unlikely to be highly discriminatory. Alternatively, you can use the max.cells.per.ident argument to downsample each identity class to have no more cells than whatever this is set to. It's like asking Gandalf to use his magic to shrink the fellowship down to a more manageable size. Yes, there may be a loss in power, but the speed increases can be significant, and the most highly differentially expressed features will likely still rise to the top.
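For instance, here's a sketch of that downsampling (the cap of 200 cells per cluster is an arbitrary illustrative value):
```{r}
# Downsample each identity class to at most 200 cells to speed up marker testing
cluster2.fast <- FindMarkers(pbmc, ident.1 = 2, min.pct = 0.25, max.cells.per.ident = 200)
head(cluster2.fast, n = 5)
```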
```{r}
# find all markers distinguishing cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)
```
```{r}
# find markers for every cluster compared to all remaining cells, report only the positive ones
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
pbmc.markers %>%
    group_by(cluster) %>%
    slice_max(n = 2, order_by = avg_log2FC)
```
```{r}
# The ROC test returns the 'classification power' of each marker, from 0 (random) to 1 (perfect)
cluster0.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)
head(cluster0.markers, n = 5)
```
```{r}
VlnPlot(pbmc, features = c("S100A8", "S100A9"))
```
```{r}
# you can plot raw counts as well
VlnPlot(pbmc, features = c("S100A8", "S100A9"), slot = "counts", log = TRUE)
```
```{r}
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))
```
```{r}
FeaturePlot(pbmc, features = c("MS4A1", "GNLY"))
```
```{r}
# Top 10 markers per cluster (note: this reuses the name `top10`, overwriting the earlier list of variable features)
pbmc.markers %>%
    group_by(cluster) %>%
    top_n(n = 10, wt = avg_log2FC) -> top10
DoHeatmap(pbmc, features = top10$gene) + NoLegend()
```