Update vignette.Rmd

xuranw · Jul 24, 2018 · a715f15 · a715f15
1 parent 27a51b7
commit a715f15
Showing 1 changed file with 122 additions and 2 deletions.
diff --git a/vignettes/vignette.Rmd b/vignettes/vignette.Rmd
@@ -199,10 +199,130 @@ ggplot(m.prop.ana.h, aes(HbA1c, ct.prop)) + geom_smooth(method = 'lm',  se = FAL
 ```
 ![](GSE50244beta.jpeg)
 
+## MuSiC Estimation Example: Mouse/Rat Kidney
+We reproduce here the analysis detailed in MuSiC manuscript: 
+
+- Bulk RNA-seq data gene expression data were obtained from GEO dataset [GSE81492](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE81492), which contains read counts data on bulk mRNA-seq of mouse kidney from 10 mouses, 6 healthy and 4 APOL1 mouses (Beckerman et al. 2017). 
+
+- Single cell RNA-seq gene expression data were obtained from GEO dataset [GSE107585](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE107585), which contains read counts data on single cell mRNA-seq of mouse kidney from 7 male healthy mouse (Park et al. 2018). We used a [subset](https://github.com/xuranw/MuSiC/tree/master/vignettes/data) of single cell data as reference in deconvolution.
+
+### Data Preparation
+#### Bulk data
+
+The dataset [GEO entry (GSE81492)](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE81492) contains raw RNA-seq and sample annotation data. For the purpose of this vignette, we will use the read counts data `Mousebulkeset.rds` on the [this page](https://github.com/xuranw/MuSiC/tree/master/vignettes/data). Please download this file.
+
+``` r
+# Download Mouse bulk dataset from Github
+Mouse.bulk.eset = readRDS('yourpath/Mousebulkeset.rds')
+Mouse.bulk.eset
+
+#ExpressionSet (storageMode: lockedEnvironment)
+#assayData: 19033 features, 10 samples 
+#  element names: exprs 
+#protocolData: none
+#phenoData
+#  sampleNames: control.NA.27 control.NA.30 ... APOL1.GNA78M (10 total)
+#  varLabels: sampleID SubjectName
+#  varMetadata: labelDescription
+#featureData: none
+#experimentData: use 'experimentData(object)'
+#Annotation:
+```
+
+#### Single cell data
+
+The single cell data are from [GEO entry (GSE107585)](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE107585), which contrains read counts for 16273 genes of 43745 cells. A subset of the read counts `Mousesubeset.rds` are available on the [this page](https://github.com/xuranw/MuSiC/tree/master/vignettes/data), in the form of an `ExpressionSet`. This subset contains 16273 genes and 10000 cells. Please download this file. 
+
+``` r
+# Download EMTAB single cell dataset from Github
+Mousesub.eset = readRDS('yourpath/Mousesubeset.rds')
+Mousesub.eset
+
+#ExpressionSet (storageMode: lockedEnvironment)
+#assayData: 16273 features, 10000 samples 
+#  element names: exprs 
+#protocolData: none
+#phenoData
+#  sampleNames: TGGTTCCGTCGGCTCA-2 CGAGCCAAGCGTCAAG-4 ... GAGCAGAGTCAACATC-1 (10000 total)
+#  varLabels: sampleID SubjectName cellTypeID cellType
+#  varMetadata: labelDescription
+#featureData: none
+#experimentData: use 'experimentData(object)'
+#Annotation:  
+```
+Notice that the single cell dataset has 16 cell types, including 2 novel cell types and a transition cell type. We exclude those 3
+### Estimation of cell type proportions
+
+#### Cluster of Single Cell Data
+In the estimation procedure, the first step is to produce design matrix, cross-subject mean of relative abundance,  cross-subject variance of relative abundance and average library size from single cell reference. `music_basis` is the function to produce the design matrix (`Disgn.mtx`), cross-subject mean of relative abundance (`M.theta`), cross-subject variance of relative abundance (`Sigma`), and average library size (`M.S`). 
+
+MuSiC utilized all genes in the next estimation steps and this will lead to collineary of design matrix.
+```r
+# Produce the first step information
+Mousesub.basis = music_basis(Mousesub.eset, clusters = 'cellType', samples = 'sampleID', 
+                             select.ct = c('Endo', 'Podo', 'PT', 'LOH', 'DCT', 'CD-PC', 'CD-IC', 'Fib', 'Macro', 'Neutro',
+                                           'B lymph', 'T lymph', 'NK'))
+
+# Plot the dendrogram of design matrix and cross-subject mean of realtive abundance
+par(mfrow = c(1, 2))
+d <- dist(t(log(Mousesub.basis$Disgn.mtx + 1e-6)), method = "euclidean")
+# Hierarchical clustering using Complete Linkage
+hc1 <- hclust(d, method = "complete" )
+# Plot the obtained dendrogram
+plot(hc1, cex = 0.6, hang = -1, main = 'Cluster log(Design Matrix)')
+d <- dist(t(log(Mousesub.basis$M.theta + 1e-8)), method = "euclidean")
+# Hierarchical clustering using Complete Linkage
+# hc2 <- hclust(d, method = "complete" )
+hc2 <- hclust(d, method = "complete")
+# Plot the obtained dendrogram
+plot(hc2, cex = 0.6, hang = -1, main = 'Cluster log(Mean of RA)')
+```
+![](Cluster.jpg)
+
+The immune cells are clustered together and so are the epithelial cells. Notice that DCT and PT are close to each other. The cut-off is user determined. Here we cut 13 cell types into 4 groups:
+
+ - Group 1: Neutro
+ - Group 2: Podo
+ - Group 3: Endo, CD-PC, CD-IC, LOH, DCT, PT
+ - Group 4: Fib, Macro, NK, B lymph, T lymph
+
+MuSiC.cluster is an application of MuSiC to a hierarchical structure, which include 2 steps:
+
+- Step 1. Estimate proportions of each cluster; $(p_1,p_2,p_3,p_4)$
+- Step 2. Estimate cell type proportions within each clusters; $(p_{31},p_{32},.,p_{36},p_{41},.,p_{45})$
+
+We manually specify the cluster and annotated single cell data with cluster information. 
+```r
+clusters.type = list(C1 = 'Neutro', C2 = 'Podo', C3 = c('Endo', 'CD-PC', 'LOH', 'CD-IC', 'DCT', 'PT'), C4 = c('Macro', 'Fib', 'B lymph', 'NK', 'T lymph'))
+
+cl.type = as.character(Mousesub.eset$cellType)
+
+for(cl in 1:length(clusters.type)){
+  cl.type[cl.type %in% clusters.type[[cl]]] = names(clusters.type)[cl]
+}
+pData(Mousesub.eset)$clusterType = factor(cl.type, levels = c(names(clusters.type), 'CD-Trans', 'Novel1', 'Novel2'))
+
+# 13 selected cell types
+s.mouse = unlist(clusters.type)
+```
+We then select genes that differential expressed genes within cluster `C3` (Epithelial cells) and `C4` (Immune cells). Function `music_prop.cluster` is used for estimation with cluster.
+
+```r
+load('yourpath/IEmarkers.RData')
+
+Est.mouse.bulk = music_prop.cluster(bulk.eset = Mouse.bulk.eset, sc.eset = Mousesub.eset, group.markers = IEmarkers, clusters = 'cellType', group = 'clusterType', samples = 'sampleID', clusters.type = clusters.type)
+```
+
+
+The results might be different from the one presented in the manuscript due to incomplete reference single cell dataset.
+
+
 Reference
-------------
-* Segerstolpe, Å. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell metabolism 24, 593-607 (2016).
+----------
+* Segerstolpe, ?. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell metabolism 24, 593-607 (2016).
 * Xin, Y. et al. RNA sequencing of single human islet cells reveals type 2 diabetes genes. Cell metabolism 24, 608-615 (2016).
 * Fadista, J. et al. Global genomic and transcriptomic analysis of human pancreatic islets reveals novel genes influencing glucose metabolism. Proceedings of the National Academy of Sciences 111, 13924-13929 (2014).
 * Newman, A.M. et al. Robust enumeration of cell subsets from tissue expression profiles. Nature methods 12, 453 (2015).
 * Baron, M. et al. A Single-Cell Transcriptomic Map of the Human and Mouse Pancreas Reveals Inter- and Intra-cell Population Structure. Cell Syst 3, 346-360 e344 (2016).
+* Park, J. et al. Single-cell transcriptomics of the mouse kidney reveals potential cellular targets of kidney disease. Science, eaar2131 (2018).
+* Beckerman, P. et al. Transgenic expression of human APOL1 risk variants in podocytes induces kidney disease in mice. Nat Med 23, 429-438 (2017).