diff --git a/vignettes/vignette.Rmd b/vignettes/vignette.Rmd index 27d72eb..53e9e61 100644 --- a/vignettes/vignette.Rmd +++ b/vignettes/vignette.Rmd @@ -199,10 +199,130 @@ ggplot(m.prop.ana.h, aes(HbA1c, ct.prop)) + geom_smooth(method = 'lm', se = FAL ``` ![](GSE50244beta.jpeg) +## MuSiC Estimation Example: Mouse/Rat Kidney +We reproduce here the analysis detailed in MuSiC manuscript: + +- Bulk RNA-seq data gene expression data were obtained from GEO dataset [GSE81492](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE81492), which contains read counts data on bulk mRNA-seq of mouse kidney from 10 mouses, 6 healthy and 4 APOL1 mouses (Beckerman et al. 2017). + +- Single cell RNA-seq gene expression data were obtained from GEO dataset [GSE107585](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE107585), which contains read counts data on single cell mRNA-seq of mouse kidney from 7 male healthy mouse (Park et al. 2018). We used a [subset](https://github.com/xuranw/MuSiC/tree/master/vignettes/data) of single cell data as reference in deconvolution. + +### Data Preparation +#### Bulk data + +The dataset [GEO entry (GSE81492)](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE81492) contains raw RNA-seq and sample annotation data. For the purpose of this vignette, we will use the read counts data `Mousebulkeset.rds` on the [this page](https://github.com/xuranw/MuSiC/tree/master/vignettes/data). Please download this file. + +``` r +# Download Mouse bulk dataset from Github +Mouse.bulk.eset = readRDS('yourpath/Mousebulkeset.rds') +Mouse.bulk.eset + +#ExpressionSet (storageMode: lockedEnvironment) +#assayData: 19033 features, 10 samples +# element names: exprs +#protocolData: none +#phenoData +# sampleNames: control.NA.27 control.NA.30 ... APOL1.GNA78M (10 total) +# varLabels: sampleID SubjectName +# varMetadata: labelDescription +#featureData: none +#experimentData: use 'experimentData(object)' +#Annotation: +``` + +#### Single cell data + +The single cell data are from [GEO entry (GSE107585)](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE107585), which contrains read counts for 16273 genes of 43745 cells. A subset of the read counts `Mousesubeset.rds` are available on the [this page](https://github.com/xuranw/MuSiC/tree/master/vignettes/data), in the form of an `ExpressionSet`. This subset contains 16273 genes and 10000 cells. Please download this file. + +``` r +# Download EMTAB single cell dataset from Github +Mousesub.eset = readRDS('yourpath/Mousesubeset.rds') +Mousesub.eset + +#ExpressionSet (storageMode: lockedEnvironment) +#assayData: 16273 features, 10000 samples +# element names: exprs +#protocolData: none +#phenoData +# sampleNames: TGGTTCCGTCGGCTCA-2 CGAGCCAAGCGTCAAG-4 ... GAGCAGAGTCAACATC-1 (10000 total) +# varLabels: sampleID SubjectName cellTypeID cellType +# varMetadata: labelDescription +#featureData: none +#experimentData: use 'experimentData(object)' +#Annotation: +``` +Notice that the single cell dataset has 16 cell types, including 2 novel cell types and a transition cell type. We exclude those 3 +### Estimation of cell type proportions + +#### Cluster of Single Cell Data +In the estimation procedure, the first step is to produce design matrix, cross-subject mean of relative abundance, cross-subject variance of relative abundance and average library size from single cell reference. `music_basis` is the function to produce the design matrix (`Disgn.mtx`), cross-subject mean of relative abundance (`M.theta`), cross-subject variance of relative abundance (`Sigma`), and average library size (`M.S`). + +MuSiC utilized all genes in the next estimation steps and this will lead to collineary of design matrix. +```r +# Produce the first step information +Mousesub.basis = music_basis(Mousesub.eset, clusters = 'cellType', samples = 'sampleID', + select.ct = c('Endo', 'Podo', 'PT', 'LOH', 'DCT', 'CD-PC', 'CD-IC', 'Fib', 'Macro', 'Neutro', + 'B lymph', 'T lymph', 'NK')) + +# Plot the dendrogram of design matrix and cross-subject mean of realtive abundance +par(mfrow = c(1, 2)) +d <- dist(t(log(Mousesub.basis$Disgn.mtx + 1e-6)), method = "euclidean") +# Hierarchical clustering using Complete Linkage +hc1 <- hclust(d, method = "complete" ) +# Plot the obtained dendrogram +plot(hc1, cex = 0.6, hang = -1, main = 'Cluster log(Design Matrix)') +d <- dist(t(log(Mousesub.basis$M.theta + 1e-8)), method = "euclidean") +# Hierarchical clustering using Complete Linkage +# hc2 <- hclust(d, method = "complete" ) +hc2 <- hclust(d, method = "complete") +# Plot the obtained dendrogram +plot(hc2, cex = 0.6, hang = -1, main = 'Cluster log(Mean of RA)') +``` +![](Cluster.jpg) + +The immune cells are clustered together and so are the epithelial cells. Notice that DCT and PT are close to each other. The cut-off is user determined. Here we cut 13 cell types into 4 groups: + + - Group 1: Neutro + - Group 2: Podo + - Group 3: Endo, CD-PC, CD-IC, LOH, DCT, PT + - Group 4: Fib, Macro, NK, B lymph, T lymph + +MuSiC.cluster is an application of MuSiC to a hierarchical structure, which include 2 steps: + +- Step 1. Estimate proportions of each cluster; $(p_1,p_2,p_3,p_4)$ +- Step 2. Estimate cell type proportions within each clusters; $(p_{31},p_{32},.,p_{36},p_{41},.,p_{45})$ + +We manually specify the cluster and annotated single cell data with cluster information. +```r +clusters.type = list(C1 = 'Neutro', C2 = 'Podo', C3 = c('Endo', 'CD-PC', 'LOH', 'CD-IC', 'DCT', 'PT'), C4 = c('Macro', 'Fib', 'B lymph', 'NK', 'T lymph')) + +cl.type = as.character(Mousesub.eset$cellType) + +for(cl in 1:length(clusters.type)){ + cl.type[cl.type %in% clusters.type[[cl]]] = names(clusters.type)[cl] +} +pData(Mousesub.eset)$clusterType = factor(cl.type, levels = c(names(clusters.type), 'CD-Trans', 'Novel1', 'Novel2')) + +# 13 selected cell types +s.mouse = unlist(clusters.type) +``` +We then select genes that differential expressed genes within cluster `C3` (Epithelial cells) and `C4` (Immune cells). Function `music_prop.cluster` is used for estimation with cluster. + +```r +load('yourpath/IEmarkers.RData') + +Est.mouse.bulk = music_prop.cluster(bulk.eset = Mouse.bulk.eset, sc.eset = Mousesub.eset, group.markers = IEmarkers, clusters = 'cellType', group = 'clusterType', samples = 'sampleID', clusters.type = clusters.type) +``` + + +The results might be different from the one presented in the manuscript due to incomplete reference single cell dataset. + + Reference ------------- -* Segerstolpe, Å. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell metabolism 24, 593-607 (2016). +---------- +* Segerstolpe, ?. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell metabolism 24, 593-607 (2016). * Xin, Y. et al. RNA sequencing of single human islet cells reveals type 2 diabetes genes. Cell metabolism 24, 608-615 (2016). * Fadista, J. et al. Global genomic and transcriptomic analysis of human pancreatic islets reveals novel genes influencing glucose metabolism. Proceedings of the National Academy of Sciences 111, 13924-13929 (2014). * Newman, A.M. et al. Robust enumeration of cell subsets from tissue expression profiles. Nature methods 12, 453 (2015). * Baron, M. et al. A Single-Cell Transcriptomic Map of the Human and Mouse Pancreas Reveals Inter- and Intra-cell Population Structure. Cell Syst 3, 346-360 e344 (2016). +* Park, J. et al. Single-cell transcriptomics of the mouse kidney reveals potential cellular targets of kidney disease. Science, eaar2131 (2018). +* Beckerman, P. et al. Transgenic expression of human APOL1 risk variants in podocytes induces kidney disease in mice. Nat Med 23, 429-438 (2017).