Skip to content

Commit

Permalink
Update vignette.Rmd
Browse files Browse the repository at this point in the history
  • Loading branch information
xuranw authored Jul 24, 2018
1 parent 27a51b7 commit a715f15
Showing 1 changed file with 122 additions and 2 deletions.
124 changes: 122 additions & 2 deletions vignettes/vignette.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -199,10 +199,130 @@ ggplot(m.prop.ana.h, aes(HbA1c, ct.prop)) + geom_smooth(method = 'lm', se = FAL
```
![](GSE50244beta.jpeg)

## MuSiC Estimation Example: Mouse/Rat Kidney
We reproduce here the analysis detailed in MuSiC manuscript:

- Bulk RNA-seq data gene expression data were obtained from GEO dataset [GSE81492](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE81492), which contains read counts data on bulk mRNA-seq of mouse kidney from 10 mouses, 6 healthy and 4 APOL1 mouses (Beckerman et al. 2017).

- Single cell RNA-seq gene expression data were obtained from GEO dataset [GSE107585](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE107585), which contains read counts data on single cell mRNA-seq of mouse kidney from 7 male healthy mouse (Park et al. 2018). We used a [subset](https://github.com/xuranw/MuSiC/tree/master/vignettes/data) of single cell data as reference in deconvolution.

### Data Preparation
#### Bulk data

The dataset [GEO entry (GSE81492)](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE81492) contains raw RNA-seq and sample annotation data. For the purpose of this vignette, we will use the read counts data `Mousebulkeset.rds` on the [this page](https://github.com/xuranw/MuSiC/tree/master/vignettes/data). Please download this file.

``` r
# Download Mouse bulk dataset from Github
Mouse.bulk.eset = readRDS('yourpath/Mousebulkeset.rds')
Mouse.bulk.eset

#ExpressionSet (storageMode: lockedEnvironment)
#assayData: 19033 features, 10 samples
# element names: exprs
#protocolData: none
#phenoData
# sampleNames: control.NA.27 control.NA.30 ... APOL1.GNA78M (10 total)
# varLabels: sampleID SubjectName
# varMetadata: labelDescription
#featureData: none
#experimentData: use 'experimentData(object)'
#Annotation:
```

#### Single cell data

The single cell data are from [GEO entry (GSE107585)](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE107585), which contrains read counts for 16273 genes of 43745 cells. A subset of the read counts `Mousesubeset.rds` are available on the [this page](https://github.com/xuranw/MuSiC/tree/master/vignettes/data), in the form of an `ExpressionSet`. This subset contains 16273 genes and 10000 cells. Please download this file.

``` r
# Download EMTAB single cell dataset from Github
Mousesub.eset = readRDS('yourpath/Mousesubeset.rds')
Mousesub.eset

#ExpressionSet (storageMode: lockedEnvironment)
#assayData: 16273 features, 10000 samples
# element names: exprs
#protocolData: none
#phenoData
# sampleNames: TGGTTCCGTCGGCTCA-2 CGAGCCAAGCGTCAAG-4 ... GAGCAGAGTCAACATC-1 (10000 total)
# varLabels: sampleID SubjectName cellTypeID cellType
# varMetadata: labelDescription
#featureData: none
#experimentData: use 'experimentData(object)'
#Annotation:
```
Notice that the single cell dataset has 16 cell types, including 2 novel cell types and a transition cell type. We exclude those 3
### Estimation of cell type proportions

#### Cluster of Single Cell Data
In the estimation procedure, the first step is to produce design matrix, cross-subject mean of relative abundance, cross-subject variance of relative abundance and average library size from single cell reference. `music_basis` is the function to produce the design matrix (`Disgn.mtx`), cross-subject mean of relative abundance (`M.theta`), cross-subject variance of relative abundance (`Sigma`), and average library size (`M.S`).

MuSiC utilized all genes in the next estimation steps and this will lead to collineary of design matrix.
```r
# Produce the first step information
Mousesub.basis = music_basis(Mousesub.eset, clusters = 'cellType', samples = 'sampleID',
select.ct = c('Endo', 'Podo', 'PT', 'LOH', 'DCT', 'CD-PC', 'CD-IC', 'Fib', 'Macro', 'Neutro',
'B lymph', 'T lymph', 'NK'))

# Plot the dendrogram of design matrix and cross-subject mean of realtive abundance
par(mfrow = c(1, 2))
d <- dist(t(log(Mousesub.basis$Disgn.mtx + 1e-6)), method = "euclidean")
# Hierarchical clustering using Complete Linkage
hc1 <- hclust(d, method = "complete" )
# Plot the obtained dendrogram
plot(hc1, cex = 0.6, hang = -1, main = 'Cluster log(Design Matrix)')
d <- dist(t(log(Mousesub.basis$M.theta + 1e-8)), method = "euclidean")
# Hierarchical clustering using Complete Linkage
# hc2 <- hclust(d, method = "complete" )
hc2 <- hclust(d, method = "complete")
# Plot the obtained dendrogram
plot(hc2, cex = 0.6, hang = -1, main = 'Cluster log(Mean of RA)')
```
![](Cluster.jpg)

The immune cells are clustered together and so are the epithelial cells. Notice that DCT and PT are close to each other. The cut-off is user determined. Here we cut 13 cell types into 4 groups:

- Group 1: Neutro
- Group 2: Podo
- Group 3: Endo, CD-PC, CD-IC, LOH, DCT, PT
- Group 4: Fib, Macro, NK, B lymph, T lymph

MuSiC.cluster is an application of MuSiC to a hierarchical structure, which include 2 steps:

- Step 1. Estimate proportions of each cluster; $(p_1,p_2,p_3,p_4)$
- Step 2. Estimate cell type proportions within each clusters; $(p_{31},p_{32},.,p_{36},p_{41},.,p_{45})$

We manually specify the cluster and annotated single cell data with cluster information.
```r
clusters.type = list(C1 = 'Neutro', C2 = 'Podo', C3 = c('Endo', 'CD-PC', 'LOH', 'CD-IC', 'DCT', 'PT'), C4 = c('Macro', 'Fib', 'B lymph', 'NK', 'T lymph'))

cl.type = as.character(Mousesub.eset$cellType)

for(cl in 1:length(clusters.type)){
cl.type[cl.type %in% clusters.type[[cl]]] = names(clusters.type)[cl]
}
pData(Mousesub.eset)$clusterType = factor(cl.type, levels = c(names(clusters.type), 'CD-Trans', 'Novel1', 'Novel2'))

# 13 selected cell types
s.mouse = unlist(clusters.type)
```
We then select genes that differential expressed genes within cluster `C3` (Epithelial cells) and `C4` (Immune cells). Function `music_prop.cluster` is used for estimation with cluster.

```r
load('yourpath/IEmarkers.RData')

Est.mouse.bulk = music_prop.cluster(bulk.eset = Mouse.bulk.eset, sc.eset = Mousesub.eset, group.markers = IEmarkers, clusters = 'cellType', group = 'clusterType', samples = 'sampleID', clusters.type = clusters.type)
```


The results might be different from the one presented in the manuscript due to incomplete reference single cell dataset.


Reference
------------
* Segerstolpe, Å. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell metabolism 24, 593-607 (2016).
----------
* Segerstolpe, ?. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell metabolism 24, 593-607 (2016).
* Xin, Y. et al. RNA sequencing of single human islet cells reveals type 2 diabetes genes. Cell metabolism 24, 608-615 (2016).
* Fadista, J. et al. Global genomic and transcriptomic analysis of human pancreatic islets reveals novel genes influencing glucose metabolism. Proceedings of the National Academy of Sciences 111, 13924-13929 (2014).
* Newman, A.M. et al. Robust enumeration of cell subsets from tissue expression profiles. Nature methods 12, 453 (2015).
* Baron, M. et al. A Single-Cell Transcriptomic Map of the Human and Mouse Pancreas Reveals Inter- and Intra-cell Population Structure. Cell Syst 3, 346-360 e344 (2016).
* Park, J. et al. Single-cell transcriptomics of the mouse kidney reveals potential cellular targets of kidney disease. Science, eaar2131 (2018).
* Beckerman, P. et al. Transgenic expression of human APOL1 risk variants in podocytes induces kidney disease in mice. Nat Med 23, 429-438 (2017).

0 comments on commit a715f15

Please sign in to comment.