If you are reading this blog that means you are already familiar with the principle behind the single cell RNA Seq(scRNA-Seq). You might have some scRNA-Seq data in your hand which you would like to analyze. I will walk you through some of the currect best practices in the scRNA-Seq data analysis. Most of the materials that will be discussed in this tutorial are available in Seurat website in a short format. I will try to explain why we are doing every steps. Now let start analysizing some scRNA-Seq data. That that I will use can be downloaded from here.
First we need to loads R packages that we will be using.

library(dplyr)
library(Seurat)
library(patchwork)
library(ggplot2)

Pre-processing scRNA-Seq data

Raw data generated by sequencing machine are aligned to the genome or transcriptome and count the number of reads belong to transcripts. In order to reduce the effect from PCR duplication every mRNA in the cell are tagged with unique molecular identifier (UMI) then counted number of the UMI belong to each transcript. There are different softwares for pre-processing raw data, 10X Genomics use a command line software called cellrangr in order to process the raw data and generate the UMI count matrix. Cellranger produce a UMI count matrix where row corespond to the number of RNA molecule for a feature and each column correspond to a cell.

Reading raw data

After unziping count data file from the above link. I will load count data in R by Read10X function of Seurat and creat a Seurat Object.

pbmc.data <- Read10X(data.dir = "~/Downloads/filtered_gene_bc_matrices/hg19/")
#Initialize the Seurat object with the count data that loaded by Read10X function
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

Quality control of count data

Before analysing the scRNA-Seq data, we need to make sure that reads we are analysing correspond to viable cells. There are few different quality control (QC) that we can perform to select only the viable cells for further analysis. Commonly used QC matrices are: 1. Number of total counts per cell. We can examine the histogram total number count per cell in order to identify the oulier peak that we can use as a threshold to filter the low quality data.

hist(pbmc$nCount_RNA, breaks = 1000)

2. Number of gene expressed per cell. We can examine the histogram of number of genes expressed per cell.

hist(pbmc$nFeature_RNA, breaks = 1000)

hist(pbmc$nCount_RNA, breaks = 1000, xlim = c(0,2000))

3. The fraction of counts from mitocondrial genes. We use PercentageFeatureSet function to calculate percentage of reads thats belong to mitocondrial genes. We make a metadata column in Seurate object with percentage of reads belongs to mitocondrial genes.

pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
hist(pbmc$percent.mt, breaks = 1000)

What we are doing is that looking at all QC matrices indivisually. We can use VlnPlot function of seurate to look at distribution of all three QC matrices together.

VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

We can use scater plot to visualize relationship between features in QC matrices.

ggplot(pbmc@meta.data, aes(x=nCount_RNA, y=nFeature_RNA)) + geom_point() + xlab("Count Depth") + ylab("Number of genes")

ggplot(pbmc@meta.data, aes(x=nCount_RNA, y=percent.mt)) + geom_point() + xlab("Count Depth") + ylab("Percentage of mitocondiral gene reads")

ggplot(pbmc@meta.data, aes(x=nCount_RNA, y=nFeature_RNA,color = ifelse(percent.mt < 5, "Fail", "Pass"))) +      geom_point() + xlab("Count Depth") + ylab("Number of genes")

In all the above scatter plots we can see that cells with higher percentage of mitocondrial genes reads have lower number of non-mitocondrial genes expressed and lower number of total read counts. High percentage of mitocondrial genes with low count depth and lower number of genes expression is an idicative of the disruption of the cell membrane and leakage of cytoplasmic RNA thus only RNA remains is mitocondrial RNA.
On the other hand higher read depth and highier number of genes expressed represent doublets cells. High count depth and highier number of expressed gene can be used to filter doublets in scRNA-Seq data. Never filter your data based of the distribution of single QC matrice because these QC could also have some biological significance. QC matrices we are using here such as high percentage of mitocondrial RNA might represent highier respiration rate. Cell with lower count depth or lower number of genes expression might represent cell in quiescent stat. Highier count depth and gene expression might represent larger cell size. Because of these biological significance of our QC matrices we alway should look at the QC matrices jointly by examining their scatter plots. By looking at the scatter plots above we decided to filter data based on criteria such as lowest and highest number of genes expression and percentage of mitocondrial RNA. We filter out data expresss at less than 200 genes and more than 2500 genes. We also filter out cells that has more than 5 percent of the mitocondrial RNA.

pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)

We should also change these filtering parameter based on QC matrices of your own data.

Normalization

Each scRNA-Seq experiment involve multiple steps such collection of RNA from signle cell, reverse transcription, sequencing. Every step can contribute to the variability to the total amount of RNA sequenced from a cell. In order to compare gene expression between cells we need to consider this technical variability. Various normalization methods can be used in order to accounts for this variability. Seurat use a global scaling normalization methods where it is assume that all cells in dataset initially contains equal number of RNA and the variability of count depth arises from the technical variability of the sample preparation and sequencing. In this method expression value of each RNA in a specific cell is normalize by the count depth of the that cell. The value then multiplied by 10000 and log transform. log transformation reduce the mean variance relationship and reduce the skewness in the data which is the assumtion for many downstream analysis that data is normaly distributed.

pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)

There are more methods of normalization which I will discuss sometime later. In this tutorial we are only processing a single data sets there I am not going to talk about batch effect and data correction, which you might need to consider if you have multiple datasets generated using multiple technologies.

Feature Selection

Goal of the scRNA-Seq experiment is to measure expression of all 25000 genes in human genome. In many of the cells at lot these genes will not be expressed therefore will have zero count and a large number of these genes will not be informative. Even after removing genes wilth low number of count, the dataset still have very high dimentionality such as 1500 or more genes. In order to reduce noise in the dataset we can select genes that are highly variable among cells by using FindVariableFeatures function of Seurat. This function perform following steps to rank genes with high variability. 1. Calculate expected variance from a polynomial fit of raw mean and variance. 2. Standardize the count data using the expected variance and raw mean and raw of value of the count. 3. It then calculate variance of standardize value and use this variance to rank genes.

pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

In the above code we selected top 2000 genes from the rank based on variance. Lets look at the variable genes by plotting mean and standard variance in scatter plot.

plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = head(VariableFeatures(pbmc), 10), repel = TRUE)
## Warning: Using `as.character()` on a quosure is deprecated as of rlang 0.3.0.
## Please use `as_label()` or `as_name()` instead.
## This warning is displayed once per session.
plot1 + plot2
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous x-axis

Scalling data

Scalling linearly transform the data such that the mean of counts from each gene is 0 and variance is 1. R command below will transform 2000 highly variable genes.

all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)

Dimentionality reduction

After feature selection and scalling we can use dedicated dimentionality reduction algorithm to further reduce dimentionality in our data. Goal of the dementionality reduction algorithm is to express variability in our data as low dimention as possible. There are two main objectives of dimentionality reduction such as visualization and summarization. Explaining data in 2 or 3 dimention help to visualize them in scatter plot. Summarizing data in low dimentions make earier to perform further analysis. Two different dimentionality reduction methods can be used in scRNA-Seq data analysis.

Linear Dimentionality reduction

Principle Components Analysis(PCA) is the most commonly used linear dimentionality reduction methods in scRNA-Seq data analysis. PCA summarize the variablity in data set using certain number of dimentions. The number of dimentions that explain maximum variability in datasets can be found by “elbow” heuristics or the permutation‐test‐based jackstraw method. Seurat has RunPCA function that can perform PCA on features of your interests.

pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))

Seurat has some buildin function such as VizDimReduction, DimPlot, and DimHeatmap that we can use to vidualize our data in lower dimension.

VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")

DimPlot(pbmc, reduction = "pca")

DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)

You can change dims and cells parameter see what happen. Lets make 15 hatmap for 15 different dimentions.

DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)

Heatmap of lower dimension produce very nice cluster for cells wherease higher dimention does not. The PC that produce nice cluster we can them for further analysis. Therefore, this heatmap can be used to make decision about number of PCs to use for future analysis.

How many components we should choose for further analysis?

As I said earlier we use two different methods such as jackstraw and elbow to determine the appropriate number of dimentions for further analysis. Seurat has has buildin function based on both methods that we can use to decide number of dimentions for further analysis.

pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)

We can plot each dimention and see how they perform.

JackStrawPlot(pbmc, dims = 1:15)
## Warning: Removed 23496 rows containing missing values (geom_point).

We can see that there is a sharp drop in significance after 10 and 11 PC. Another approach to determine numbers dimension in the data is by looking at the Elbow plot where number of principle coponents is plotted against the variance.

ElbowPlot(pbmc)

As the number of principle component increase variance reduce. We can choose the number of PCs that has lowest variability. Determining number of dimension in the data set for further analysis can be challenging. It would be good to consider all three methods for choosing number of dimensions.

Clustering Cells

Organizing cells into clusters is typical first intermediate results of any scRNA-Seq analysis. Cells are clustered based on the similarity of their gene expresssion measured via a distance matrix. This method use dimensionality reduced datasets as input and calculate similarity scoring based on PC-reduced expression space. Clustering can be performed using two different approaches such as clustering algorithm and community detection methods. Most common clustering algorithm we can use for scRNA-Seq data is k-means clustering. Most common community detection methods are graph-partitioning algorithm which rely on graph representation of scRNA-Seq data. This graph representation of dimensionally reduced RNA-Seq data can be obtained using k-nearest neighbor algorithm. Cells are represented as nodes in the graph. Each cell is connected to its K most similar cells, which are typically obtained using Euclidean distances on the PC‐reduced expression space. Depending on the size of the dataset, K is commonly set to be between 5 and 100 nearest neighbours. Based on selected number of PCs, we can now cluster cells using K-nearest neighbor then partition this graph into highly interconnected communities. We use Seurat function FindNeighbors to construct KNN graph from reduced PC space. In order to cluster the cells we use Seurat function FindClusters, which use modularity optimization techniques such as Louvain algorithm to group cells together.

pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2638
## Number of edges: 96033
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8720
## Number of communities: 9
## Elapsed time: 0 seconds

Non-linear dimensionally reduction

To visualize scRNA-Seq data we can also use non-linear dimensionality reduction technique such as tSNE and UMAP and ploting in low dimensions. Cells within the graph-based clusters determined above should co-localize on these dimension reduction plots. As input to the UMAP and tSNE, we suggest using the same PCs as input to the clustering analysis.

pbmc <- RunUMAP(pbmc, dims = 1:10)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
DimPlot(pbmc, reduction = "umap")

Cluster annotation

Now we have clusters of cells in scRNA-Seq, now we would like to annotate cluster with cell types. We can use gene signatures of each cluster to annotate it with cell type. Identifying and annotating cluster relies on using external sources of information that describe the expected expression profile of individual cell type. We can obtained expected expression profile from the human cell atlas, which have gene expression data different type of human cells. If we do not have reference gene expression profile, we can use our data to derived marker genes which we can compare with literature-derived marker gene xpression value to determine cell type.

Identification of marker genes

Merker gene sets can be obtained by using differential expression testing between two groups: cells in one cluster and all other cells in datasets.We consider only the upregulated genes to identify cluster. We rank genes based on their difference in expression between two groups and top-ranked genes are considered as marker genes. Seurat function FindAllMarkers can be used to identify markers for all clusters. There are different option for this function you can change the options see what happen.

pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)
## # A tibble: 18 x 7
## # Groups:   cluster [9]
##        p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene    
##        <dbl>     <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>   
##  1 1.96e-107     0.730 0.901 0.594 2.69e-103 0       LDHB    
##  2 1.61e- 82     0.922 0.436 0.11  2.20e- 78 0       CCR7    
##  3 7.95e- 89     0.892 0.981 0.642 1.09e- 84 1       LTB     
##  4 1.85e- 60     0.859 0.422 0.11  2.54e- 56 1       AQP3    
##  5 0.            3.86  0.996 0.215 0.        2       S100A9  
##  6 0.            3.80  0.975 0.121 0.        2       S100A8  
##  7 0.            2.99  0.936 0.041 0.        3       CD79A   
##  8 9.48e-271     2.49  0.622 0.022 1.30e-266 3       TCL1A   
##  9 2.96e-189     2.12  0.985 0.24  4.06e-185 4       CCL5    
## 10 2.57e-158     2.05  0.587 0.059 3.52e-154 4       GZMK    
## 11 3.51e-184     2.30  0.975 0.134 4.82e-180 5       FCGR3A  
## 12 2.03e-125     2.14  1     0.315 2.78e-121 5       LST1    
## 13 7.95e-269     3.35  0.961 0.068 1.09e-264 6       GZMB    
## 14 3.13e-191     3.69  0.961 0.131 4.30e-187 6       GNLY    
## 15 1.48e-220     2.68  0.812 0.011 2.03e-216 7       FCER1A  
## 16 1.67e- 21     1.99  1     0.513 2.28e- 17 7       HLA-DPB1
## 17 7.73e-200     5.02  1     0.01  1.06e-195 8       PF4     
## 18 3.68e-110     5.94  1     0.024 5.05e-106 8       PPBP

Seurat also includes various function for visualization of marker gene expression across clusters. For example we can use FeaturePlot function to visualize marker genes expression on a tSNE UMAP plots.

FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", 
    "CD8A"))

We can also use DoHeatmap function to generate an expression heatmap for given cells and features.

top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
DoHeatmap(pbmc, features = top10$gene) + NoLegend()

### Annotate cluster with cell type

new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()