RaceID is a method for cell type identification from single-cell RNA-seq data by unsupervised learning. An initial clustering is followed by an outlier identification based on a backgorund model of combined technical and biological variability in single-cell RNA-seq data obtained by quantification with unique molecular identifiers. StemID permits subsequent inference of a lineage tree based on clusters, i.e. cell types, identified by RaceID. The current version of RaceID (RaceID3) and StemID (StemID2) are published (Herman, Sagar, and Grün 2018). This package implements additional improvements in rare cell type detection, offers batch effect removal utilities, optional imputing of gene expression, and substantially decreases runtime as well as memory usage. We tested the method successfully on a dataset of ~50k cells. We now added another method termed VarID to the package, which allows the quantification of local gene expression variability, but furthermore offers an alternative clustering approach to RaceID based on density clustering which has certain advantages and, in particular, permits handling much larger datasets, since there is no requiremnet of storing a distance matrix. This method can be fully integrated into a RaceID analysis (see section "VarID"). RaceID offers several optional steps and here we will show examples of how to perform typical RaceID/StemID analyses.
Input to RaceID is a matrix of raw expression values (unique molecular identifiers with or without Poisson correction (Grün, Kester, and van Oudenaarden 2014)) with cells as column and genes as rows. This matrix can be provided as a matrix object, a data.frame or a sparse matrix produced by the Matrix
package.
The RaceID
package comes with sample data intestinalData
for murine intestinal epethelial cells stored in sparse matrix format. The dataset was published previously (Grün et al. 2016).
RaceID and StemID functions have various input and output parameters, which are explained in detail on the help
pages. Here, we mostly use default parameters, which represent a good choice for common datasets.
To start the analysis, a RaceID single-cell sequencing (SCseq
) object is initialized with a count matrix.
library(RaceID)
sc <- SCseq(intestinalData)
The first step is the application of filtering for the purpose of quality control. Cells with a relatively low total number of transcripts are discarded.
sc <- filterdata(sc,mintotal=2000)
In this example, we filter out cells with <2,000 transcipts. The function allows further filtering of genes by choosing the input parameters minexpr
and minnumber
, i.e. only genes with at least minexpr
transcripts in at least minnumber
cells are retained. The filtered and normalized expression matrix (normalized to the minimum total transcript count across all cells retained after filtering) can be retrieved by the getfdata
function:
fdata <- getfdata(sc)
The filterdata
function allows the detection and removal of batch effects by different methods as outlined below in addtional examples. Alternatively, individual genes or groups of genes correlating to specific genes can be removed with the FGenes
and CGenes
arguments. This frequently allows a minimally invasive removal of batch effects or of particularly highly expressed genes with an unwanted dominating effect on the clustering.
Next, the distance matrix is computed as the input for clustering and outlier identification. This can be done with or without imputing gene expression from nearest neighbours (see example below for imputing). RaceID offers various alternative metric, e.g. spearman, kendall, and euclidean, as well as measure of proportionality (rho and phi from the propr
package).
sc <- compdist(sc,metric="pearson")
This function computes the distance matrix based on highly variable genes by default. If all genes should be used, then the parameter FSelect
needs to be set to FALSE
. On this distance matrix clustering is performed:
sc <- clustexp(sc)
To infer the initial cluster number, this function computes the average within-cluster dispersion up to a number of clusters specified by the clustnr
arguments, which equals 30 by default. If more major populations are expected, this parameter needs to be set to higher values which will increase the run time. The initial cluster number only serves as a rough estimate, since the subsequent outlier identification will infer additional clusters. The internal inference of the cluster number and the evaluation of cluster stability by the computation of Jaccard's similarity is done on all cells by default. For large datasets it is reasonable to subsample a limited number of cells, by setting the samp
argument, e.g., to 1,000. In this case, the cluster number is inferred based on this smaller subset and Jacccard's similarity is not computed by bootstrapping but by sub-sampling. For k-medoids clustering, subsetting will imply almost deterministic clustering partitions, if the sample size approaches the size of the dataset. Therefore, samp
should be signicantly smaller then the size of the dataset. Otherwise, bootstrapping is better for assessing the cluster stability. Subsampling can be expected to give a good estimate of the number of major clusters. Additional small clusters which might have been missed by the sampling can be reintroduces during the outlier identification step.
The inferred cluster number can be inspected in a saturation plot, which shows the decrease of the average within-cluster dispersion with increasing cluster number. If this decrease becomes constant, saturation is reached. The automatically chosen cluster number is detected such that the decrease is equal to the decrease upon further increasing the cluster number within the error bars:
plotsaturation(sc,disp=FALSE)
The average within-cluster dispersion can also by plotted:
plotsaturation(sc,disp=TRUE)
The cluster stability as assessed by Jaccard's similarity should also be inspected:
plotjaccard(sc)
In this example, the automated criterion overestimated the cluster number leading to instability as indicated by low Jaccard's similarity. Based on visual inspection of the average within-cluster dispersion as a function of the cluster number, we manually set the cluster number to 7 without recomputing the saturation behaviour.
sc <- clustexp(sc,cln=7,sat=FALSE)
This function perform k-medoids clustering by default. K-means clustering or hierarchical clustering can be chosen with the FUNcluster
argument. For very large datasets, hierarchical clustering leads to significantly smaller run time.
Subsequently, outliers in the initial k-medoids clusters are identified based on an internally computed background model for the expected gene expression variability and the assumption that transcript counts follow a negative binomial distribution defined by the mean and the variance of the expression of each gene in a cluster. Outlier genes will be in the tail of this distribution at a p-value defined by the probthr
parameter (1e-3 by default), and outlier cells require the presence of a number of outlier genes defined by the outlg
parameter (2 by default).
sc <- findoutliers(sc)
In contrast to previous versions, outlier genes are inferred from non-normalized transcript counts, which follow a negative binomial distribution modeling the joint technical and biological variability. The assumption of a negative binomial distribution was demonstrated for raw transcript (UMI) count data, but is not strictly valid for normalized expression values (Grün, Kester, and van Oudenaarden 2014). Hence, RaceID does not require normalization, since the correlation-based metric for the computation of the distance object is also independent of the normalization. Normalizaion is only relevant when using, e.g., the euclidean metric for the derivation of the distance matrix. RaceID applies a simple size normalization for data representation and follow-up analyses.
The background noise model can be inspected:
plotbackground(sc)
The number of outliers as a function of the p-value can be plotted:
plotsensitivity(sc)
Another way of checking the presence of outliers is the inspection of a barplot of p-values across all cells in each cluster:
plotoutlierprobs(sc)
A heatmap of cell-to-cell distances grouped be the final clusters inferred based on the original clusters and the outliers allows visual inspection of the clusters:
clustheatmap(sc)
## [1] 5 4 11 7 9 1 2 3 10 6 8
This function is not recommended for very large datasets, since it produces similarly large plotting output.
The best way of visualising the detetcted cell types is plotting cells and clusters in a two-dimensional reduction representaion. RaceID can compute a t-SNE map
sc <- comptsne(sc)
or a k-nearest neighbour graph layout utilizing the Fruchterman-Rheingold algorithm:
sc <- compfr(sc,knn=10)
In this example, the number of nearest neighbours was chosen to be 10. In general, different values for knn
should be tested to find an ideal layout.
RaceID further allows computation of a umap dimensional reduction representation:
sc <- compumap(sc)
Parameters for umap can be given as additional argument (see help function).
The t-SNE map can be plotted by
plotmap(sc)
The Fruchterman-Rheingold layout can be plotted by
plotmap(sc,fr=TRUE)
The umap representation can be plotted by
plotmap(sc,um=TRUE)
Maps can be changed for both t-SNE and Fruchterman-Rheingold algorithm by initializing the rseed
argument of the comptsne
or compfr
function with a random number.
The dimensional reduction maps can be inspected, e.g., for the localization of (a subset of) samples included in the analysis:
types <- sub("(\\_\\d+)$","", colnames(sc@ndata))
subset <- types[grep("IV|V",types)]
plotsymbolsmap(sc,types,subset=subset,fr=TRUE)
Expression of a gene of interest or aggregated expression for a group of genes can be highlighted in the dimensional reduction representation:
plotexpmap(sc,"Lyz1",logsc=TRUE,fr=TRUE)
g <- c("Apoa1", "Apoa1bp", "Apoa2", "Apoa4", "Apoa5")
plotexpmap(sc,g,n="Apoa genes",logsc=TRUE,fr=TRUE)
It is also possible to highight expression only for a subset of cells, e.g. a particular batch or sample:
sample <- colnames(sc@ndata)[grep("^I5d",colnames(sc@ndata))]
plotexpmap(sc,"Lyz1",cells=sample,logsc=TRUE,fr=TRUE)
For the murine intestinal example data, inspetion of known marker genes suggests that cluster 2 and 3 correspnd to Lgr5-expressing intestinal stem cells. Cluster 2 is proliferative as indicated by up-regulation of Mki67, Cluster 1 comprises transiently amplifying progenitors biased towards absorptive entorytes in cluster 4 marked by Apolipoproteins. Cluster 7 comprises Lysozyme-expressing Paneth cells while Mucus-producing goblet cells constitute clusters 6 and 10. To inspect clusters in more detail, differentially expressed genes can be inferred by an internal approach akin to DESeq2 (Love, Huber, and Anders 2014) but with a dispersion estimated globally from the background model of RaceID.
For instance, to obtain differentially expressed genes within cluster 9 compared to all other cells:
d <- clustdiffgenes(sc,4,pvalue=.01)
dg <- d$dg
head(dg,25)
## mean.ncl mean.cl fc pv padj
## Apoa1 0.6362058 21.398348 33.634315 0.000000e+00 0.000000e+00
## Apoa4 0.5702229 13.588646 23.830412 0.000000e+00 0.000000e+00
## Fabp1 0.7444607 21.348669 28.676691 0.000000e+00 0.000000e+00
## Fabp2 1.6023268 27.105795 16.916521 0.000000e+00 0.000000e+00
## Reg1 0.2360083 6.856378 29.051431 0.000000e+00 0.000000e+00
## Reg3b 0.4229290 7.360573 17.403803 9.665345e-88 3.371594e-85
## Gstm3 0.4901265 7.645545 15.599126 2.408636e-85 7.201821e-83
## Rbp2 0.1837525 4.865970 26.481103 1.340014e-73 3.505811e-71
## Reg3g 0.5903128 6.381428 10.810248 4.488428e-63 1.043809e-60
## Plac8 2.2049822 12.625809 5.726037 1.040516e-56 2.177800e-54
## Spink3 0.1808018 3.908387 21.616969 6.661021e-56 1.267411e-53
## Sis 0.2079850 4.014474 19.301750 3.388549e-55 5.910194e-53
## Anpep 0.1337270 3.602236 26.937242 4.383642e-55 7.057664e-53
## Guca2b 0.4864784 5.159079 10.604950 1.758228e-54 2.628551e-52
## Slc5a1 0.3030379 4.362130 14.394669 9.178274e-54 1.280675e-51
## Ces2e 0.8445971 6.651895 7.875820 9.650062e-53 1.262349e-50
## Apoc2 0.2539176 4.025073 15.851887 4.568302e-51 5.391473e-49
## Aldob 1.4071637 8.711907 6.191111 4.636718e-51 5.391473e-49
## Aldh1a1 0.1080674 2.965912 27.445004 4.730617e-47 5.211148e-45
## Crip1 1.3463788 7.794775 5.789437 4.182806e-45 4.377307e-43
## Mep1b 0.1180375 2.923125 24.764369 7.826703e-45 7.800614e-43
## Mttp 0.3015435 3.758237 12.463333 4.122038e-43 3.921557e-41
## Cyp3a11 0.1544745 3.071999 19.886768 6.232458e-43 5.671537e-41
## Clec2h 0.1374732 2.930410 21.316236 1.195432e-42 1.042516e-40
## Ckmt1 0.7801578 5.336224 6.839929 1.340511e-42 1.122276e-40
The differentially expressed genes (in this example only the up-regulated ones with a fold change >1) can be plottted in a heatmap, which can highlight the clusters and samples of origin:
types <- sub("(\\_\\d+)$","", colnames(sc@ndata))
genes <- head(rownames(dg)[dg$fc>1],10)
plotmarkergenes(sc,genes,samples=types)
The heatmap can also be ordered by cell names (i.e. by batch or sample) by setting order.cells
to TRUE
. With the input parameters cl
and cells
, the heatmap can be restricted to a subset of cluster or cells, respectively.
plotmarkergenes(sc,genes,cl=c(2,6,7,8,10),samples=types,order.cells=TRUE)
In this example, no inter-sample differences are apparent and all samples contribute to each cluster.
It is also possible to plot gene expression across clusters or samples using a dot plot, where the diameter represents the fraction of cells in a sample or cluster expressing a gene, and the color indicates the expression level or z-score.
For example, the following commmand plots expression z-scores across clusters 2,6,7,8, and 10:
fractDotPlot(sc, genes, cluster=c(2,6,7,8,10), zsc=TRUE)
Expression on a log2 scale across samples I, II, and III can be plotted by
samples <- sub("(\\d.+)$","", colnames(sc@ndata))
fractDotPlot(sc, genes, samples=samples, subset=c("I","II","III"), logscale=TRUE)
A differential gene expression analysis between two defined sets of cell, e.g., two (groups of) clusters can be performed:
A <- names(sc@cpart)[sc@cpart %in% c(2,4)]
B <- names(sc@cpart)[sc@cpart %in% c(3)]
x <- diffexpnb(getfdata(sc,n=c(A,B)), A=A, B=B )
plotdiffgenesnb(x,pthr=.05,lthr=.5,mthr=-1,Aname="Cl.2",Bname="Cl.3,5",show_names=TRUE,padj=TRUE)
See the paragraphs below for additional options of RaceID analyses and parameter choices ideal for analysing large datasets.
StemID is an algorithm for the inference of lineage trees and differentiation trajectories based on pseudo-temporal ordering of single-cell transcriptomes. It utilizies the clusters predicted by RaceID and thus requires to run RaceID first. The algorithm was originally published along with RaceID2 (Grün et al. 2016) and the improved current version StemID2 was published later (Herman, Sagar, and Grün 2018). In a nutshell, StemID infers links between clusters which are more populated with intermediate single-cell transcriptomes than expected by chance. To assign cells to inter-cluster links, two fundamentally different strategies are available (see nmode
argument below). The first strategy considers the projection of a vector connecting a cluster medoid to a member cell of the same cluster onto the links from the medoid of its cluster to the medoids of all other clusters. The longest projection identifies the link this cell is assigned to and defines the projection coordinate. The second (nearest neighbour) mode identifies for a cell in a given cluster the number of k nearest neighbours in each other cluster and assigns the cell to the link with the cluster where the average distance to these k nearest neighbours is minimized. The coordinate on the link is inferred as in the first mode. A faster approximate version of the first mode is also implemented.
As a first step, a lineage tree object for the StemID analysis needs to be initialized with an SCseq object obtained from a RaceID analysis:
ltr <- Ltree(sc)
Next, the transcriptome entropy of cell needs to be calculated. This is used by the StemID algorithm for the prediction of the stem cell type, based on maximum transcriptome entropy and maximum number of links to other clusters.
ltr <- compentropy(ltr)
In the subsequent step, cells are projected onto inter-cluster links. Cells are assigned to a link based on minimum distance to k nearest neighbours (nmode=TRUE
) or based on the maximum projection coordinate (nmode=FALSE
). Only clusters with >cthr
cells are included in the analysis. If fr=TRUE
then the Fruchterman-Rheingold layout will be used for representation of the inferred lineage tree, and if um=TRUE
a UMAP representation will be used. Otherwise, representation will be done in t-SNE space. The computation of the lineage graph is independent of the dimensional reduction method which is only used for visualization.
ltr <- projcells(ltr,cthr=5,nmode=FALSE,fr=TRUE)
If projections are used for link determenation (nmode=FALSE
), the derivation of link significance is done by comparing to the link population after randomizing cell positions within the boundaries imposed by the gene expression manifold. This is done by bootstrapping using 500 randomizations by default. More randomizations are possible, but obviously linearly increase runtime.
ltr <- projback(ltr,pdishuf=100)
Based on this information, a lineage graph is computed to approximate the lineage tree (a tree structure is not strictly imposed).
ltr <- lineagegraph(ltr)
Finally, link p-values are computed and a threshold pthr
is applied on the p-values:
ltr <- comppvalue(ltr,pthr=0.1)
The resulting graph can be plotted, overlaid with a dimensional reduction representation (Fruchterman-Rheingold or t-SNE, see projcells
). To retain only the more populated links, a cutoff scthr
on the linkscore can be applied, e.g. 0.2:
plotgraph(ltr,scthr=0.2,showCells=FALSE,showMap=TRUE)
To predict the stem cell, the StemID score can be computed and visualized:
x <- compscore(ltr,scthr=0.2)
StemID offers a number of plotting functions to inspect the results. RaceID performs clustering using Pearson's correlation as a metric by default. The StemID projections require a Euclidean space and thus an initial embedding into a high-dimensional space is performed by classical multidimensional scaling. To inspect how well cell-to-cell distances are preserved, a histogram of the log-ratios between the original and transformed distances can be plotted:
plotdistanceratio(ltr)
The StemID prediction can be compared to a minimal spanning tree of the cluster medoids:
plotspantree(ltr)
The cell projections onto the links can be directly compared with this minimal spanning tree:
plotspantree(ltr,projections=TRUE)
All linkscores and fold enrichments of the population on a link can be plotted as heatmaps:
plotlinkscore(ltr)
projenrichment(ltr)
The (z-scores of the) projections of all cells from a given cluster across all links can be plotted as heatmap, e.g. for cluster 3:
x <- getproj(ltr,i=3)
All cells on two different branches originating from the same cluster, e.g. cluster 3 cells on the links to cluster 1 and 8, can be extracted for the computation of differentially expressed genes:
x <- branchcells(ltr,list("1.3","3.8"))
head(x$diffgenes$z)
## Prss32 Nip7 Dhrs4 Srsf3 Tm4sf5 Tsc22d1
## 3.224859 2.814688 2.368598 2.335748 2.200576 2.187001
The cells on the two branches can be plotted as additional clusters in the dimensional reduction representation:
plotmap(x$scl,fr=TRUE)
Since the randomizations of cell positions for the derivation of link significance require long computation time, and the projection-based method leads to some weak links which are potentially false positives (and can be filtered out based on linkscore), the nearest-neighbour-based method has now been selected to be the default mode of StemID. This method is more robust and fast even on large datasets. The downside is that it will miss some weak links, i.e. lead to more false negatives in comparison to the projection mode.
First, a lineage tree object needs to be initialized followed by the calculation of the transcriptome entropy of each cell.
ltr <- Ltree(sc)
ltr <- compentropy(ltr)
Next, cell projection are calculated with the parameter nmode=TRUE
, which is also the default value:
ltr <- projcells(ltr,cthr=5,nmode=TRUE,fr=TRUE,knn=3)
The knn
parameter determines how many nearest neighbours are considered in each cluster for determining the link assignment: the distance two each cluster is calculated as the average across the distance of a cell to the knn
nearest neighbours within each other cluster, and the cell is assigned to the link with the cluster minimizing this distance. Now, the lineage tree is inferred and the p-values for the links are calculated based on a binomial model:
ltr <- lineagegraph(ltr)
ltr <- comppvalue(ltr,pthr=0.05)
The resulting lineage graph can be inspected and reveals the expected trajectories connecting the stem cells (cluster 2 and 3 of cycling and quiescent cells, respectively) to enterocytes (cluster 4) via transiently amplifying progenitors (cluster 1), to Paneth cells (cluster 7), and to goblet cells (cluster 6). The StemID score suggests stem cell identity for clusters 2 and 3:
plotgraph(ltr,showCells=FALSE,showMap=TRUE)
x <- compscore(ltr)
RaceID offers the possibility of batch correction utilizing an internal method or the published mnnCorrect
function from the batchelor
package (Haghverdi et al. 2018). The batchelor
package needs to be separately installed from bioconductor if this mode is desired. In order to apply batch effect removal, a list with a vector of cell ids for each batch needs to be defined, e.g.:
n <- colnames(intestinalData)
b <- list(n[grep("^I5",n)],n[grep("^II5",n)],n[grep("^III5",n)],n[grep("^IV5",n)],n[grep("^V5",n)])
This list is provided as input to the filterdata
function, and the bmode
argument is initialized with the desired method, i.e. mnnCorrect
or RaceID
. The latter method simply compares the local neigbourhood, i.e. the set of k nearest neighbours, for each cell between two batches and identifies the neighbourhood of the two batches with the smallest average distance. A differential gene expression analysis between the closest neighbourhoods of two batches yields batch associated genes. The next batch is then compared in the same way to the merged dataset of the previous batches. Batches are compared and successively merged according to the order they are provided in b
. An additional input parameter knn
controls the number of nearest neighbours, i.e. the size of the neighbourhood.
sc <- SCseq(intestinalData)
sc <- filterdata(sc,mintotal=2000,LBatch=b,bmode="RaceID",knn=10)
## RaceID3 batch correction...
## Adding batch: 2
##
## Adding batch: 3
##
## Adding batch: 4
## Rplp2
## Adding batch: 5
## Rplp2
The filterdata
function will identify all batch-associated genes, which are stored in the filterpar$BGenes
slot of the SCseq
object. All genes that correlate to a batch gene are removed for the computation of a distance object. This is a minimally invasive strategy in comparison to mnnCorrect
, which works well if batches are very similar, such as datasets produced from the same material using the same single-cell RNA-seq technology.
RaceID also offers optional imputing of gene expression, which can be useful if gene expression differences between cell types or states are governed only by lowly expressed genes and are difficult to resolve by clustering based on raw counts. If imputing is desired, the knn
argument needs to be initialized with the number of nearest neighbours used for imputing gene expression :
sc <- compdist(sc,knn=5,metric="pearson")
Now, for each cell the knn
nearest neighbours are used to infer a local negative binomial for each gene, utilizing a weighted mean expression and the internal RaceID noise model to obtain the corresponding negative binomial. The weights are derived by quadratic programming, computing the expression vector of a cell as a linear combination of its knn
nearest neighbours. The cell itself contributes with the same weight as the aggregated weights of the nearest neighbours to the weighted mean expression. With the help of this negative binomial the tail probability of each gene is derived across all knn
nearest neighbours. The geometric means of these tail probabilities are finally applied as a weights for each nearest neighbours in the calculation of the imputed gene expression. This strategy ensures that gene expression can only be learned from nearest neighbours following the same transcript count distributions.
After this, all steps remain the same. Imputing often helps to improve cluster discrimination and stability. Importantly, distances derived from imputed gene expression are only used for clustering. The outlier identification relies on unimputed gene expression, and hence can correct spurious clusters produced from imputed values.
sc <- clustexp(sc)
sc <- findoutliers(sc)
sc <- compfr(sc)
sc <- comptsne(sc)
plotmap(sc,fr=TRUE)
If batch effect removal has been applied, the remaining batch effect can be checked by plotting symbols representig the sample of origin:
types <- sub("(\\_\\d+)$","", colnames(sc@ndata))
plotsymbolsmap(sc,types,fr=TRUE)
Ideally, all samples should intermingle in each clusters. Imputed gene expression can be plotted by setting the imputed
argument to TRUE
. Otherwise, unimputed values are shown.
plotexpmap(sc,"Mki67",imputed=TRUE,fr=TRUE)
plotmarkergenes(sc,c("Clca4","Mki67","Defa24","Defa20","Agr2","Apoa1"),imputed=TRUE,samples=types)
An expression matrix with imputed expression values can be extracted for further analysis:
k <- imputeexp(sc)
RaceID can also regress out variability associated with particular sources such as batches or cell cycle. If batch effect removal has been done by the filterdata
function with bmode="RaceID"
then this function can further regress out residual variability remaining after batch associated genes have been discarded for the distance computation. In the case, the argument Batch
has to be set to TRUE
and vars
can be left empty if no further sources of variability should be regressed out. Batch effects can also be regressed out directly without prior removal using the filterdata
function:
sc <- SCseq(intestinalData)
sc <- filterdata(sc,mintotal=2000)
vars <- data.frame(row.names=colnames(intestinalData),batch=sub("(\\_\\d+)$","",colnames(intestinalData)))
sc <- varRegression(sc,vars)
sc <- compdist(sc,metric="pearson")
sc <- clustexp(sc)
sc <- findoutliers(sc)
sc <- comptsne(sc)
sc <- compfr(sc)
plotmap(sc)
However, regression also leads to loss of biological variation and this step is only recommended if variability associated with a particular variable is a strong confounding factor and cannot be removed by other means.
After running the filterdata
function, a prior dimensional reduction using PCA or ICA can be performed using the CCcorrect
function. This function can also be provided with a list vset
of sets of genes, and principal components with loadings enriched in any of these sets will be discarded. Another options is to provide a list CGenes
of genes, and sets to be tested for enrichment in each component are derived as the groups of all genes correlating to a component in FGenes. The CCcorrect
function predicts a dimension for the prior dimensional reduction based on an ellbow function of the explained variability as a function of the number of components. This can be inspected by the plotdimsat
function and manually adjusted:
sc <- SCseq(intestinalData)
sc <- filterdata(sc,mintotal=2000)
sc <- CCcorrect(sc,dimR=TRUE)
plotdimsat(sc)
plotdimsat(sc,change=FALSE)
sc <- filterdata(sc,mintotal=2000)
sc <- CCcorrect(sc,nComp=9)
sc <- compdist(sc,metric="pearson")
sc <- clustexp(sc)
sc <- findoutliers(sc)
sc <- comptsne(sc)
sc <- compfr(sc)
plotmap(sc)
Rerunning CCcorrect
requires to run filterdata
first, because otherwise the dimensional reduction scores in the dimRed
slot will be subject to a second dimensional reduction, which is not desired. All sub-sequent steps remain unaltered.
RaceID provides the option to run a random forests based reclassifiction, in order to obtain a stable clustering partition. This can be done on the final clustering after running findoutliers
:
sc <- SCseq(intestinalData)
sc <- filterdata(sc,mintotal=2000)
sc <- compdist(sc,metric="pearson")
sc <- clustexp(sc)
sc <- findoutliers(sc)
sc <- rfcorrect(sc)
sc <- comptsne(sc)
sc <- compfr(sc)
plotmap(sc)
However, this is normally not required, since the improved outlier detection of the current version leads to stable clusters which do not change substantially after applying this function. Running rfcorrect
takes very long for large datasets and should be omitted in this case.
For large dataset (>25k cells), we recommend using the VarID approach (see below) for density based clustering which does not require computation and storage of a huge distance matrix. For datasets below this size we recommend running RaceID/StemID with specific parameter settings. Since the correlation metric does not require dataset normalization, this approach has certained advantages compared to VarID. However, the latter can also be run with a correlation matrix. In the following, we discuss a few paramters critical for the runtime of RaceID/StemID on large datasets.
sc <- SCseq(intestinalData)
sc <- filterdata(sc,mintotal=2000)
sc <- compdist(sc)
Preferentially, clustering should be done by FUNcluster="kmedoids"
but "hclust"
often gives similar results and is significantly faster.
sc <- clustexp(sc,samp=1000,FUNcluster="hclust")
For the determination of the cluster number and the inference of Jaccard's similarity, sub-sampling should be applied by setting the subset size samp
to an integer number. A good choice could be 10-25% of the total number of cells. For large datasets this should be sufficient to discriminate the most abundant cluster, and additional smaller clusters will automatically be identified in the next step using the findoutliers
function. It is important that the clustnr
argument is much larger then the expected number of clusters. If stability analysis is not wanted, one can set bootnr=1
. If clustering is re-run with a specific cluster number, e.g. cln=12
, then the saturation criterion should be disabled by setting sat=FALSE
.
If the cluster granularity is too fine or too coarse, the probthr
argment can be decreased or increased, respectively, e.g.:
sc <- findoutliers(sc,probthr=1e-4)
It is adviced to change the perplexity
argument to larger values, when computing the t-SNE map for large datasets, e.g. perplexity=200
. However, a large perplexity will return an error for small datasets.
sc <- comptsne(sc,perplexity=100)
plotmap(sc)
The Fruchterman-Rheingold layout critically depends on the number knn
of nearest neighbours, and different values should be tested:
sc <- compfr(sc,knn=10)
plotmap(sc,fr=TRUE)
For a sub-sequent StemID analysis of very large datasets, the nearest-neighbour mode (see above) will help to reduce the runtime.
To inspect pseudotemporal expression profiles, functions provided by the FateID
package can be utilized. First, the trajectory needs to be defined based on a sequence of clusters. This sequence should ideally correspond to a trajectory predicted by StemID2, i.e. the clusters should be connected by a series of significant links. However, non-linked clusters can also be included. A pseudo-temporally ordered vector of cells along a StemID trajectory can be extracted with the cellsfromtree
function from an Ltree
object:
n <- cellsfromtree(ltr,c(2,1,4))
The list n
contains a vector n$f
of cell ids ordered by their projection coordinate along the trajectory reflecting differentiation pseudotime. This vector and a filtered or unfiltered expression matrix can be used as input for pseudo-temporal gene expression analysis. The filtered expression data used for RaceID3 can be extracted with the getfdata
function:
x <- getfdata(ltr@sc)
Additional filtering and subsetting of the gene expression matrix for cells on the trajectory, n$f
, is done in the next step, utilizing functions from the FateID
package:
library(FateID)
fs <- filterset(x,n=n$f)
The filterset
function can be used to eliminate lowly expressed genes on the trajectory from the subsequent analysis and has two additional arguments to discard genes, which are not expressed at a level of minexpr
or higher in at least minnumber
of cells. The function returns a filtered expression data frame with genes as rows and cells as columns in the same order as in the input vector n
. In the next step, a self-organizing map (SOM) of the pseudo-temporal expression profiles is computed:
s1d <- getsom(fs,nb=1000,alpha=.5)
This map provides a grouping of similar expression profiles into modules. The first input argument is again an expression data frame. In this case, we use the filtered expression table generated by the filterset function to retain only genes that are expressed on the trajectory under consideration. Pseudo-temporal expression profiles along the differentiation trajectory of interest are computed after smoothing by local regression with smoothing parameter alpha
.
This function returns a list of the following three components, i. e. a som object returned by the function som
of the package som
, a data frame x
with smoothed and normalized expression profiles, and a data frame zs
of z-score transformed pseudo-temporal expression profiles.
The SOM is then processed by another function to group the nodes of the SOM into larger modules and to produce additional z-score transformed and binned expression data frames for display:
ps <- procsom(s1d,corthr=.85,minsom=3)
The first input argument is given by the SOM computed by the function getsom
. The function has two additional input parameters to control the grouping of the SOM nodes into larger modules. The parameter corthr
defines a correlation threshold. If the correlation of the z-scores of the average normalized pseudo-temporal expression profiles of neighboring nodes in the SOM exceeds this threshold, genes of the neighboring nodes are merged into a larger module. Only modules with at least minsom
genes are kept. The function returns a list of various data frames with normalized, z-score-transformed, or binned expression along with the assignment of genes to modules of the SOM (see man pages for details).
The output of the processed SOM can be plotted using the plotheatmap function. First, in order to highlight the clustering partition y
the same color scheme as in the SCseq
object can be used:
y <- ltr@sc@cpart[n$f]
fcol <- ltr@sc@fcol
Now, the different output data frames of the procsom
function can be plotted.
Plot average z-score for all modules derived from the SOM:
plotheatmap(ps$nodes.z,xpart=y,xcol=fcol,ypart=unique(ps$nodes),xgrid=FALSE,ygrid=TRUE,xlab=FALSE)
Plot z-score profile of each gene ordered by SOM modules:
plotheatmap(ps$all.z,xpart=y,xcol=fcol,ypart=ps$nodes,xgrid=FALSE,ygrid=TRUE,xlab=FALSE)
Plot normalized expression profile of each gene ordered by SOM modules:
plotheatmap(ps$all.e,xpart=y,xcol=fcol,ypart=ps$nodes,xgrid=FALSE,ygrid=TRUE,xlab=FALSE)
Plot binarized expression profile of each gene (z-score < -1, -1 < z-score < 1, z-score > 1):
plotheatmap(ps$all.b,xpart=y,xcol=fcol,ypart=ps$nodes,xgrid=FALSE,ygrid=TRUE,xlab=FALSE)
In order to inspect genes within individual modules of the SOM, these genes can be extracted given the number of the module. The module numbers are contained in the return value nodes
of the procsom
function and can be extracted, e. g. for module number 24:
g <- names(ps$nodes)[ps$nodes == 24]
The average pseudo-temporal expression profile of this group can be plotted by the function plotexpression
:
plotexpression(fs,y,g,n$f,col=fcol,name="Node 24",cluster=FALSE,alpha=.5,types=NULL)
In the same way it is possible to plot expression profiles of individual genes, e.g.:
plotexpression(fs,y,"Clca4",n$f,col=fcol,cluster=FALSE,alpha=.5,types=NULL)
It is also possible to highlight the data points as specific symbols, for example reflecting batches, by using the types argument:
plotexpression(fs,y,g,n$f,col=fcol,name="Node 24",cluster=FALSE,alpha=.5,types=sub("\\_\\d+","",n$f))
VarID is a novel algorithm utilizing a k nearest neighbour graph derived from transcriptome similarities measured by a distance matrix with an arbitrary metric. The key idea is the inference of local neighbourhoods of cells in the same state, i.e. all k nearest neighbours follow the same transcript count distribution as the cell to which they are nearest neighbours. The computation starts with k nearest neighbours and infers a negative binomial distribtion for the transcript counts of each gene, derived from the mean-variance dependence computed the same way as for the RaceID background model. The mean expression is derived from a weighted average across the k nearest neighbours, where the weights are inferred using quadratic programming, representing the cell as a linear combination of its neighbours. The contribution of the "central" cell itself can be controlled by a given scale parameter alpha
. Typical values of alpha
should be in the range of 0 to 10. The default value is NULL
. In this case, alpha
is inferred by an optimization, i.e., alpha
is minimized under the constraint that the gene expression in a cell does not deviate more then one standard deviation from the predicted weigthed mean, where the standard deviation is calculated from the predicted mean using the background model (the average dependence of the variance on the mean expression). With the negative binomial distribution inferred from the local means and the global mean-variance dependence, the probabilities of observed transcript counts for all genes can be computed in each neighbour, and, after multiple testing correction, the minimum probability across all genes serves as a proxy for the probability of neighbours being in the same state. If this probability falls below a user-defined threshold, the neighbour is discarded. After this procedure is applied to the nearest neighbours of each cell, a pruned k nearest neighbour graph is obtained, in which the remaining links indicate that neighbours follow the same transcript count distribution for all genes. These local neighbourhoods can be used to compute local properties such as gene expression variability, which in turn can be correlated to cell state or predicted fate, derived, e.g., from RaceID or FateID analysis, respectively.
Although VarID analysis can be performed independently of the RaceID data object, an integration with a RaceID analysis permits convenient visualization of the results. For this reason, we here demonstrate an integrated pipeline, but explain at each step, how computation can be performed given minimal input data, i.e. an expression matrix only.
The VarID method requires unique molecular identifier (UMI) counts, which have been shown to follow a negative binomial distribution, a property also crucial for the RaceID analysis.
Given such a UMI matrix with cells as columns and genes as rows, we could first initialize a RaceID object, and perform cell and gene filtering to retain only expressed genes and high quality cells. This is followed by the computation of a distance matrix, which can be used for the VarID analysis.
sc <- SCseq(intestinalData)
sc <- filterdata(sc)
# calculate distance matrix
sc <- compdist(sc)
However, the preferred mode of running VarID does not require a distance matrix, and relies on k-nearest neighbor analysis in PCA space (see below). For large datasets, the computation of a distance matrix is prohibitive due to a memory requirement scaling with N^2 if N is the number of cells. If the metric
argument of the compdist
function is not initialized otherwise, the distance matrix is calculated as 1 - Pearson's correlation of the transcriptomes.
The filtered, unnormalized, and otherwise unprocessed raw UMI count matrix can then be retrieved by
d <- getExpData(sc)
and the distance matrix can be obtained by
distM <- sc@distances
VarID can handle sparse matrices, so d
can be a sparse matrix defined with the Matrix
package, if created independently of RaceID. The getExpData
function returns a sparse matrix. No distance matrix input is required for the distM
argument of pruneKnn
. In fact, if given it will be ignored as long as large
is set to TRUE
.
With this input the derivation and pruning of a k nearest neighbour graph can be performed:
res <- pruneKnn(d,distM=distM,large=FALSE,metric="pearson",genes=NULL,knn=10,alpha=NULL,no_cores=1,FSelect=FALSE)
In principle, only d
is required as input data for the VarID analysis and the distance matrix can be computed internally if the distM
argument is omitted. In this case, the metric
argument defines the metric utilized for the distance matrix calculation.
The VarID analysis can also be performed without computing a distance matrix, since computation and storage of distance matrices is not feasible for very large datasets (>30k cells) on a conventional work station due to memory limitations. As an alternative, the pruneKnn
function can be executed with the large
argument set to TRUE
, which is the default setting and usually shows better performance. In this case, the input expression matrix is first optionally corrected for variability associated with total transcript count per cell by a negative binomial regression (if regNB=TRUE
as by default). The mean dependence of the regression parameters is smoothed, and Pearson residual are computed. The matrix of Pearson residuals (default setting), or, alternatively, the raw count matrix then undergoes a dimensional reduction by a principle component analysis (PCA), including the top pcaComp
principle components (top 100 principle components by default). This reduced feature matrix is subsequently used for a fast k nearest neighbour search by the get.knn
function from the FNN
package based on euclidean distances, i.e., a correlation-based metric can no longer be used. The algorithm
parameter of the pruneKnn
function enables the selection of the search algorithm to apply. In general, it is recommended to run VarID with large=TRUE
and regNB=TRUE
(default settings). A RaceID object can be utilized to facilitate the analysis, but in this mode the compdist
function does not need to be executed:
sc <- SCseq(intestinalData)
sc <- filterdata(sc)
d <- getExpData(sc)
res <- pruneKnn(d,large=TRUE,pcaComp=100,regNB=TRUE,genes=NULL,knn=10,alpha=1,no_cores=1,FSelect=FALSE,ngenes=2000)
The genes
argument allows to limit the analysis to a subset of genes, knn
defines the number of nearest neighbours, alpha
is the scale parameter for the "central" cell, and FSelect
allows internal feature selection. The background model is calculated internally in the same way as done in RaceID and feature selection follows the same logic. The function further permits multithreading and the number of cores can be defines using the no_cores
argument. If this is set to NULL
, the function detects the number of available cores and uses this number minus two to leave resources for other tasks running on the same machine. Calling help
for this function explains all parameters in detail (as for all other functions).
This function internally fits a background model of gene expression variability, which can be inspected:
plotBackVar(res)
This model can also be computed separately:
bg <- fitBackVar(d)
plotBackVar(bg)
The function returns a list with the used distance matrix, a matrix with the indices of the k nearest neighbours for each cell, and a matrix with the corresponding probabilities. Finally, the background model is also returned.
From this object, a sparse distance matrix can optionally be created with non-zero entries only for k nearest neighbours with probabilities larger or equal than pvalue
:
y <- createKnnMatrix(res,pvalue=0.01)
This is not needed as input for the subsequent step, but enables independent analyses by the user.
This pruned k nearest neighbour information can be used as input for graph inference, Louvain clustering, and the derivation of a Fruchterman-Rheingold graph layout:
cl <- graphCluster(res,pvalue=0.01)
The function returns a list of three components: a graph object graph
, an object louvain
containing informarmation on the clustering, and a data.frame fr
containing a Fruchterman-Rheingold layout computed from the pruned k nearest neighbour graph, all derived using the igraph
package.
By setting the use.weights
argument of graphCluster
to TRUE
(default), clustering can be performed on the pruned knn graph with weighted links, where weights correspond to the link probabilities. This may lead to the emergence of too many clusters (which could happen, e.g., due to variability in the link probabilities when integrating different experimental batches). In this case, the use.weights
parameter can be set to FALSE
, and Louvain clustering is performed on a pruned knn graph with equal weights for all links.
The pvalue
parameter determines the probability cutoff for link pruning, i.e., links with probabilities lower than pvalue
are removed.
To visualize the clustering with the help of RaceID, one can update the Scseq
object with the results:
sc <- updateSC(sc,res=res,cl=cl)
This function call will initialize the cpart
and the cluster$kpart
slots with the clustering partition in cl$louvain$membership
, and the fr
slot with the Fruchterman-Rheingold layout.
The clustering can readily be inspected in the Fruchterman-Rheingold layout:
plotmap(sc,fr=TRUE)
After computing a t-SNE map or, the clustering can be also be visualized:
sc <- comptsne(sc)
plotmap(sc)
Alternatively, a umap representation can be computed for visualization:
sc <- compumap(sc)
plotmap(sc,um=TRUE)
The pruned k nearest neighbour graph allows the derivation of transition probabilities between clusters, based on the remaining links connecting two clusters (at a minimum probability given by pvalue
) and the inferred probabilities of these links:
probs <-transitionProbs(res,cl,pvalue=0.01)
If RaceID is used for visualization and has been updated with the clustering using the updateSC
function (see above), the transition probabilities can be visualized in the dimensional reduction representation:
plotTrProbs(sc,probs,tp=.5,prthr=0,cthr=0)
The argument prthr
allows discarding all links with a transition probability <= prthr
and the th cthr
argument sets a cluster size threshold. Only clusters with > cthr
cells are shown.
The inference of homogenous local neighbourhoods, given by a cell and its remaining nearest neighbours after pruning, permits the computation of local quantities across such sets of similar cells. To investigate the dynamics of gene expression noise across cell states, we derive local estimates of gene expression variability. The systematic dependence of gene expression variance on the mean expression is a central challenge for the derivation of differential variability between two states. VarID solves this problem using one of two alternative strategies. The first strategy is based on a negative binomial regression on the total transcript count in each cell. The pearson residuals of this regression serve as an expression estimate corrected for the mean-dependence.To avoid overfitting, the regression parameters as a function of mean expression are smoothed by a local polynomial regression. Local variability is then computed as the variance of pearson residuals across the pruned k nearest neighbours. The second approach relies on infering a baseline variability, comprising biological and technical noise, reflecting the systematic dependency, and regresses out this trend, in order to estimate a corrected variability, which no longer depends on mean expression. In both cases the corrected variability can then be meaningfully compared across local neighbourhoods. The distribution of corrected variabilities can thus be compared between cell states, or clusters, in order to infer genes with differential variability.
The corrected variability estimate can be computed for the local neighbourhoods of all cells:
noise <- compNoise(d,res,regNB=TRUE,pvalue=0.01,genes = NULL,no_cores=1)
The default approach is the second approach, i.e., the direct correction of the variability by regressing out the baseline variability. Alternatively, the negative binomial regression can be applied by setting the parameter regNB=TRUE
. The function returns the noise model (only used if regNB=FALSE
) and the corrected noise estimates. If regNB=TRUE
the function also return a list regData
with the logarithmic UMI counts of all cells used for the regression, the pearson residuals, and the parameters inferred by the regression before and after smoothing.
If the regression has been computed by pruneKnn
and the slot res$regData
is not NULL
, the pearson residuals from this step are used and the calculation is not repeated. However, it is also possible to determine neighborhoods with a given distance matrix and apply the regression only at this step for the noise inference.
If negative binomial regression has been applied (regNB=TRUE
), the raw and smoothed regression parameters can be plotted:
plotRegNB(d,noise,par.nb="beta")
par.nb
is the parameter to be plotted, and has to be one of "theta", "beta", or (Intercept)". theta
is the disersion parameter and beta
is the regression coefficient of the log10 total UMI count. (Intercept)
is the intercept inferred by the regression.
To check if the negative binomial regression could remove the mean-dependence of the variance, the latter can be plotted:
plotPearsonRes(noise,log=TRUE)
The plot also shows Spearman's correlation coefficient between mean and variance of the Pearson residuals. If a systematic dependence is still observed, the noise should be estimated by the alternative method based on regressing out the baseline mean-dependence of the variance (compNoise
with regNB=FALSE
).
noise <- compNoise(d,res,regNB=FALSE,pvalue=0.01,genes = NULL,no_cores=1)
The dependence of the variance on the mean, and the fit to the baseline variability (used if regNB=FALSE
) can be visualized before and after correction for the trend:
plotNoiseModel(noise)
plotNoiseModel(noise,corrected=TRUE)
The function baseLineVar
can be used to extracted the corrected variability. See help function.
To visualize the noise level using RaceID, the SCseq
object needs to be updated with the results from the noise computation:
sc <- updateSC(sc,noise=noise,flo=.1)
The flo
argument sets a lower limit to the noise estimates, since, due to locality, the estimates can fall below the baseline level. This will alleviate the impact of outliers in for the subsequent analysis and visualization.
This initializes a slot noise
of the SCseq
object with gene by cell matrix of corrected noise etsimates. The noise levels of a given gene (or group of genes), can then be visualized in the same way as gene expression with the help of RaceID functions.
Gene expression (on logarithmic scale):
plotexpmap(sc,"Lgr5",logsc=TRUE)
Corrected gene expression variability (on logarithmic scale):
plotexpmap(sc,"Lgr5",logsc=TRUE,noise=TRUE)
The corrected noise can also be plotted in a cluster heatmap and directly compared to gene expression. After deriving a gene expression heatmap, the order of genes stored in ph
(returned by the pheatmap
function of the pheatmap
package), can be applied to the noise heatmap in order to permit direct comparability:
genes <- c("Lyz1","Defa20","Agr2","Clca3","Muc2","Chgb","Neurog3","Apoa1","Aldob","Lgr5","Clca4","Mki67","Pcna")
ph <- plotmarkergenes(sc,genes=genes,noise=FALSE)
plotmarkergenes(sc,genes=genes[ph$tree_row$order],noise=TRUE,cluster_rows=FALSE)
With the clustering derived by graphCluster
and the object returned from the compNoise
function, genes with elevated noise levels in a cluster (or a set of clusters) can be inferred:
ngenes <- diffNoisyGenes(noise,cl,set=c(5),no_cores=1)
head(ngenes)
## log2FC pvalue
## Tac1 2.1105922 4.451282e-05
## Neurog3 1.9999921 2.260006e-04
## Chga 1.5452946 6.519589e-01
## Rplp2 1.3714260 1.352253e-11
## Cox4i1 0.9661081 5.824249e-09
## Ndufa10 0.9142775 4.452239e-06
This functions applies a t-test to compare variability in a cluster (or set of clusters) versus the remaining clusters. It is also possible to define a background set (bgr
) of clusters.
The top varaible genes can be plotted in a heatmap:
genes <- head(rownames(ngenes),50)
ph <- plotmarkergenes(sc,genes=genes,noise=TRUE,cluster_rows=FALSE,zsc=TRUE)
One can also cluster the cells and/or genes in the heatmap based on the noise quantification:
ph <- plotmarkergenes(sc,genes=genes,noise=TRUE,cluster_rows=TRUE,cluster_cols=TRUE)
The gene expression itself can be represented with the same order of cells and genes by utilizing the ph
output:
plotmarkergenes(sc,genes=ph$tree_row$labels[ ph$tree_row$order ],noise=FALSE,cells=ph$tree_col$labels[ ph$tree_col$order ], order.cells=TRUE,cluster_rows=FALSE)
It is also possible to order genes by their noise levels across a (set of) cluster or the entire dataset, if the cl
argument is omitted:
mgenes <- maxNoisyGenes(noise,cl=cl,set=5)
head(mgenes)
## Rplp2 Dmbt1 Tubb4b Tuba1b Tmem38b Fos
## 3.890883 3.148788 2.484531 2.311857 2.248384 2.170655
plotmarkergenes(sc,genes=head(names(mgenes),50),noise=TRUE)
To further study dynamics of gene expression across cell types and along inferred differentiation trajectories, the noise estimates can be co-analyzed with gene expression and lineage trajectories as inferred by RaceID and StemID. In particular, we mention that imputing expression with RaceID follows the same logic as in pruneKnn
. Therefore, the derived variability estimates can be analyzed in conjunction with the imputed mean expression in the same local neighbourhood (see compdist
and imputeexp
functions of RaceID).
VarID offers the possibility to remove batch effects and other confounding sources of variability. To achieve this, batch variables and other latent variables can be included as dependent variables in the negative binomial regression, utilized in the pruneKnn
and compNoise
functions if the parameter regNB
is set to TRUE
. For the pruneKnn
function, this option only exists if large
is set to TRUE
. The batch
parameter is given by a categorical vector of batch names.
sc <- SCseq(intestinalData)
sc <- filterdata(sc)
d <- getExpData(sc)
batch <- sub("_.+","",colnames(d))
names(batch) <- colnames(d)
head(batch)
To regress out additional latent variables, the parameter regVar
has to be initialized with a data.frame of these variable, e.g. a cell cycle score. Column names of regVar
indicate the latent variable names, and row names correspond to cell IDs, i.e. valid column names of d
. Interaction terms between the batch variable and the UMI count or the other latent variables in regVar
are included.
res <- pruneKnn(d,large=TRUE,pcaComp=100,regNB=TRUE,batch=batch,genes=NULL,knn=10,alpha=1,no_cores=1,FSelect=FALSE,ngenes=2000)
cl <- graphCluster(res,pvalue=0.01)
The corrected noise estimate can be computed for the local neighbourhoods of all cells:
noise <- compNoise(d,res,regNB=TRUE,batch=batch,pvalue=0.01,genes = NULL,no_cores=1)
Here, the pearson residuals from the regression are used if regNB
equals TRUE
. If the regression has been computed by pruneKnn
and the slot res$regData
is not NULL
, the pearson residuals from this step are used and the calculation is not repeated. However, it is also possible to determine neighborhoods with a given distance matrix and apply the regression only at this step for the noise inference.
To visualize the clustering with the help of RaceID, one can update the Scseq
object with the results:
sc <- updateSC(sc,res=res,cl=cl,noise=noise)
After computing a t-SNE map or, similarly, a umap, the clustering can be also be visualized in these maps:
sc <- compumap(sc)
plotsymbolsmap(sc,batch)
See previous paragraph for more advice on how to explore the output data.
For bug reports and any questions related to RaceID and StemID please email directly to link.
Processed count data from the Cell Ranger pipeline (or equivalent formats) can be read from the working directory (or any other path) and transformed into a sparse count matrix, which can be loaded into an SCseq
object. This requires the Matrix
package which is installed as a dependency together with RaceID
. The following chunk of code can be used:
require(Matrix)
require(RaceID)
x <- readMM("matrix.mtx")
f <- read.csv("features.tsv",sep="\t",header=FALSE)
b <- read.csv("barcodes.tsv",sep="\t",header=FALSE)
rownames(x) <- f[,1]
colnames(x) <- b[,1]
sc <- SCseq(x)
Datasets with >25k cells should be analyzed with VarID instead of RaceID, which is more suitable for large datasets and does not require a distance matrix. The memory requirement of a distance matrix scales with N^2 as a function of the number of cells N. VarID should be run with with parameters large=TRUE
and regNB=TRUE
of the pruneKnn
function in order to achieve optimal performance (default parameters). To increase processing speed VarID can be run with multiple cores. However, each core requires additional memory, and with too many cores VarID will run out of memory. To reduce computation time, the number of genes used for the negative binomial regression of total UMI count per cell can be limited to, e.g., 2000 by setting the ngenes
parameter.
Example for mouse data, including filtering of mitochondrial genes, ribosomal genes, and Gm* genes (and genes correlating to these groups):
require(Matrix)
require(RaceID)
x <- readMM("matrix.mtx")
f <- read.csv("features.tsv",sep="\t",header=FALSE)
b <- read.csv("barcodes.tsv",sep="\t",header=FALSE)
rownames(x) <- f[,1]
colnames(x) <- b[,1]
sc <- SCseq(x)
sc <- filterdata(sc,mintotal=1000,CGenes=rownames(x)[grep("^(mt|Rp(l|s)|Gm\\d)",rownames(x))])
expData <- getExpData(sc)
res <- pruneKnn(expData,no_cores=5)
cl <- graphCluster(res,pvalue=0.01)
probs <- transitionProbs(res,cl)
## compute noise from corrected variance
noise <- compNoise(expData,res,regNB=FALSE,pvalue=0.01,no_cores=5)
sc <- updateSC(sc,res=res,cl=cl,noise=noise,flo=.1)
sc <- updateSC(sc,res=res,cl=cl)
sc <- comptsne(sc)
sc <- compumap(sc)
This could be a consequence of technical noise, e.g., if experimental batches with strong differences in total UMI counts were co-analyzed. In such cases the background models of RaceID or VarID could indicate the presence of many spurious outliers. In RaceID, the number of outliers can be reduced by setting the the probthr
parameter of the findoutliers
function to values closer to zero. This is the probability threshold for outlier detection. In VarID the pvalue
parameter of the graphCluster
function can be set to lower values. This is the probability threshold for pruning links of the knn network. In the extreme case that no outliers (or pruning) are desired, probthr
(or pvalue
) should be set to zero. Moreover, the use.weights
parameter of the graphCluster
function can be set to FALSE
; in this case, Louvain clustering is performed on a (pruned) knn-network with equal weights for all links. By default, links are weighted by the link probability, and if strong technical differences lead to low link probabilities, this could be the reason for the emergence of too many clusters.
Grün, D., L. Kester, and A. van Oudenaarden. 2014. “Validation of noise models for single-cell transcriptomics.” Nat. Methods 11 (6): 637–40.
Grün, D., M. J. Muraro, J. C. Boisset, K. Wiebrands, A. Lyubimova, G. Dharmadhikari, M. van den Born, et al. 2016. “De Novo Prediction of Stem Cell Identity using Single-Cell Transcriptome Data.” Cell Stem Cell 19 (2): 266–77.
Haghverdi, L., A. T. L. Lun, M. D. Morgan, and J. C. Marioni. 2018. “Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors.” Nat. Biotechnology 36 (5): 421–27.
Herman, J. S., Sagar, and D. Grün. 2018. “FateID infers cell fate bias in multipotent progenitors from single-cell RNA-seq data.” Nat. Methods 15 (5): 379–86.
Love, M. I., W. Huber, and S. Anders. 2014. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biol. 15 (12): 550.