Prerequisites

RaceID is a method for cell type identification from single-cell RNA-seq data by unsupervised learning. An initial clustering is followed by an outlier identification based on a backgorund model of combined technical and biological variability in single-cell RNA-seq data obtained by quantification with unique molecular identifiers. StemID permits subsequent inference of a lineage tree based on clusters, i.e. cell types, identified by RaceID. The current version of RaceID (RaceID3) and StemID (StemID2) are published (Herman, Sagar, and Grün 2018). This package implements additional improvements in rare cell type detection, offers batch effect removal utilities, optional imputing of gene expression, and substantially decreases runtime as well as memory usage. We tested the method successfully on a dataset of ~50k cells. We now added another method termed VarID to the package, which allows the quantification of local gene expression variability, but furthermore offers an alternative clustering approach to RaceID based on density clustering which has certain advantages and, in particular, permits handling much larger datasets, since there is no requiremnet of storing a distance matrix. This method can be fully integrated into a RaceID analysis (see section "VarID"). RaceID offers several optional steps and here we will show examples of how to perform typical RaceID/StemID analyses.

RaceID

Input to RaceID is a matrix of raw expression values (unique molecular identifiers with or without Poisson correction (Grün, Kester, and van Oudenaarden 2014)) with cells as column and genes as rows. This matrix can be provided as a matrix object, a data.frame or a sparse matrix produced by the Matrix package.

The RaceID package comes with sample data intestinalData for murine intestinal epethelial cells stored in sparse matrix format. The dataset was published previously (Grün et al. 2016).

RaceID and StemID functions have various input and output parameters, which are explained in detail on the help pages. Here, we mostly use default parameters, which represent a good choice for common datasets.

To start the analysis, a RaceID single-cell sequencing (SCseq) object is initialized with a count matrix.

library(RaceID)
sc <- SCseq(intestinalData)

The first step is the application of filtering for the purpose of quality control. Cells with a relatively low total number of transcripts are discarded.

sc <- filterdata(sc,mintotal=2000)

In this example, we filter out cells with <2,000 transcipts. The function allows further filtering of genes by choosing the input parameters minexpr and minnumber, i.e. only genes with at least minexpr transcripts in at least minnumber cells are retained. The filtered and normalized expression matrix (normalized to the minimum total transcript count across all cells retained after filtering) can be retrieved by the getfdata function:

fdata <- getfdata(sc)

The filterdata function allows the detection and removal of batch effects by different methods as outlined below in addtional examples. Alternatively, individual genes or groups of genes correlating to specific genes can be removed with the FGenes and CGenes arguments. This frequently allows a minimally invasive removal of batch effects or of particularly highly expressed genes with an unwanted dominating effect on the clustering.

Next, the distance matrix is computed as the input for clustering and outlier identification. This can be done with or without imputing gene expression from nearest neighbours (see example below for imputing). RaceID offers various alternative metric, e.g. spearman, kendall, and euclidean, as well as measure of proportionality (rho and phi from the propr package).

sc <- compdist(sc,metric="pearson")

This function computes the distance matrix based on highly variable genes by default. If all genes should be used, then the parameter FSelect needs to be set to FALSE. On this distance matrix clustering is performed:

sc <- clustexp(sc)

To infer the initial cluster number, this function computes the average within-cluster dispersion up to a number of clusters specified by the clustnr arguments, which equals 30 by default. If more major populations are expected, this parameter needs to be set to higher values which will increase the run time. The initial cluster number only serves as a rough estimate, since the subsequent outlier identification will infer additional clusters. The internal inference of the cluster number and the evaluation of cluster stability by the computation of Jaccard's similarity is done on all cells by default. For large datasets it is reasonable to subsample a limited number of cells, by setting the samp argument, e.g., to 1,000. In this case, the cluster number is inferred based on this smaller subset and Jacccard's similarity is not computed by bootstrapping but by sub-sampling. For k-medoids clustering, subsetting will imply almost deterministic clustering partitions, if the sample size approaches the size of the dataset. Therefore, samp should be signicantly smaller then the size of the dataset. Otherwise, bootstrapping is better for assessing the cluster stability. Subsampling can be expected to give a good estimate of the number of major clusters. Additional small clusters which might have been missed by the sampling can be reintroduces during the outlier identification step.

The inferred cluster number can be inspected in a saturation plot, which shows the decrease of the average within-cluster dispersion with increasing cluster number. If this decrease becomes constant, saturation is reached. The automatically chosen cluster number is detected such that the decrease is equal to the decrease upon further increasing the cluster number within the error bars:

plotsaturation(sc,disp=FALSE)

The average within-cluster dispersion can also by plotted:

plotsaturation(sc,disp=TRUE)

The cluster stability as assessed by Jaccard's similarity should also be inspected:

plotjaccard(sc)

In this example, the automated criterion overestimated the cluster number leading to instability as indicated by low Jaccard's similarity. Based on visual inspection of the average within-cluster dispersion as a function of the cluster number, we manually set the cluster number to 7 without recomputing the saturation behaviour.

sc <- clustexp(sc,cln=7,sat=FALSE)

This function perform k-medoids clustering by default. K-means clustering or hierarchical clustering can be chosen with the FUNcluster argument. For very large datasets, hierarchical clustering leads to significantly smaller run time.

Subsequently, outliers in the initial k-medoids clusters are identified based on an internally computed background model for the expected gene expression variability and the assumption that transcript counts follow a negative binomial distribution defined by the mean and the variance of the expression of each gene in a cluster. Outlier genes will be in the tail of this distribution at a p-value defined by the probthr parameter (1e-3 by default), and outlier cells require the presence of a number of outlier genes defined by the outlg parameter (2 by default).

sc <- findoutliers(sc)

In contrast to previous versions, outlier genes are inferred from non-normalized transcript counts, which follow a negative binomial distribution modeling the joint technical and biological variability. The assumption of a negative binomial distribution was demonstrated for raw transcript (UMI) count data, but is not strictly valid for normalized expression values (Grün, Kester, and van Oudenaarden 2014). Hence, RaceID does not require normalization, since the correlation-based metric for the computation of the distance object is also independent of the normalization. Normalizaion is only relevant when using, e.g., the euclidean metric for the derivation of the distance matrix. RaceID applies a simple size normalization for data representation and follow-up analyses.

The background noise model can be inspected:

plotbackground(sc)

The number of outliers as a function of the p-value can be plotted:

plotsensitivity(sc)

Another way of checking the presence of outliers is the inspection of a barplot of p-values across all cells in each cluster:

plotoutlierprobs(sc)

A heatmap of cell-to-cell distances grouped be the final clusters inferred based on the original clusters and the outliers allows visual inspection of the clusters:

clustheatmap(sc)

##  [1]  5  4 11  7  9  1  2  3 10  6  8

This function is not recommended for very large datasets, since it produces similarly large plotting output.

The best way of visualising the detetcted cell types is plotting cells and clusters in a two-dimensional reduction representaion. RaceID can compute a t-SNE map

sc <- comptsne(sc)

or a k-nearest neighbour graph layout utilizing the Fruchterman-Rheingold algorithm:

sc <- compfr(sc,knn=10)

In this example, the number of nearest neighbours was chosen to be 10. In general, different values for knn should be tested to find an ideal layout.

RaceID further allows computation of a umap dimensional reduction representation:

sc <- compumap(sc)

Parameters for umap can be given as additional argument (see help function).

The t-SNE map can be plotted by

plotmap(sc)

The Fruchterman-Rheingold layout can be plotted by

plotmap(sc,fr=TRUE)

The umap representation can be plotted by

plotmap(sc,um=TRUE)

Maps can be changed for both t-SNE and Fruchterman-Rheingold algorithm by initializing the rseed argument of the comptsne or compfr function with a random number.

The dimensional reduction maps can be inspected, e.g., for the localization of (a subset of) samples included in the analysis:

types <- sub("(\\_\\d+)$","", colnames(sc@ndata))
subset <- types[grep("IV|V",types)]
plotsymbolsmap(sc,types,subset=subset,fr=TRUE)

Expression of a gene of interest or aggregated expression for a group of genes can be highlighted in the dimensional reduction representation:

plotexpmap(sc,"Lyz1",logsc=TRUE,fr=TRUE)

g <- c("Apoa1", "Apoa1bp", "Apoa2", "Apoa4", "Apoa5")
plotexpmap(sc,g,n="Apoa genes",logsc=TRUE,fr=TRUE)

It is also possible to highight expression only for a subset of cells, e.g. a particular batch or sample:

sample <- colnames(sc@ndata)[grep("^I5d",colnames(sc@ndata))]
plotexpmap(sc,"Lyz1",cells=sample,logsc=TRUE,fr=TRUE)

For the murine intestinal example data, inspetion of known marker genes suggests that cluster 2 and 3 correspnd to Lgr5-expressing intestinal stem cells. Cluster 2 is proliferative as indicated by up-regulation of Mki67, Cluster 1 comprises transiently amplifying progenitors biased towards absorptive entorytes in cluster 4 marked by Apolipoproteins. Cluster 7 comprises Lysozyme-expressing Paneth cells while Mucus-producing goblet cells constitute clusters 6 and 10. To inspect clusters in more detail, differentially expressed genes can be inferred by an internal approach akin to DESeq2 (Love, Huber, and Anders 2014) but with a dispersion estimated globally from the background model of RaceID.

For instance, to obtain differentially expressed genes within cluster 9 compared to all other cells:

d  <- clustdiffgenes(sc,4,pvalue=.01)
dg <- d$dg
head(dg,25)

##          mean.ncl   mean.cl        fc           pv         padj
## Apoa1   0.6362058 21.398348 33.634315 0.000000e+00 0.000000e+00
## Apoa4   0.5702229 13.588646 23.830412 0.000000e+00 0.000000e+00
## Fabp1   0.7444607 21.348669 28.676691 0.000000e+00 0.000000e+00
## Fabp2   1.6023268 27.105795 16.916521 0.000000e+00 0.000000e+00
## Reg1    0.2360083  6.856378 29.051431 0.000000e+00 0.000000e+00
## Reg3b   0.4229290  7.360573 17.403803 9.665345e-88 3.371594e-85
## Gstm3   0.4901265  7.645545 15.599126 2.408636e-85 7.201821e-83
## Rbp2    0.1837525  4.865970 26.481103 1.340014e-73 3.505811e-71
## Reg3g   0.5903128  6.381428 10.810248 4.488428e-63 1.043809e-60
## Plac8   2.2049822 12.625809  5.726037 1.040516e-56 2.177800e-54
## Spink3  0.1808018  3.908387 21.616969 6.661021e-56 1.267411e-53
## Sis     0.2079850  4.014474 19.301750 3.388549e-55 5.910194e-53
## Anpep   0.1337270  3.602236 26.937242 4.383642e-55 7.057664e-53
## Guca2b  0.4864784  5.159079 10.604950 1.758228e-54 2.628551e-52
## Slc5a1  0.3030379  4.362130 14.394669 9.178274e-54 1.280675e-51
## Ces2e   0.8445971  6.651895  7.875820 9.650062e-53 1.262349e-50
## Apoc2   0.2539176  4.025073 15.851887 4.568302e-51 5.391473e-49
## Aldob   1.4071637  8.711907  6.191111 4.636718e-51 5.391473e-49
## Aldh1a1 0.1080674  2.965912 27.445004 4.730617e-47 5.211148e-45
## Crip1   1.3463788  7.794775  5.789437 4.182806e-45 4.377307e-43
## Mep1b   0.1180375  2.923125 24.764369 7.826703e-45 7.800614e-43
## Mttp    0.3015435  3.758237 12.463333 4.122038e-43 3.921557e-41
## Cyp3a11 0.1544745  3.071999 19.886768 6.232458e-43 5.671537e-41
## Clec2h  0.1374732  2.930410 21.316236 1.195432e-42 1.042516e-40
## Ckmt1   0.7801578  5.336224  6.839929 1.340511e-42 1.122276e-40

The differentially expressed genes (in this example only the up-regulated ones with a fold change >1) can be plottted in a heatmap, which can highlight the clusters and samples of origin:

types <- sub("(\\_\\d+)$","", colnames(sc@ndata))
genes <- head(rownames(dg)[dg$fc>1],10)
plotmarkergenes(sc,genes,samples=types)

The heatmap can also be ordered by cell names (i.e. by batch or sample) by setting order.cells to TRUE. With the input parameters cl and cells, the heatmap can be restricted to a subset of cluster or cells, respectively.

plotmarkergenes(sc,genes,cl=c(2,6,7,8,10),samples=types,order.cells=TRUE)

In this example, no inter-sample differences are apparent and all samples contribute to each cluster.

It is also possible to plot gene expression across clusters or samples using a dot plot, where the diameter represents the fraction of cells in a sample or cluster expressing a gene, and the color indicates the expression level or z-score.

For example, the following commmand plots expression z-scores across clusters 2,6,7,8, and 10:

fractDotPlot(sc, genes, cluster=c(2,6,7,8,10), zsc=TRUE)

Expression on a log2 scale across samples I, II, and III can be plotted by

samples <- sub("(\\d.+)$","", colnames(sc@ndata))
fractDotPlot(sc, genes, samples=samples, subset=c("I","II","III"), logscale=TRUE)

A differential gene expression analysis between two defined sets of cell, e.g., two (groups of) clusters can be performed:

A <- names(sc@cpart)[sc@cpart %in% c(2,4)]
B <- names(sc@cpart)[sc@cpart %in% c(3)]
x <- diffexpnb(getfdata(sc,n=c(A,B)), A=A, B=B )
plotdiffgenesnb(x,pthr=.05,lthr=.5,mthr=-1,Aname="Cl.2",Bname="Cl.3,5",show_names=TRUE,padj=TRUE)

See the paragraphs below for additional options of RaceID analyses and parameter choices ideal for analysing large datasets.

StemID: projection mode

StemID is an algorithm for the inference of lineage trees and differentiation trajectories based on pseudo-temporal ordering of single-cell transcriptomes. It utilizies the clusters predicted by RaceID and thus requires to run RaceID first. The algorithm was originally published along with RaceID2 (Grün et al. 2016) and the improved current version StemID2 was published later (Herman, Sagar, and Grün 2018). In a nutshell, StemID infers links between clusters which are more populated with intermediate single-cell transcriptomes than expected by chance. To assign cells to inter-cluster links, two fundamentally different strategies are available (see nmode argument below). The first strategy considers the projection of a vector connecting a cluster medoid to a member cell of the same cluster onto the links from the medoid of its cluster to the medoids of all other clusters. The longest projection identifies the link this cell is assigned to and defines the projection coordinate. The second (nearest neighbour) mode identifies for a cell in a given cluster the number of k nearest neighbours in each other cluster and assigns the cell to the link with the cluster where the average distance to these k nearest neighbours is minimized. The coordinate on the link is inferred as in the first mode. A faster approximate version of the first mode is also implemented.

As a first step, a lineage tree object for the StemID analysis needs to be initialized with an SCseq object obtained from a RaceID analysis:

ltr <- Ltree(sc)

Next, the transcriptome entropy of cell needs to be calculated. This is used by the StemID algorithm for the prediction of the stem cell type, based on maximum transcriptome entropy and maximum number of links to other clusters.

ltr <- compentropy(ltr)

In the subsequent step, cells are projected onto inter-cluster links. Cells are assigned to a link based on minimum distance to k nearest neighbours (nmode=TRUE) or based on the maximum projection coordinate (nmode=FALSE). Only clusters with >cthr cells are included in the analysis. If fr=TRUE then the Fruchterman-Rheingold layout will be used for representation of the inferred lineage tree, and if um=TRUE a UMAP representation will be used. Otherwise, representation will be done in t-SNE space. The computation of the lineage graph is independent of the dimensional reduction method which is only used for visualization.

ltr <- projcells(ltr,cthr=5,nmode=FALSE,fr=TRUE)

If projections are used for link determenation (nmode=FALSE), the derivation of link significance is done by comparing to the link population after randomizing cell positions within the boundaries imposed by the gene expression manifold. This is done by bootstrapping using 500 randomizations by default. More randomizations are possible, but obviously linearly increase runtime.

ltr <- projback(ltr,pdishuf=100)

Based on this information, a lineage graph is computed to approximate the lineage tree (a tree structure is not strictly imposed).

ltr <- lineagegraph(ltr)

Finally, link p-values are computed and a threshold pthr is applied on the p-values:

ltr <- comppvalue(ltr,pthr=0.1)

The resulting graph can be plotted, overlaid with a dimensional reduction representation (Fruchterman-Rheingold or t-SNE, see projcells). To retain only the more populated links, a cutoff scthr on the linkscore can be applied, e.g. 0.2:

plotgraph(ltr,scthr=0.2,showCells=FALSE,showMap=TRUE)

To predict the stem cell, the StemID score can be computed and visualized:

x <- compscore(ltr,scthr=0.2)

StemID offers a number of plotting functions to inspect the results. RaceID performs clustering using Pearson's correlation as a metric by default. The StemID projections require a Euclidean space and thus an initial embedding into a high-dimensional space is performed by classical multidimensional scaling. To inspect how well cell-to-cell distances are preserved, a histogram of the log-ratios between the original and transformed distances can be plotted:

plotdistanceratio(ltr)

The StemID prediction can be compared to a minimal spanning tree of the cluster medoids:

plotspantree(ltr)

The cell projections onto the links can be directly compared with this minimal spanning tree:

plotspantree(ltr,projections=TRUE)

All linkscores and fold enrichments of the population on a link can be plotted as heatmaps:

plotlinkscore(ltr)

projenrichment(ltr)

The (z-scores of the) projections of all cells from a given cluster across all links can be plotted as heatmap, e.g. for cluster 3:

x <- getproj(ltr,i=3)

All cells on two different branches originating from the same cluster, e.g. cluster 3 cells on the links to cluster 1 and 8, can be extracted for the computation of differentially expressed genes:

x <- branchcells(ltr,list("1.3","3.8"))
head(x$diffgenes$z)

##   Prss32     Nip7    Dhrs4    Srsf3   Tm4sf5  Tsc22d1 
## 3.224859 2.814688 2.368598 2.335748 2.200576 2.187001

The cells on the two branches can be plotted as additional clusters in the dimensional reduction representation:

plotmap(x$scl,fr=TRUE)

StemID: nearest-neighbour mode

Since the randomizations of cell positions for the derivation of link significance require long computation time, and the projection-based method leads to some weak links which are potentially false positives (and can be filtered out based on linkscore), the nearest-neighbour-based method has now been selected to be the default mode of StemID. This method is more robust and fast even on large datasets. The downside is that it will miss some weak links, i.e. lead to more false negatives in comparison to the projection mode.

First, a lineage tree object needs to be initialized followed by the calculation of the transcriptome entropy of each cell.

ltr <- Ltree(sc)
ltr <- compentropy(ltr)

Next, cell projection are calculated with the parameter nmode=TRUE, which is also the default value:

ltr <- projcells(ltr,cthr=5,nmode=TRUE,fr=TRUE,knn=3)

The knn parameter determines how many nearest neighbours are considered in each cluster for determining the link assignment: the distance two each cluster is calculated as the average across the distance of a cell to the knn nearest neighbours within each other cluster, and the cell is assigned to the link with the cluster minimizing this distance. Now, the lineage tree is inferred and the p-values for the links are calculated based on a binomial model:

ltr <- lineagegraph(ltr)
ltr <- comppvalue(ltr,pthr=0.05)

The resulting lineage graph can be inspected and reveals the expected trajectories connecting the stem cells (cluster 2 and 3 of cycling and quiescent cells, respectively) to enterocytes (cluster 4) via transiently amplifying progenitors (cluster 1), to Paneth cells (cluster 7), and to goblet cells (cluster 6). The StemID score suggests stem cell identity for clusters 2 and 3:

plotgraph(ltr,showCells=FALSE,showMap=TRUE)

x <- compscore(ltr)

RaceID Options:

Batch effect removal

RaceID offers the possibility of batch correction utilizing an internal method or the published mnnCorrect function from the batchelor package (Haghverdi et al. 2018). The batchelor package needs to be separately installed from bioconductor if this mode is desired. In order to apply batch effect removal, a list with a vector of cell ids for each batch needs to be defined, e.g.:

n <- colnames(intestinalData)
b <- list(n[grep("^I5",n)],n[grep("^II5",n)],n[grep("^III5",n)],n[grep("^IV5",n)],n[grep("^V5",n)])

This list is provided as input to the filterdata function, and the bmode argument is initialized with the desired method, i.e. mnnCorrect or RaceID. The latter method simply compares the local neigbourhood, i.e. the set of k nearest neighbours, for each cell between two batches and identifies the neighbourhood of the two batches with the smallest average distance. A differential gene expression analysis between the closest neighbourhoods of two batches yields batch associated genes. The next batch is then compared in the same way to the merged dataset of the previous batches. Batches are compared and successively merged according to the order they are provided in b. An additional input parameter knn controls the number of nearest neighbours, i.e. the size of the neighbourhood.

sc <- SCseq(intestinalData)
sc <- filterdata(sc,mintotal=2000,LBatch=b,bmode="RaceID",knn=10)

## RaceID3 batch correction... 
## Adding batch:  2 
##  
## Adding batch:  3 
##  
## Adding batch:  4 
## Rplp2 
## Adding batch:  5 
## Rplp2

The filterdata function will identify all batch-associated genes, which are stored in the filterpar$BGenes slot of the SCseq object. All genes that correlate to a batch gene are removed for the computation of a distance object. This is a minimally invasive strategy in comparison to mnnCorrect, which works well if batches are very similar, such as datasets produced from the same material using the same single-cell RNA-seq technology.

Imputing of gene expression

RaceID also offers optional imputing of gene expression, which can be useful if gene expression differences between cell types or states are governed only by lowly expressed genes and are difficult to resolve by clustering based on raw counts. If imputing is desired, the knn argument needs to be initialized with the number of nearest neighbours used for imputing gene expression :

sc <- compdist(sc,knn=5,metric="pearson")

Now, for each cell the knn nearest neighbours are used to infer a local negative binomial for each gene, utilizing a weighted mean expression and the internal RaceID noise model to obtain the corresponding negative binomial. The weights are derived by quadratic programming, computing the expression vector of a cell as a linear combination of its knn nearest neighbours. The cell itself contributes with the same weight as the aggregated weights of the nearest neighbours to the weighted mean expression. With the help of this negative binomial the tail probability of each gene is derived across all knn nearest neighbours. The geometric means of these tail probabilities are finally applied as a weights for each nearest neighbours in the calculation of the imputed gene expression. This strategy ensures that gene expression can only be learned from nearest neighbours following the same transcript count distributions.

After this, all steps remain the same. Imputing often helps to improve cluster discrimination and stability. Importantly, distances derived from imputed gene expression are only used for clustering. The outlier identification relies on unimputed gene expression, and hence can correct spurious clusters produced from imputed values.

sc <- clustexp(sc)
sc <- findoutliers(sc)
sc <- compfr(sc)
sc <- comptsne(sc)
plotmap(sc,fr=TRUE)

If batch effect removal has been applied, the remaining batch effect can be checked by plotting symbols representig the sample of origin:

types <- sub("(\\_\\d+)$","", colnames(sc@ndata))
plotsymbolsmap(sc,types,fr=TRUE)

Ideally, all samples should intermingle in each clusters. Imputed gene expression can be plotted by setting the imputed argument to TRUE. Otherwise, unimputed values are shown.

plotexpmap(sc,"Mki67",imputed=TRUE,fr=TRUE)
plotmarkergenes(sc,c("Clca4","Mki67","Defa24","Defa20","Agr2","Apoa1"),imputed=TRUE,samples=types)

An expression matrix with imputed expression values can be extracted for further analysis:

k <- imputeexp(sc)

Removing variability by regression

RaceID can also regress out variability associated with particular sources such as batches or cell cycle. If batch effect removal has been done by the filterdata function with bmode="RaceID" then this function can further regress out residual variability remaining after batch associated genes have been discarded for the distance computation. In the case, the argument Batch has to be set to TRUE and vars can be left empty if no further sources of variability should be regressed out. Batch effects can also be regressed out directly without prior removal using the filterdata function:

sc <- SCseq(intestinalData)
sc <- filterdata(sc,mintotal=2000)
vars <- data.frame(row.names=colnames(intestinalData),batch=sub("(\\_\\d+)$","",colnames(intestinalData)))
sc   <- varRegression(sc,vars)
sc <- compdist(sc,metric="pearson")
sc <- clustexp(sc)
sc <- findoutliers(sc)
sc <- comptsne(sc)
sc <- compfr(sc)
plotmap(sc)

However, regression also leads to loss of biological variation and this step is only recommended if variability associated with a particular variable is a strong confounding factor and cannot be removed by other means.

Prior dimensional reduction and removal of variability by PCA/ICA

After running the filterdata function, a prior dimensional reduction using PCA or ICA can be performed using the CCcorrect function. This function can also be provided with a list vset of sets of genes, and principal components with loadings enriched in any of these sets will be discarded. Another options is to provide a list CGenes of genes, and sets to be tested for enrichment in each component are derived as the groups of all genes correlating to a component in FGenes. The CCcorrect function predicts a dimension for the prior dimensional reduction based on an ellbow function of the explained variability as a function of the number of components. This can be inspected by the plotdimsat function and manually adjusted:

sc <- SCseq(intestinalData)
sc <- filterdata(sc,mintotal=2000)
sc <- CCcorrect(sc,dimR=TRUE)
plotdimsat(sc)
plotdimsat(sc,change=FALSE)
sc <- filterdata(sc,mintotal=2000)
sc <- CCcorrect(sc,nComp=9)
sc <- compdist(sc,metric="pearson")
sc <- clustexp(sc)
sc <- findoutliers(sc)
sc <- comptsne(sc)
sc <- compfr(sc)
plotmap(sc)

Rerunning CCcorrect requires to run filterdata first, because otherwise the dimensional reduction scores in the dimRed slot will be subject to a second dimensional reduction, which is not desired. All sub-sequent steps remain unaltered.

Inferring stable clusters by random forests analysis

RaceID provides the option to run a random forests based reclassifiction, in order to obtain a stable clustering partition. This can be done on the final clustering after running findoutliers:

sc <- SCseq(intestinalData)
sc <- filterdata(sc,mintotal=2000)
sc <- compdist(sc,metric="pearson")
sc <- clustexp(sc)
sc <- findoutliers(sc)
sc <- rfcorrect(sc)
sc <- comptsne(sc)
sc <- compfr(sc)
plotmap(sc)

However, this is normally not required, since the improved outlier detection of the current version leads to stable clusters which do not change substantially after applying this function. Running rfcorrect takes very long for large datasets and should be omitted in this case.

Parameters: Large Datasets

For large dataset (>25k cells), we recommend using the VarID approach (see below) for density based clustering which does not require computation and storage of a huge distance matrix. For datasets below this size we recommend running RaceID/StemID with specific parameter settings. Since the correlation metric does not require dataset normalization, this approach has certained advantages compared to VarID. However, the latter can also be run with a correlation matrix. In the following, we discuss a few paramters critical for the runtime of RaceID/StemID on large datasets.

sc <- SCseq(intestinalData)
sc <- filterdata(sc,mintotal=2000)
sc <- compdist(sc)

Preferentially, clustering should be done by FUNcluster="kmedoids" but "hclust" often gives similar results and is significantly faster.

sc <- clustexp(sc,samp=1000,FUNcluster="hclust")

For the determination of the cluster number and the inference of Jaccard's similarity, sub-sampling should be applied by setting the subset size samp to an integer number. A good choice could be 10-25% of the total number of cells. For large datasets this should be sufficient to discriminate the most abundant cluster, and additional smaller clusters will automatically be identified in the next step using the findoutliers function. It is important that the clustnr argument is much larger then the expected number of clusters. If stability analysis is not wanted, one can set bootnr=1. If clustering is re-run with a specific cluster number, e.g. cln=12, then the saturation criterion should be disabled by setting sat=FALSE.

If the cluster granularity is too fine or too coarse, the probthr argment can be decreased or increased, respectively, e.g.:

sc <- findoutliers(sc,probthr=1e-4)

It is adviced to change the perplexity argument to larger values, when computing the t-SNE map for large datasets, e.g. perplexity=200. However, a large perplexity will return an error for small datasets.

sc <- comptsne(sc,perplexity=100)
plotmap(sc)

The Fruchterman-Rheingold layout critically depends on the number knn of nearest neighbours, and different values should be tested:

sc <- compfr(sc,knn=10)
plotmap(sc,fr=TRUE)

For a sub-sequent StemID analysis of very large datasets, the nearest-neighbour mode (see above) will help to reduce the runtime.

Inspecting pseudo-temporal gene expression changes

To inspect pseudotemporal expression profiles, functions provided by the FateID package can be utilized. First, the trajectory needs to be defined based on a sequence of clusters. This sequence should ideally correspond to a trajectory predicted by StemID2, i.e. the clusters should be connected by a series of significant links. However, non-linked clusters can also be included. A pseudo-temporally ordered vector of cells along a StemID trajectory can be extracted with the cellsfromtree function from an Ltree object:

n <- cellsfromtree(ltr,c(2,1,4))

The list n contains a vector n$f of cell ids ordered by their projection coordinate along the trajectory reflecting differentiation pseudotime. This vector and a filtered or unfiltered expression matrix can be used as input for pseudo-temporal gene expression analysis. The filtered expression data used for RaceID3 can be extracted with the getfdata function:

x <- getfdata(ltr@sc)

Additional filtering and subsetting of the gene expression matrix for cells on the trajectory, n$f, is done in the next step, utilizing functions from the FateID package:

library(FateID)
fs  <- filterset(x,n=n$f)

The filterset function can be used to eliminate lowly expressed genes on the trajectory from the subsequent analysis and has two additional arguments to discard genes, which are not expressed at a level of minexpr or higher in at least minnumber of cells. The function returns a filtered expression data frame with genes as rows and cells as columns in the same order as in the input vector n. In the next step, a self-organizing map (SOM) of the pseudo-temporal expression profiles is computed:

s1d <- getsom(fs,nb=1000,alpha=.5)

This map provides a grouping of similar expression profiles into modules. The first input argument is again an expression data frame. In this case, we use the filtered expression table generated by the filterset function to retain only genes that are expressed on the trajectory under consideration. Pseudo-temporal expression profiles along the differentiation trajectory of interest are computed after smoothing by local regression with smoothing parameter alpha.

This function returns a list of the following three components, i. e. a som object returned by the function som of the package som, a data frame x with smoothed and normalized expression profiles, and a data frame zs of z-score transformed pseudo-temporal expression profiles.

The SOM is then processed by another function to group the nodes of the SOM into larger modules and to produce additional z-score transformed and binned expression data frames for display:

ps  <- procsom(s1d,corthr=.85,minsom=3)

The first input argument is given by the SOM computed by the function getsom. The function has two additional input parameters to control the grouping of the SOM nodes into larger modules. The parameter corthr defines a correlation threshold. If the correlation of the z-scores of the average normalized pseudo-temporal expression profiles of neighboring nodes in the SOM exceeds this threshold, genes of the neighboring nodes are merged into a larger module. Only modules with at least minsom genes are kept. The function returns a list of various data frames with normalized, z-score-transformed, or binned expression along with the assignment of genes to modules of the SOM (see man pages for details).

The output of the processed SOM can be plotted using the plotheatmap function. First, in order to highlight the clustering partition y the same color scheme as in the SCseq object can be used:

y    <- ltr@sc@cpart[n$f]
fcol <- ltr@sc@fcol

Now, the different output data frames of the procsom function can be plotted.

Plot average z-score for all modules derived from the SOM:

plotheatmap(ps$nodes.z,xpart=y,xcol=fcol,ypart=unique(ps$nodes),xgrid=FALSE,ygrid=TRUE,xlab=FALSE)

Plot z-score profile of each gene ordered by SOM modules:

plotheatmap(ps$all.z,xpart=y,xcol=fcol,ypart=ps$nodes,xgrid=FALSE,ygrid=TRUE,xlab=FALSE)

Plot normalized expression profile of each gene ordered by SOM modules:

plotheatmap(ps$all.e,xpart=y,xcol=fcol,ypart=ps$nodes,xgrid=FALSE,ygrid=TRUE,xlab=FALSE)

Plot binarized expression profile of each gene (z-score < -1, -1 < z-score < 1, z-score > 1):

plotheatmap(ps$all.b,xpart=y,xcol=fcol,ypart=ps$nodes,xgrid=FALSE,ygrid=TRUE,xlab=FALSE)

In order to inspect genes within individual modules of the SOM, these genes can be extracted given the number of the module. The module numbers are contained in the return value nodes of the procsom function and can be extracted, e. g. for module number 24:

g <- names(ps$nodes)[ps$nodes == 24]

The average pseudo-temporal expression profile of this group can be plotted by the function plotexpression:

plotexpression(fs,y,g,n$f,col=fcol,name="Node 24",cluster=FALSE,alpha=.5,types=NULL)

In the same way it is possible to plot expression profiles of individual genes, e.g.:

plotexpression(fs,y,"Clca4",n$f,col=fcol,cluster=FALSE,alpha=.5,types=NULL)

It is also possible to highlight the data points as specific symbols, for example reflecting batches, by using the types argument:

plotexpression(fs,y,g,n$f,col=fcol,name="Node 24",cluster=FALSE,alpha=.5,types=sub("\\_\\d+","",n$f))

VarID

VarID is a novel algorithm utilizing a k nearest neighbour graph derived from transcriptome similarities measured by a distance matrix with an arbitrary metric. The key idea is the inference of local neighbourhoods of cells in the same state, i.e. all k nearest neighbours follow the same transcript count distribution as the cell to which they are nearest neighbours. The computation starts with k nearest neighbours and infers a negative binomial distribtion for the transcript counts of each gene, derived from the mean-variance dependence computed the same way as for the RaceID background model. The mean expression is derived from a weighted average across the k nearest neighbours, where the weights are inferred using quadratic programming, representing the cell as a linear combination of its neighbours. The contribution of the "central" cell itself can be controlled by a given scale parameter alpha. Typical values of alpha should be in the range of 0 to 10. The default value is NULL. In this case, alpha is inferred by an optimization, i.e., alpha is minimized under the constraint that the gene expression in a cell does not deviate more then one standard deviation from the predicted weigthed mean, where the standard deviation is calculated from the predicted mean using the background model (the average dependence of the variance on the mean expression). With the negative binomial distribution inferred from the local means and the global mean-variance dependence, the probabilities of observed transcript counts for all genes can be computed in each neighbour, and, after multiple testing correction, the minimum probability across all genes serves as a proxy for the probability of neighbours being in the same state. If this probability falls below a user-defined threshold, the neighbour is discarded. After this procedure is applied to the nearest neighbours of each cell, a pruned k nearest neighbour graph is obtained, in which the remaining links indicate that neighbours follow the same transcript count distribution for all genes. These local neighbourhoods can be used to compute local properties such as gene expression variability, which in turn can be correlated to cell state or predicted fate, derived, e.g., from RaceID or FateID analysis, respectively.

Although VarID analysis can be performed independently of the RaceID data object, an integration with a RaceID analysis permits convenient visualization of the results. For this reason, we here demonstrate an integrated pipeline, but explain at each step, how computation can be performed given minimal input data, i.e. an expression matrix only.

The VarID method requires unique molecular identifier (UMI) counts, which have been shown to follow a negative binomial distribution, a property also crucial for the RaceID analysis.

Given such a UMI matrix with cells as columns and genes as rows, we could first initialize a RaceID object, and perform cell and gene filtering to retain only expressed genes and high quality cells. This is followed by the computation of a distance matrix, which can be used for the VarID analysis.

sc <- SCseq(intestinalData)
sc <- filterdata(sc)
# calculate distance matrix
sc <- compdist(sc)

However, the preferred mode of running VarID does not require a distance matrix, and relies on k-nearest neighbor analysis in PCA space (see below). For large datasets, the computation of a distance matrix is prohibitive due to a memory requirement scaling with N^2 if N is the number of cells. If the metric argument of the compdist function is not initialized otherwise, the distance matrix is calculated as 1 - Pearson's correlation of the transcriptomes.

The filtered, unnormalized, and otherwise unprocessed raw UMI count matrix can then be retrieved by

d <- getExpData(sc)

and the distance matrix can be obtained by

distM <- sc@distances

VarID can handle sparse matrices, so d can be a sparse matrix defined with the Matrix package, if created independently of RaceID. The getExpData function returns a sparse matrix. No distance matrix input is required for the distM argument of pruneKnn. In fact, if given it will be ignored as long as large is set to TRUE.

With this input the derivation and pruning of a k nearest neighbour graph can be performed:

res <- pruneKnn(d,distM=distM,large=FALSE,metric="pearson",genes=NULL,knn=10,alpha=NULL,no_cores=1,FSelect=FALSE)

In principle, only d is required as input data for the VarID analysis and the distance matrix can be computed internally if the distM argument is omitted. In this case, the metric argument defines the metric utilized for the distance matrix calculation.

The VarID analysis can also be performed without computing a distance matrix, since computation and storage of distance matrices is not feasible for very large datasets (>30k cells) on a conventional work station due to memory limitations. As an alternative, the pruneKnn function can be executed with the large argument set to TRUE, which is the default setting and usually shows better performance. In this case, the input expression matrix is first optionally corrected for variability associated with total transcript count per cell by a negative binomial regression (if regNB=TRUE as by default). The mean dependence of the regression parameters is smoothed, and Pearson residual are computed. The matrix of Pearson residuals (default setting), or, alternatively, the raw count matrix then undergoes a dimensional reduction by a principle component analysis (PCA), including the top pcaComp principle components (top 100 principle components by default). This reduced feature matrix is subsequently used for a fast k nearest neighbour search by the get.knn function from the FNN package based on euclidean distances, i.e., a correlation-based metric can no longer be used. The algorithm parameter of the pruneKnn function enables the selection of the search algorithm to apply. In general, it is recommended to run VarID with large=TRUE and regNB=TRUE (default settings). A RaceID object can be utilized to facilitate the analysis, but in this mode the compdist function does not need to be executed:

sc <- SCseq(intestinalData)
sc <- filterdata(sc)
d <- getExpData(sc)
res <- pruneKnn(d,large=TRUE,pcaComp=100,regNB=TRUE,genes=NULL,knn=10,alpha=1,no_cores=1,FSelect=FALSE,ngenes=2000)

The genes argument allows to limit the analysis to a subset of genes, knn defines the number of nearest neighbours, alpha is the scale parameter for the "central" cell, and FSelect allows internal feature selection. The background model is calculated internally in the same way as done in RaceID and feature selection follows the same logic. The function further permits multithreading and the number of cores can be defines using the no_cores argument. If this is set to NULL, the function detects the number of available cores and uses this number minus two to leave resources for other tasks running on the same machine. Calling help for this function explains all parameters in detail (as for all other functions).

This function internally fits a background model of gene expression variability, which can be inspected:

plotBackVar(res)

This model can also be computed separately:

bg <- fitBackVar(d)
plotBackVar(bg)

The function returns a list with the used distance matrix, a matrix with the indices of the k nearest neighbours for each cell, and a matrix with the corresponding probabilities. Finally, the background model is also returned.

From this object, a sparse distance matrix can optionally be created with non-zero entries only for k nearest neighbours with probabilities larger or equal than pvalue:

y <- createKnnMatrix(res,pvalue=0.01)

This is not needed as input for the subsequent step, but enables independent analyses by the user.

This pruned k nearest neighbour information can be used as input for graph inference, Louvain clustering, and the derivation of a Fruchterman-Rheingold graph layout:

cl <- graphCluster(res,pvalue=0.01)

The function returns a list of three components: a graph object graph, an object louvain containing informarmation on the clustering, and a data.frame fr containing a Fruchterman-Rheingold layout computed from the pruned k nearest neighbour graph, all derived using the igraph package.

By setting the use.weights argument of graphCluster to TRUE (default), clustering can be performed on the pruned knn graph with weighted links, where weights correspond to the link probabilities. This may lead to the emergence of too many clusters (which could happen, e.g., due to variability in the link probabilities when integrating different experimental batches). In this case, the use.weights parameter can be set to FALSE, and Louvain clustering is performed on a pruned knn graph with equal weights for all links.

The pvalue parameter determines the probability cutoff for link pruning, i.e., links with probabilities lower than pvalue are removed.

To visualize the clustering with the help of RaceID, one can update the Scseq object with the results:

sc <- updateSC(sc,res=res,cl=cl)

This function call will initialize the cpart and the cluster$kpart slots with the clustering partition in cl$louvain$membership, and the fr slot with the Fruchterman-Rheingold layout.

The clustering can readily be inspected in the Fruchterman-Rheingold layout:

plotmap(sc,fr=TRUE)

After computing a t-SNE map or, the clustering can be also be visualized:

sc <- comptsne(sc)
plotmap(sc)

Alternatively, a umap representation can be computed for visualization:

sc <- compumap(sc)
plotmap(sc,um=TRUE)

The pruned k nearest neighbour graph allows the derivation of transition probabilities between clusters, based on the remaining links connecting two clusters (at a minimum probability given by pvalue) and the inferred probabilities of these links:

probs <-transitionProbs(res,cl,pvalue=0.01)

If RaceID is used for visualization and has been updated with the clustering using the updateSC function (see above), the transition probabilities can be visualized in the dimensional reduction representation:

plotTrProbs(sc,probs,tp=.5,prthr=0,cthr=0)

The argument prthr allows discarding all links with a transition probability <= prthr and the th cthr argument sets a cluster size threshold. Only clusters with > cthr cells are shown.

The inference of homogenous local neighbourhoods, given by a cell and its remaining nearest neighbours after pruning, permits the computation of local quantities across such sets of similar cells. To investigate the dynamics of gene expression noise across cell states, we derive local estimates of gene expression variability. The systematic dependence of gene expression variance on the mean expression is a central challenge for the derivation of differential variability between two states. VarID solves this problem using one of two alternative strategies. The first strategy is based on a negative binomial regression on the total transcript count in each cell. The pearson residuals of this regression serve as an expression estimate corrected for the mean-dependence.To avoid overfitting, the regression parameters as a function of mean expression are smoothed by a local polynomial regression. Local variability is then computed as the variance of pearson residuals across the pruned k nearest neighbours. The second approach relies on infering a baseline variability, comprising biological and technical noise, reflecting the systematic dependency, and regresses out this trend, in order to estimate a corrected variability, which no longer depends on mean expression. In both cases the corrected variability can then be meaningfully compared across local neighbourhoods. The distribution of corrected variabilities can thus be compared between cell states, or clusters, in order to infer genes with differential variability.

The corrected variability estimate can be computed for the local neighbourhoods of all cells:

noise <- compNoise(d,res,regNB=TRUE,pvalue=0.01,genes = NULL,no_cores=1)

The default approach is the second approach, i.e., the direct correction of the variability by regressing out the baseline variability. Alternatively, the negative binomial regression can be applied by setting the parameter regNB=TRUE. The function returns the noise model (only used if regNB=FALSE) and the corrected noise estimates. If regNB=TRUE the function also return a list regData with the logarithmic UMI counts of all cells used for the regression, the pearson residuals, and the parameters inferred by the regression before and after smoothing.

If the regression has been computed by pruneKnn and the slot res$regData is not NULL, the pearson residuals from this step are used and the calculation is not repeated. However, it is also possible to determine neighborhoods with a given distance matrix and apply the regression only at this step for the noise inference.

If negative binomial regression has been applied (regNB=TRUE), the raw and smoothed regression parameters can be plotted:

plotRegNB(d,noise,par.nb="beta")

par.nb is the parameter to be plotted, and has to be one of "theta", "beta", or (Intercept)". theta is the disersion parameter and beta is the regression coefficient of the log10 total UMI count. (Intercept) is the intercept inferred by the regression.

To check if the negative binomial regression could remove the mean-dependence of the variance, the latter can be plotted:

plotPearsonRes(noise,log=TRUE)

The plot also shows Spearman's correlation coefficient between mean and variance of the Pearson residuals. If a systematic dependence is still observed, the noise should be estimated by the alternative method based on regressing out the baseline mean-dependence of the variance (compNoise with regNB=FALSE).

noise <- compNoise(d,res,regNB=FALSE,pvalue=0.01,genes = NULL,no_cores=1)

The dependence of the variance on the mean, and the fit to the baseline variability (used if regNB=FALSE) can be visualized before and after correction for the trend:

plotNoiseModel(noise)

plotNoiseModel(noise,corrected=TRUE)

The function baseLineVar can be used to extracted the corrected variability. See help function.

To visualize the noise level using RaceID, the SCseq object needs to be updated with the results from the noise computation:

sc <- updateSC(sc,noise=noise,flo=.1)

The flo argument sets a lower limit to the noise estimates, since, due to locality, the estimates can fall below the baseline level. This will alleviate the impact of outliers in for the subsequent analysis and visualization.

This initializes a slot noise of the SCseq object with gene by cell matrix of corrected noise etsimates. The noise levels of a given gene (or group of genes), can then be visualized in the same way as gene expression with the help of RaceID functions.

Gene expression (on logarithmic scale):

plotexpmap(sc,"Lgr5",logsc=TRUE)

Corrected gene expression variability (on logarithmic scale):

plotexpmap(sc,"Lgr5",logsc=TRUE,noise=TRUE)

The corrected noise can also be plotted in a cluster heatmap and directly compared to gene expression. After deriving a gene expression heatmap, the order of genes stored in ph (returned by the pheatmap function of the pheatmap package), can be applied to the noise heatmap in order to permit direct comparability:

genes <- c("Lyz1","Defa20","Agr2","Clca3","Muc2","Chgb","Neurog3","Apoa1","Aldob","Lgr5","Clca4","Mki67","Pcna")
ph <- plotmarkergenes(sc,genes=genes,noise=FALSE)

plotmarkergenes(sc,genes=genes[ph$tree_row$order],noise=TRUE,cluster_rows=FALSE)

With the clustering derived by graphCluster and the object returned from the compNoise function, genes with elevated noise levels in a cluster (or a set of clusters) can be inferred:

ngenes <- diffNoisyGenes(noise,cl,set=c(5),no_cores=1)
head(ngenes)

##            log2FC       pvalue
## Tac1    2.1105922 4.451282e-05
## Neurog3 1.9999921 2.260006e-04
## Chga    1.5452946 6.519589e-01
## Rplp2   1.3714260 1.352253e-11
## Cox4i1  0.9661081 5.824249e-09
## Ndufa10 0.9142775 4.452239e-06

This functions applies a t-test to compare variability in a cluster (or set of clusters) versus the remaining clusters. It is also possible to define a background set (bgr) of clusters.

The top varaible genes can be plotted in a heatmap:

genes <- head(rownames(ngenes),50)
ph <- plotmarkergenes(sc,genes=genes,noise=TRUE,cluster_rows=FALSE,zsc=TRUE)

One can also cluster the cells and/or genes in the heatmap based on the noise quantification:

ph <- plotmarkergenes(sc,genes=genes,noise=TRUE,cluster_rows=TRUE,cluster_cols=TRUE)

The gene expression itself can be represented with the same order of cells and genes by utilizing the ph output:

plotmarkergenes(sc,genes=ph$tree_row$labels[ ph$tree_row$order ],noise=FALSE,cells=ph$tree_col$labels[ ph$tree_col$order ], order.cells=TRUE,cluster_rows=FALSE)

It is also possible to order genes by their noise levels across a (set of) cluster or the entire dataset, if the cl argument is omitted:

mgenes <- maxNoisyGenes(noise,cl=cl,set=5)
head(mgenes)

##    Rplp2    Dmbt1   Tubb4b   Tuba1b  Tmem38b      Fos 
## 3.890883 3.148788 2.484531 2.311857 2.248384 2.170655

plotmarkergenes(sc,genes=head(names(mgenes),50),noise=TRUE)

To further study dynamics of gene expression across cell types and along inferred differentiation trajectories, the noise estimates can be co-analyzed with gene expression and lineage trajectories as inferred by RaceID and StemID. In particular, we mention that imputing expression with RaceID follows the same logic as in pruneKnn. Therefore, the derived variability estimates can be analyzed in conjunction with the imputed mean expression in the same local neighbourhood (see compdist and imputeexp functions of RaceID).

Removing batch effects and other confounding factors with VarID

VarID offers the possibility to remove batch effects and other confounding sources of variability. To achieve this, batch variables and other latent variables can be included as dependent variables in the negative binomial regression, utilized in the pruneKnn and compNoise functions if the parameter regNB is set to TRUE. For the pruneKnn function, this option only exists if large is set to TRUE. The batch parameter is given by a categorical vector of batch names.

sc <- SCseq(intestinalData)
sc <- filterdata(sc)
d <- getExpData(sc)
batch <- sub("_.+","",colnames(d))
names(batch) <- colnames(d)
head(batch)

To regress out additional latent variables, the parameter regVar has to be initialized with a data.frame of these variable, e.g. a cell cycle score. Column names of regVar indicate the latent variable names, and row names correspond to cell IDs, i.e. valid column names of d. Interaction terms between the batch variable and the UMI count or the other latent variables in regVar are included.

res <- pruneKnn(d,large=TRUE,pcaComp=100,regNB=TRUE,batch=batch,genes=NULL,knn=10,alpha=1,no_cores=1,FSelect=FALSE,ngenes=2000)
cl <- graphCluster(res,pvalue=0.01)

The corrected noise estimate can be computed for the local neighbourhoods of all cells:

noise <- compNoise(d,res,regNB=TRUE,batch=batch,pvalue=0.01,genes = NULL,no_cores=1)

Here, the pearson residuals from the regression are used if regNB equals TRUE. If the regression has been computed by pruneKnn and the slot res$regData is not NULL, the pearson residuals from this step are used and the calculation is not repeated. However, it is also possible to determine neighborhoods with a given distance matrix and apply the regression only at this step for the noise inference.

To visualize the clustering with the help of RaceID, one can update the Scseq object with the results:

sc <- updateSC(sc,res=res,cl=cl,noise=noise)

After computing a t-SNE map or, similarly, a umap, the clustering can be also be visualized in these maps:

sc <- compumap(sc)
plotsymbolsmap(sc,batch)

See previous paragraph for more advice on how to explore the output data.

For bug reports and any questions related to RaceID and StemID please email directly to link.

FAQ

How can I perform RaceID/VarID analysis on 10x Genomics Chromium Single Cell RNA-Seq data?

Processed count data from the Cell Ranger pipeline (or equivalent formats) can be read from the working directory (or any other path) and transformed into a sparse count matrix, which can be loaded into an SCseq object. This requires the Matrix package which is installed as a dependency together with RaceID. The following chunk of code can be used:

require(Matrix)
require(RaceID)
x <- readMM("matrix.mtx")
f <- read.csv("features.tsv",sep="\t",header=FALSE)
b <- read.csv("barcodes.tsv",sep="\t",header=FALSE)
rownames(x) <- f[,1]
colnames(x) <- b[,1]

sc <- SCseq(x)

How should I analyze large datasets (>25k cells)?

Datasets with >25k cells should be analyzed with VarID instead of RaceID, which is more suitable for large datasets and does not require a distance matrix. The memory requirement of a distance matrix scales with N^2 as a function of the number of cells N. VarID should be run with with parameters large=TRUE and regNB=TRUE of the pruneKnn function in order to achieve optimal performance (default parameters). To increase processing speed VarID can be run with multiple cores. However, each core requires additional memory, and with too many cores VarID will run out of memory. To reduce computation time, the number of genes used for the negative binomial regression of total UMI count per cell can be limited to, e.g., 2000 by setting the ngenes parameter.

Example for mouse data, including filtering of mitochondrial genes, ribosomal genes, and Gm* genes (and genes correlating to these groups):

require(Matrix)
require(RaceID)
x <- readMM("matrix.mtx")
f <- read.csv("features.tsv",sep="\t",header=FALSE)
b <- read.csv("barcodes.tsv",sep="\t",header=FALSE)
rownames(x) <- f[,1]
colnames(x) <- b[,1]

sc <- SCseq(x)
sc <- filterdata(sc,mintotal=1000,CGenes=rownames(x)[grep("^(mt|Rp(l|s)|Gm\\d)",rownames(x))])
expData <- getExpData(sc)
res   <- pruneKnn(expData,no_cores=5)
cl    <- graphCluster(res,pvalue=0.01)
probs <- transitionProbs(res,cl)

## compute noise from corrected variance
noise <- compNoise(expData,res,regNB=FALSE,pvalue=0.01,no_cores=5)
sc <- updateSC(sc,res=res,cl=cl,noise=noise,flo=.1)

sc <- updateSC(sc,res=res,cl=cl)
sc <- comptsne(sc)
sc <- compumap(sc)

What can I do if too many clusters were detected in my dataset?

This could be a consequence of technical noise, e.g., if experimental batches with strong differences in total UMI counts were co-analyzed. In such cases the background models of RaceID or VarID could indicate the presence of many spurious outliers. In RaceID, the number of outliers can be reduced by setting the the probthr parameter of the findoutliers function to values closer to zero. This is the probability threshold for outlier detection. In VarID the pvalue parameter of the graphCluster function can be set to lower values. This is the probability threshold for pruning links of the knn network. In the extreme case that no outliers (or pruning) are desired, probthr (or pvalue) should be set to zero. Moreover, the use.weights parameter of the graphCluster function can be set to FALSE; in this case, Louvain clustering is performed on a (pruned) knn-network with equal weights for all links. By default, links are weighted by the link probability, and if strong technical differences lead to low link probabilities, this could be the reason for the emergence of too many clusters.

References

Grün, D., L. Kester, and A. van Oudenaarden. 2014. “Validation of noise models for single-cell transcriptomics.” Nat. Methods 11 (6): 637–40.

Grün, D., M. J. Muraro, J. C. Boisset, K. Wiebrands, A. Lyubimova, G. Dharmadhikari, M. van den Born, et al. 2016. “De Novo Prediction of Stem Cell Identity using Single-Cell Transcriptome Data.” Cell Stem Cell 19 (2): 266–77.

Haghverdi, L., A. T. L. Lun, M. D. Morgan, and J. C. Marioni. 2018. “Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors.” Nat. Biotechnology 36 (5): 421–27.

Herman, J. S., Sagar, and D. Grün. 2018. “FateID infers cell fate bias in multipotent progenitors from single-cell RNA-seq data.” Nat. Methods 15 (5): 379–86.

Love, M. I., W. Huber, and S. Anders. 2014. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biol. 15 (12): 550.

RaceID reference manual

2022-07-19