Introduction to clustermole

Overview

A typical computational pipeline to process single-cell RNA sequencing (scRNA-seq) data includes clustering of cells as one of the steps. Assignment of cell type labels to those clusters is often a time-consuming process that involves manual inspection of the cluster marker genes complemented with a detailed literature search. This is especially challenging when unexpected or poorly described populations are present. The clustermole R package provides methods to query thousands of human and mouse cell identity markers sourced from a variety of databases.

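The clustermole package provides three primary features: a database of cell type markers (clustermole_markers), cell type prediction based on marker genes (clustermole_overlaps), and cell type prediction based on a full expression matrix (clustermole_enrichment).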

Usage

You can install clustermole from CRAN.

install.packages("clustermole")

Load clustermole.

library(clustermole)

Retrieve cell type markers

You can use clustermole as a simple database and get a data frame of all cell type markers.

markers <- clustermole_markers(species = "hs")
markers
#> # A tibble: 422,292 x 8
#>    celltype_full      db     species organ  celltype   n_genes gene_origi… gene 
#>    <chr>              <chr>  <chr>   <chr>  <chr>        <int> <chr>       <chr>
#>  1 1-cell stage cell… CellM… Human   Embryo 1-cell st…      32 ACCSL       ACCSL
#>  2 1-cell stage cell… CellM… Human   Embryo 1-cell st…      32 ACVR1B      ACVR…
#>  3 1-cell stage cell… CellM… Human   Embryo 1-cell st…      32 ASF1B       ASF1B
#>  4 1-cell stage cell… CellM… Human   Embryo 1-cell st…      32 BCL2L10     BCL2…
#>  5 1-cell stage cell… CellM… Human   Embryo 1-cell st…      32 BLCAP       BLCAP
#>  6 1-cell stage cell… CellM… Human   Embryo 1-cell st…      32 CASC3       CASC3
#>  7 1-cell stage cell… CellM… Human   Embryo 1-cell st…      32 CLEC10A     CLEC…
#>  8 1-cell stage cell… CellM… Human   Embryo 1-cell st…      32 CNOT11      CNOT…
#>  9 1-cell stage cell… CellM… Human   Embryo 1-cell st…      32 DCLK2       DCLK2
#> 10 1-cell stage cell… CellM… Human   Embryo 1-cell st…      32 DHCR7       DHCR7
#> # … with 422,282 more rows

Each row contains a gene and a cell type associated with it. The gene column is the gene symbol and the celltype_full column contains the full cell type string, including the species and the original database. Use the species argument to retrieve either the human ("hs") or the mouse ("mm") version of the gene symbols.

Some tools require input as a list. To convert the markers from a data frame to a list format, you can use gene as the values and celltype_full as the grouping variable.

markers_list <- split(x = markers$gene, f = markers$celltype_full)
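
To make the conversion concrete, here is a minimal sketch on a toy data frame. The toy_markers table, its "DemoDB" label, and its genes are made up for illustration; only the gene and celltype_full column names come from the output above.

```r
# Toy stand-in for the markers table (values are illustrative, not from clustermole)
toy_markers <- data.frame(
  celltype_full = c("T cell | Human | DemoDB",
                    "T cell | Human | DemoDB",
                    "B cell | Human | DemoDB"),
  gene = c("CD3D", "CD3E", "CD79A"),
  stringsAsFactors = FALSE
)

# One character vector of genes per cell type, named by celltype_full
toy_list <- split(x = toy_markers$gene, f = toy_markers$celltype_full)

# Inspect the names and the number of genes in each set
names(toy_list)
lengths(toy_list)
```

The result is a named list, which is the shape that many gene set tools expect.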

Check the number of cell types in the database.

length(unique(markers$celltype_full))
#> [1] 3039

Check the cell type source databases.

sort(unique(markers$db))
#> [1] "ARCHS4"     "CellMarker" "MSigDB"     "PanglaoDB"  "SaVanT"    
#> [6] "TISSUES"    "xCell"

Cell types based on marker genes

If you have a character vector of genes, such as cluster markers, you can check whether they overlap known cell type markers more than expected by chance (overrepresentation analysis).

my_overlaps <- clustermole_overlaps(genes = my_genes_vec, species = "hs")
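
The idea behind overrepresentation analysis can be sketched with a generic hypergeometric test in base R. This is an illustration of the concept under stated assumptions, not necessarily the exact statistic that clustermole_overlaps computes. The gene vectors and the background size (toy_genes, toy_reference, n_background) are hypothetical.

```r
# Hypothetical inputs: a cluster's marker genes, one reference gene set,
# and the number of genes in the background (all values are illustrative)
toy_genes <- c("CD3D", "CD3E", "CD2", "IL7R", "MS4A1")
toy_reference <- c("CD3D", "CD3E", "CD3G", "CD2", "TRAC")
n_background <- 20000

# Genes shared between the query and the reference set
n_overlap <- length(intersect(toy_genes, toy_reference))

# Probability of seeing at least n_overlap shared genes when drawing
# length(toy_genes) genes at random from the background
p_value <- phyper(
  q = n_overlap - 1,
  m = length(toy_reference),
  n = n_background - length(toy_reference),
  k = length(toy_genes),
  lower.tail = FALSE
)
```

A small p_value suggests that the overlap is larger than random sampling would produce. In practice, the test is repeated for every cell type gene set and the p-values are adjusted for multiple testing.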

Cell types based on expression matrix

If you have expression values, such as average expression across clusters, you can perform cell type enrichment based on a given gene expression matrix (log-transformed CPM/TPM/FPKM values). The matrix should have genes as rows and clusters/samples as columns. The underlying enrichment method can be changed using the method parameter.

my_enrichment <- clustermole_enrichment(expr_mat = my_expr_mat, species = "hs")
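
If you are starting from raw counts, the sketch below shows one possible way to build a matrix of that shape in base R: log-transformed CPM values averaged per cluster, with genes as rows and clusters as columns. The counts, gene names, and cluster labels are made up, and averaging log values within each cluster is only one of several reasonable summaries.

```r
# Toy raw counts: 4 genes x 6 cells, filled column by column (values are illustrative)
counts <- matrix(
  c(10,  0,  5, 85,
    20,  0,  0, 80,
     0, 30, 10, 60,
     0, 40,  0, 60,
     5,  5, 50, 40,
     0, 10, 60, 30),
  nrow = 4,
  dimnames = list(c("CD3D", "MS4A1", "LYZ", "ACTB"), paste0("cell", 1:6))
)

# Hypothetical cluster assignment for each cell
clusters <- c("c1", "c1", "c2", "c2", "c3", "c3")

# Counts per million, then log-transform with a pseudocount of 1
cpm <- t(t(counts) / colSums(counts)) * 1e6
log_cpm <- log2(cpm + 1)

# Average within each cluster: genes stay as rows, clusters become columns
toy_expr_mat <- sapply(
  split(seq_along(clusters), clusters),
  function(idx) rowMeans(log_cpm[, idx, drop = FALSE])
)
```

A matrix like toy_expr_mat has the layout described above. A real analysis would use far more genes than this toy example.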