Using LFC

Florian Erhard

2022-04-25

Digital expression measurements (e.g. RNA-seq) are often used to determine the change of quantities upon some treatment or stimulus. The resulting value of interest is the fold change (often logarithmized).

This effect size of the change is often treated as a value that can be computed as \(lfc(A,B)=\log_2 \frac{A}{B}\). However, due to the probabilistic nature of the experiments, the effect size rather is a random variable that must be estimated. This fact becomes obvious when considering that \(A\) or \(B\) can be 0, even if the true abundance is non-zero.

We have shown that this can be modelled in a Bayesian framework [1,2]. The intuitively computed effect size is the maximum likelihood estimator of a binomial model, where the effect size is not represented as fold change, but as proportion (of note, the log fold change simply is the logit transformed proportion). The Bayesian prior corresponds to pseudocounts frequently used to prevent infinite fold changes by \(A\) or \(B\) being zero. Furthermore, the Bayesian framework offers more advanced estimators (e.g. interval estimators or the posterior mean, which is the optimal estimator in terms of squared errors).

This R package offers the implementation to harness the power of this framework.

Basic usage

The most basic function to estimate effect sizes is

PsiLFC(A,B)

\(A\) and \(B\) are vectors of counts, corresponding to the n genes in conditions \(A\) and \(B\) (i.e. they are columns from the normal count matrices). What PsiLFC does it to obtain reasonable pseudocounts (see next section), compute the posterior mean for each entry, and then output normalized (i.e. median-centered) effect sizes in the log\(_2\) fold change representation.

Let’s consider a real world example:

library(DESeq2)
library(lfc)
data(airway, package="airway")
A <- assay(airway)[,2]
B <- assay(airway)[,1]

ll <- PsiLFC(A,B)
plot(ecdf(ll),xlim=c(-1,1),xlab="Log2 fold change treated/untreated",ylab="Cumulative frequency",main="Cell N61311",col='blue')
lines(ecdf(CenterMedian(log2((A+1)/(B+1)))),col='red')

This shows the log fold change distributions estimated by the posterior mean (blue) and the intuitive way using pseudocounts of 1 (red). This distribution is heavily distorted by several genes with no reads in any of the two conditions. Interestingly, the intuitive way of computing effect sizes results in an asymmetric distribution with more downregulated than upregulated genes.

plot(ecdf(ll[A>0 | B>0]),xlim=c(-1,1),xlab="Log2 fold change treated/untreated",ylab="Cumulative frequency",main="Cell N61311",col='blue')
lines(ecdf(CenterMedian(log2((A+1)/(B+1))[A>0 | B>0])),col='red')

Here, only genes are considered that have at least one read in one of the two conditions. The intuitive fold changes still appear to overestimate changes, as well as show more artifacts.

It is also possible to directly estimate effect sizes on SummarizedExperiment objects:

head(PsiLFC.se(airway,contrast=c("dex","untrt","trt")))
## ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457 ENSG00000000460 
##      0.33634752      0.00000000     -0.21035628     -0.04804218      0.10586343 
## ENSG00000000938 
##      0.89037506

Also, this package provides a drop-in replacement for DESeq2’s results function:

dds <- DESeqDataSetFromMatrix(countData = assay(airway),colData = colData(airway),design= ~ dex)
dds <- DESeq(dds)
res <- results(dds, contrast=c("dex","untrt","trt"),cre=TRUE)
head(res)
## DataFrame with 6 rows and 9 columns
##                   baseMean log2FoldChange     lfcSE      stat    pvalue
##                  <numeric>      <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003 708.602170      0.3788470  0.173141  2.188082 0.0286636
## ENSG00000000005   0.000000             NA        NA        NA        NA
## ENSG00000000419 520.297901     -0.2037604  0.100599 -2.025478 0.0428183
## ENSG00000000457 237.163037     -0.0340428  0.126279 -0.269584 0.7874802
## ENSG00000000460  57.932633      0.1171786  0.301237  0.388992 0.6972820
## ENSG00000000938   0.318098      1.7245505  3.493633  0.493627 0.6215698
##                      padj     PsiLFC Credible 0.05 Credible 0.95
##                 <numeric>  <numeric>     <numeric>     <numeric>
## ENSG00000000003  0.139383  0.3363475      0.273959     0.3988301
## ENSG00000000005        NA  0.0000000     -1.817845     1.8398897
## ENSG00000000419  0.183458 -0.2103563     -0.282880    -0.1378499
## ENSG00000000457  0.930540 -0.0480422     -0.155755     0.0597279
## ENSG00000000460  0.895428  0.1058634     -0.112729     0.3250639
## ENSG00000000938        NA  0.8903751     -0.665848     2.5667235

Important: To make this work, load lfc after DESeq2!

Note that here we also computed the 90% credible interval. The parameter cre can also be given to PsiLFC or PsiLFC.se!

References

  1. Erhard, F. (2018). Estimating pseudocounts and fold changes for digital expression measurements. Bioinformatics, 34(23), 4054–4063. https://doi.org/10.1093/bioinformatics/bty471
  2. Erhard, F., & Zimmer, R. (2015). Count ratio model reveals bias affecting NGS fold changes. Nucleic Acids Research, 43(20), e136–e136. https://doi.org/10.1093/nar/gkv696