Package functionality

Our package provides a set of functions to explore and study microbiome data within the CoDA framework, with a special focus on the identification of microbial signatures that can serve as biomarkers of disease risk and prognosis. The prediction accuracy of a signature depends on the selection of the taxa that constitute it, which is challenging given the inherent sparsity, multivariate nature and compositionality of microbiome data (Susin et al. 2020).

coda4microbiome performs variable selection through penalized regression in cross-sectional studies, with either a binary or a continuous outcome. In addition, the package incorporates a new approach for the analysis of longitudinal microbiome studies with a binary outcome.

The penalized regression implementation relies on the function cv.glmnet() from the R package glmnet (Friedman et al. 2010), adapted to CoDA by using all pairwise log-ratios of the variables (Bates and Tibshirani 2019). The results are expressed as the (weighted) balance between two groups of taxa: those that contribute positively to the microbial signature and those that contribute negatively (Susin et al. 2020).
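The link between a model on pairwise log-ratios and a balance between two groups of taxa can be illustrated with a small base R sketch. It is not the package's code: the taxa names, abundances and coefficients are hypothetical values chosen for illustration. It shows that any linear combination of pairwise log-ratios can be rewritten as a combination of log-abundances whose coefficients sum to zero, so taxa split naturally into a positive and a negative group.

```r
# Toy composition of 4 taxa for one sample (hypothetical values)
x <- c(A = 10, B = 5, C = 20, D = 1)

# All pairs (j, k) with j < k, and a hypothetical coefficient for each
# pairwise log-ratio log(x_j / x_k), in the row order of `pairs`
pairs <- t(combn(names(x), 2))
beta  <- c(0.5, 0, -0.2, 0, 0.3, 0)

# Score written as a linear combination of pairwise log-ratios
score_lr <- sum(beta * log(x[pairs[, 1]] / x[pairs[, 2]]))

# Equivalent per-taxon coefficients: add beta for the numerator taxon,
# subtract beta for the denominator taxon
theta <- setNames(numeric(length(x)), names(x))
for (i in seq_len(nrow(pairs))) {
  theta[pairs[i, 1]] <- theta[pairs[i, 1]] + beta[i]
  theta[pairs[i, 2]] <- theta[pairs[i, 2]] - beta[i]
}

# Same score written as a zero-sum combination of log-abundances
score_lc <- sum(theta * log(x))

positive_group <- names(theta)[theta > 0]
negative_group <- names(theta)[theta < 0]
```

Because each log-ratio contributes its coefficient once with a plus sign and once with a minus sign, the per-taxon coefficients in `theta` always sum to zero, whatever values are chosen for `beta`.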

The interpretability of results is of major importance in this context. The package provides several graphical representations for a better interpretation of the analysis and the identified microbial signatures.

Functions for microbial signature identification

coda_glmnet: Identification of microbial signatures in cross-sectional studies.

The algorithm performs variable selection through penalized regression on the set of all pairwise log-ratios, for both binary outcomes (logistic regression) and continuous outcomes (linear regression). The result is expressed as the (weighted) balance between two groups of taxa. Non-compositional covariates can also be included.

A graphical representation of the taxa that constitute the signature and their coefficients is provided. The output also includes a plot of the prediction or classification accuracy.
coda_glmnet_longitudinal: Identification of microbial signatures in longitudinal studies.

Identifies a set of microbial taxa whose joint dynamics are associated with the (binary) phenotype of interest. The algorithm performs variable selection through penalized regression over a summary of the log-ratio trajectories (AUC). The result is expressed as the (weighted) balance between two groups of taxa.

The output provides three plots: the taxa that constitute the signature and their coefficients, the classification accuracy of the signature, and the signature trajectories of the individuals.
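The idea of summarizing a log-ratio trajectory by its area under the curve can be sketched in base R as follows. This is only an illustration under stated assumptions: the data frame `d`, its column names and the helper `trapz()` are hypothetical, and the trapezoidal rule is one possible way to approximate the integral, not necessarily the one used by coda_glmnet_longitudinal().

```r
# Hypothetical long-format data: 2 individuals, 3 time points each, 2 taxa
d <- data.frame(
  id   = c(1, 1, 1, 2, 2, 2),
  time = c(0, 1, 3, 0, 2, 3),
  A    = c(10, 20, 40, 5, 5, 5),
  B    = c(10, 10, 10, 5, 10, 20)
)

# Trapezoidal approximation of the integral of y over t
trapz <- function(t, y) {
  o <- order(t)
  t <- t[o]
  y <- y[o]
  sum(diff(t) * (y[-length(y)] + y[-1]) / 2)
}

# One summary value per individual: integral of the trajectory log(A / B)
auc <- sapply(split(d, d$id), function(di) trapz(di$time, log(di$A / di$B)))
```

Each individual contributes a single number per log-ratio, so a longitudinal data set is reduced to a table that has the same shape as a cross-sectional table of log-ratios.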

Functions for log-ratio exploratory analysis

Before, or independently of, variable selection for microbial signature identification, one may be interested in an exploratory analysis of the pairwise log-ratios.

Interpreting the results of a log-ratio analysis is challenging: when a taxon A is strongly associated with the outcome Y, any log-ratio involving A is likely to be associated with Y as well, regardless of which second taxon appears in the log-ratio. We therefore summarize the importance of each taxon A by aggregating the prediction accuracy of all log-ratios that involve A.

explore_logratios: Explores the association of each log-ratio with the outcome.
explore_lr_longitudinal: Explores the association of a summary (integral) of each log-ratio trajectory with the outcome.

Both functions summarize the importance of each taxon as the aggregation of the association measures of the log-ratios involving that taxon. The output includes a plot of the association of the log-ratios with the outcome, where the taxa are ranked by importance.
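The aggregation described in this section can be sketched in base R for a continuous outcome. The simulated data, the use of the absolute Pearson correlation as the association measure, and the sum as the aggregation are all assumptions made for illustration; the package functions may use different measures.

```r
set.seed(1)

# Simulated abundances for 4 hypothetical taxa and an outcome driven by taxon A
n <- 50
x <- matrix(rlnorm(n * 4), nrow = n,
            dimnames = list(NULL, c("A", "B", "C", "D")))
y <- log(x[, "A"]) + rnorm(n, sd = 0.1)

# Association of every pairwise log-ratio with y (assumed measure: |correlation|)
lx <- log(x)
k <- ncol(x)
assoc <- matrix(0, k, k, dimnames = list(colnames(x), colnames(x)))
for (i in 1:(k - 1)) {
  for (j in (i + 1):k) {
    assoc[i, j] <- assoc[j, i] <- abs(cor(lx[, i] - lx[, j], y))
  }
}

# Importance of a taxon: aggregate over all log-ratios that involve it
importance <- sort(rowSums(assoc), decreasing = TRUE)
```

In this toy setting every log-ratio involving A is associated with y, so A accumulates the largest total and is ranked first, which is the behaviour the summary is meant to capture.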

Supplementary functions

explore_zeros:

Provides the proportion of zeros for a pair of variables (taxa) in table x and the proportion of samples with zero in both variables. A bar plot with this information is also provided. Results can be stratified by a categorical variable.
impute_zeros:

Simple imputation: when the abundance table contains zeros, a positive value is added to all values in the table. The function adds 1 when the minimum of the table is larger than 1 (i.e. tables of counts); otherwise it adds half of the minimum value of the table.
logratios_matrix:

Computes the matrix of all pairwise log-ratios between taxa.
plot_prediction:

Plots the predictions of a fitted model (microbial signature): multiple box plots and density plots for binary outcomes, and a regression plot for continuous outcomes.
plot_signature:

Graphical representation of the variables selected and their coefficients
plot_signature_curves:

Graphical representation of the signature trajectories
coda_glmnet_null: Performs a permutational test for the coda_glmnet() algorithm

It provides the distribution of results under the null hypothesis by running coda_glmnet() on different permutations of the response variable.
filter_longitudinal: Filters those individuals and taxa with enough longitudinal information
coda_glmnet_longitudinal_null: Performs a permutational test for the coda_glmnet_longitudinal() algorithm

It provides the distribution of results under the null hypothesis by running coda_glmnet_longitudinal() on different permutations of the response variable.
shannon: Shannon information
shannon_effnum: Shannon effective number of variables in a composition
shannon_sim: Shannon similarity between two compositions
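Some of the helper computations listed above can be sketched in a few lines of base R. These are illustrative stand-ins, not the package functions: the names ending in `_sketch` are hypothetical, "minimum" in the imputation rule is read as the smallest positive entry (a table with zeros has an overall minimum of 0), a smallest positive entry equal to 1 is treated as a count table, the column naming of the log-ratio matrix is invented, and the Shannon measures follow the standard entropy definition with natural logarithms. Shannon similarity is left out because its definition is not given in the text.

```r
# Zero imputation following the rule described above.
# Assumption: "minimum" means the smallest positive entry of the table,
# and a smallest positive entry of exactly 1 counts as a table of counts.
impute_zeros_sketch <- function(x) {
  x <- as.matrix(x)
  if (!any(x == 0)) return(x)            # no zeros: table left unchanged
  min_pos <- min(x[x > 0])
  if (min_pos >= 1) x + 1 else x + min_pos / 2
}

# All pairwise log-ratios log(x_j / x_k) with j < k, one column per pair
logratios_sketch <- function(x) {
  lx <- log(as.matrix(x))
  idx <- combn(ncol(lx), 2)
  lr <- lx[, idx[1, ], drop = FALSE] - lx[, idx[2, ], drop = FALSE]
  colnames(lr) <- paste0("lr_", idx[1, ], "_", idx[2, ])  # hypothetical naming
  lr
}

# Shannon information (entropy) of a composition and its effective number
shannon_sketch <- function(x) {
  p <- x / sum(x)
  p <- p[p > 0]                          # 0 * log(0) is taken as 0
  -sum(p * log(p))
}
shannon_effnum_sketch <- function(x) exp(shannon_sketch(x))

# Small hypothetical count table: 2 samples, 3 taxa, one zero
counts <- matrix(c(0, 4, 6,
                   3, 3, 4), nrow = 2, byrow = TRUE)
x_imp <- impute_zeros_sketch(counts)
lr <- logratios_sketch(x_imp)
```

Zeros have to be handled before taking log-ratios, which is why the imputation step comes first in the usage lines at the end of the sketch.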

References

Bates S, Tibshirani R (2019). Log-ratio lasso: Scalable, sparse estimation for log-ratio models. Biometrics, 75(2), 613-624.

Calle ML (2019). Statistical Analysis of Metagenomics Data. Genomics & Informatics, 17(1).

Friedman J, Hastie T, Tibshirani R (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1-22.

Susin A, Wang Y, Lê Cao K-A, Calle ML (2020). Variable selection in microbiome compositional data analysis. NAR Genomics and Bioinformatics, 2(2).

coda4microbiome

Malu Calle and Toni Susin
