Introducing familiar

Alex Zwanenburg

2022-04-07

library(familiar)
library(data.table)

Familiar is a package that allows for end-to-end machine learning of tabular data, with subsequent evaluation and explanation of models. This vignette provides an overview of its functionality and how to configure and run an experiment.

Familiar in brief

This section provides installation instructions, a brief overview of the package, and the pipeline encapsulated by the summon_familiar function that is used to run an experiment.

Installing familiar

Stable versions of familiar can be installed from CRAN. Setting dependencies=TRUE also installs suggested packages, which prevents prompts to install packages later when using familiar.

install.packages("familiar",
                 dependencies=TRUE)

It can also be installed directly from the GitHub repository:

require(devtools)
devtools::install_github("https://github.com/alexzwanenburg/familiar",
                         dependencies=TRUE)

Pipeline

The pipeline implemented in familiar follows a standard machine learning process. A development dataset is used to perform the steps listed below. Many aspects of these steps can be configured, but the overall process is fixed:

After training, the models are assessed using the development dataset and any validation datasets. The models and the results of this analysis are written to a local directory.

Supported outcomes

Familiar supports modelling and evaluation of several types of endpoints:

Other endpoints are not supported. Handling of competing risk survival endpoints is planned for future releases.

Running familiar

The end-to-end pipeline is implemented in the summon_familiar function. This is the main function to use.

In the example below, we use the iris dataset, specify some minimal configuration parameters, and run the experiment. In practice, you may need to specify additional configuration parameters; see the Configuring familiar section.


# Example experiment using the iris dataset.
# You may want to specify a different path for experiment_dir.
# This is where results are written to.
familiar::summon_familiar(data=iris,
                          experiment_dir=file.path(tempdir(), "familiar_1"),
                          outcome_type="multinomial",
                          outcome_column="Species",
                          experimental_design="fs+mb",
                          cluster_method="none",
                          fs_method="mrmr",
                          learner="glm",
                          parallel=FALSE)

It is also possible to use a formula instead. This is generally feasible only for datasets with few features:


# Example experiment using a formula interface.
# You may want to specify a different path for experiment_dir.
# This is where results are written to.
familiar::summon_familiar(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                          data=iris,
                          experiment_dir=file.path(tempdir(), "familiar_2"),
                          outcome_type="multinomial",
                          experimental_design="fs+mb",
                          cluster_method="none",
                          fs_method="mrmr",
                          learner="glm",
                          parallel=FALSE)

Data does not need to be loaded prior to calling summon_familiar. A path to a csv file can also be provided. The data can also be a data.frame or data.table contained in an RDS or RData file. Other data formats are currently not supported. If categorical features are encoded using integer values, it is recommended to load the data and manually encode them, as is explained in the Preparing your data section.


# Example experiment using a csv datafile.
# Note that because the file does not exist,
# you will not be able to execute the code as is.
familiar::summon_familiar(data="path_to_data/iris.csv",
                          experiment_dir=file.path(tempdir(), "familiar_3"),
                          outcome_type="multinomial",
                          outcome_column="Species",
                          class_levels=c("setosa", "versicolor", "virginica"),
                          experimental_design="fs+mb",
                          cluster_method="none",
                          fs_method="mrmr",
                          learner="glm",
                          parallel=FALSE)
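
The csv example above cannot be executed because the file does not exist. As a minimal sketch, the iris dataset could first be written to a csv file using base R. The file location (a temporary directory) and the variable names are assumptions made for illustration only.

```r
# Write the iris dataset to a csv file in a temporary directory.
# The location is illustrative; any writable path would work.
data_file <- file.path(tempdir(), "iris.csv")
utils::write.csv(iris, file = data_file, row.names = FALSE)

# Read the file back to confirm that the outcome column survived the round trip.
iris_from_csv <- utils::read.csv(data_file, stringsAsFactors = FALSE)
print(unique(iris_from_csv$Species))
```

The data_file path could then be passed as the data argument. Because Species is stored as plain text in the csv file, specifying class_levels, as in the call above, keeps the order of the classes explicit.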

For reproducibility purposes, it may be useful to configure summon_familiar using the configuration xml file. In that case, we will point to a data file using the data_file parameter.


# Example experiment using a configuration file.
# Note that because the file does not exist,
# you will not be able to execute the code as is.
familiar::summon_familiar(config="path_to_configuration_file/config.xml")

Configuration parameters may also be mixed between parameters specified in the xml file and function arguments. Function arguments supersede parameters specified in the xml file:


# Example experiment using a configuration file, but with additional arguments.
# Note that because the configuration file does not exist,
# you will not be able to execute the code as is.
familiar::summon_familiar(config="path_to_configuration_file/config.xml",
                          data=iris,
                          parallel=FALSE)

Configuring familiar

Familiar is highly configurable. Parameters can be specified in two ways:

  1. Using a configuration file. An empty copy of the configuration file can be obtained using the familiar::get_xml_config function. The familiar::summon_familiar function should subsequently be called by specifying the config argument.

  2. By specifying function arguments for the familiar::summon_familiar function.
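
As a rough sketch of the first option, an empty configuration file could be copied to a directory of your choice. The directory name below is hypothetical, and the call is guarded so that the snippet also runs when familiar is not installed. The name of the copied file is not assumed here; the sketch simply lists the xml files that were found.

```r
# Hypothetical directory that will receive the empty configuration file.
config_dir <- file.path(tempdir(), "familiar_config")
dir.create(config_dir, showWarnings = FALSE, recursive = TRUE)

# Copy the empty configuration file, provided familiar is available.
if (requireNamespace("familiar", quietly = TRUE)) {
  familiar::get_xml_config(config_dir)
}

# List the xml files that are now present in the directory.
xml_files <- list.files(config_dir, pattern = "\\.xml$", full.names = TRUE)
print(xml_files)
```

After editing, the path to the file could be passed to familiar::summon_familiar through the config argument.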

All configuration parameters are documented in the help file of the familiar::summon_familiar function. Often, the default settings suffice. The parameters below should always be specified:

Though not always required, specifying the following parameters is recommended or situationally required:

Preparing your data

Familiar processes tabular data. In this case, a table consists of rows that represent instances, and columns that represent features and additional information. This is a very common representation for tabular data. Let us look at the colon dataset found in the survival package, which contains data from a clinical trial to assess a new anti-cancer drug in patients with colon cancer:

# Get the colon dataset.
data <- data.table::as.data.table(survival::colon)[etype==1]

# Drop some irrelevant columns.
data[, ":="("node4"=NULL, "etype"=NULL)]

knitr::kable(data[1:5])
id  study  rx       sex  age  obstruct  perfor  adhere  nodes  status  differ  extent  surg  time
1   1      Lev+5FU  1    43   0         0       0       5      1       2       3       0     968
2   1      Lev+5FU  1    63   0         0       0       1      0       2       3       0     3087
3   1      Obs      0    71   0         0       1       7      1       2       2       0     542
4   1      Lev+5FU  0    66   1         0       0       6      1       2       3       1     245
5   1      Obs      1    69   0         0       0       22     1       2       3       1     523

Here we see that each row contains a separate instance.

Identifier columns

The id and study columns are identifier columns. Familiar distinguishes four different types of identifiers:

  • Batch identifiers are used to identify data belonging to a batch, cohort or specific dataset. This is typically used for specifying external validation datasets (using the validation_batch_id parameter). It is also used to define the batches for batch normalisation. The name of the column containing batch identifiers (if any) can be specified using the batch_id_column parameter. If no column with batch identifiers is specified, all instances are assumed to belong to the same batch. In the colon dataset, the study column is a batch identifier column.

  • Sample identifiers are used to identify data belonging to a single sample, such as a patient, subject, customer, etc. Sample identifiers are used to ensure that instances from the same sample are not inadvertently spread across development and validation data subsets created for cross-validation or bootstrapping. This prevents information leakage, as instances from the same sample are often related – knowing one instance of a sample would make it easy to predict another, thus increasing the risk of overfitting. The name of the column containing sample identifiers can be specified using the sample_id_column parameter. If not specified, it is assumed that each instance forms a separate sample. In the colon dataset, the id column contains sample identifiers.

  • Within a sample, it is possible to have multiple series, for example due to measurements at different locations in the same sample. A series differs from repeated measurements. While for series the outcome value may change, this is not allowed for repeated measurements. The column containing series identifiers may be specified by providing the column name as the series_id_column parameter. If not set, all instances of a sample with a different outcome value will be assigned a unique identifier.

  • Within a sample, or series, it is possible to have repeated measurements, where one or more feature values may change but the outcome value does not. Such instances can, for example, be used to assess feature robustness. Repeated measurement identifiers are automatically assigned for instances that have the same batch, sample and series identifiers.
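
To illustrate why sample identifiers matter, the sketch below splits a toy dataset by sample rather than by row, so that no sample ends up in both subsets. This is not familiar's internal implementation. The toy data and all variable names are hypothetical.

```r
# Toy data: 6 samples, each measured twice (hypothetical values).
set.seed(1)
toy <- data.frame(
  sample_id = rep(1:6, each = 2),
  feature = rnorm(12)
)

# Assign whole samples, not individual rows, to the development subset.
sample_ids <- unique(toy$sample_id)
development_ids <- sample(sample_ids, size = 4)

development <- toy[toy$sample_id %in% development_ids, ]
validation <- toy[!toy$sample_id %in% development_ids, ]

# By construction, no sample identifier appears in both subsets.
shared_ids <- intersect(development$sample_id, validation$sample_id)
print(length(shared_ids))
```

Splitting by row instead could place one measurement of a sample in the development subset and the other in the validation subset, which is the leakage that the sample_id_column parameter is meant to prevent.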

Outcome columns

The colon dataset also contains two outcome columns, time and status, that define (censoring) time and survival status, respectively. Survival status is encoded as 0 for alive (censored) patients and 1 for patients that passed away after treatment. These values correspond to the defaults in familiar, so it is not necessary to pass them as the censoring_indicator and event_indicator parameters.
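
If a dataset encodes survival status differently, one option is to pass the values through the censoring_indicator and event_indicator parameters. Another is to recode the column so that it matches the 0 and 1 encoding described above. The sketch below shows the recoding route in base R, using hypothetical toy data and labels.

```r
# Toy survival data with a non-default status encoding (hypothetical values).
toy <- data.frame(
  time = c(968, 3087, 542, 245),
  status = c("deceased", "alive", "deceased", "deceased"),
  stringsAsFactors = FALSE
)

# Inspect which values occur in the status column.
print(table(toy$status))

# Recode to 0 (censored) and 1 (event).
toy$status_encoded <- ifelse(toy$status == "deceased", 1L, 0L)
print(toy)
```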

Feature columns

The remaining columns in the colon dataset represent features. There are two numeric features, age and nodes, a categorical feature, rx, and several categorical and ordinal features encoded with integer values. Familiar will automatically detect and encode features of character, logical or factor type. However, it will not automatically convert features encoded with integer values. This is by design: familiar cannot determine whether a feature with integer values is intended to be categorical or not. If your dataset contains categorical features that are encoded with integers, you should manually encode them prior to passing the data to familiar. For the colon dataset, this could be done as follows:

# Categorical features
data$sex <- factor(x=data$sex, levels=c(0, 1), labels=c("female", "male"))
data$obstruct <- factor(data$obstruct, levels=c(0, 1), labels=c(FALSE, TRUE))
data$perfor <- factor(data$perfor, levels=c(0, 1), labels=c(FALSE, TRUE))
data$adhere <- factor(data$adhere, levels=c(0, 1), labels=c(FALSE, TRUE))
data$surg <- factor(data$surg, levels=c(0, 1), labels=c("short", "long"))

# Ordinal features
data$differ <- factor(data$differ, levels=c(1, 2, 3), labels=c("well", "moderate", "poor"), ordered=TRUE)
data$extent <- factor(data$extent, levels=c(1, 2, 3, 4), labels=c("submucosa", "muscle",  "serosa", "contiguous_structures"), ordered=TRUE)

knitr::kable(data[1:5])
id  study  rx       sex     age  obstruct  perfor  adhere  nodes  status  differ    extent  surg   time
1   1      Lev+5FU  male    43   FALSE     FALSE   FALSE   5      1       moderate  serosa  short  968
2   1      Lev+5FU  male    63   FALSE     FALSE   FALSE   1      0       moderate  serosa  short  3087
3   1      Obs      female  71   FALSE     FALSE   TRUE    7      1       moderate  muscle  short  542
4   1      Lev+5FU  female  66   TRUE      FALSE   FALSE   6      1       moderate  serosa  long   245
5   1      Obs      male    69   FALSE     FALSE   FALSE   22     1       moderate  serosa  long   523

Manual encoding also has the advantage that ordinal features can be specified. Familiar cannot determine whether features with character type values have an associated order and will encode these as regular categorical variables. Another advantage is that manual encoding allows for specifying the reference level, i.e. the level to which other levels of a feature are compared in regression models. Otherwise, the reference level is taken as the first level after sorting the levels.
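
The effect of manual encoding on the reference level can be sketched with base R's factor function, which sorts levels by default. The example values are taken from the rx and differ columns. The variable names are illustrative.

```r
# Treatment arms as character values.
rx <- c("Obs", "Lev", "Lev+5FU", "Obs")

# Default encoding: levels are sorted, so the first sorted level is the reference.
rx_default <- factor(rx)
print(levels(rx_default))

# Manual encoding: the observation arm is explicitly placed first.
rx_manual <- factor(rx, levels = c("Obs", "Lev", "Lev+5FU"))
print(levels(rx_manual))

# Ordinal encoding: the order of the levels carries meaning.
differ <- factor(
  c("poor", "well", "moderate"),
  levels = c("well", "moderate", "poor"),
  ordered = TRUE
)
print(is.ordered(differ))
```

With the manual encoding, the observation arm becomes the level to which the two treatment arms are compared.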

Experimental designs

The experimental design defines how data analysis is performed. Familiar allows for various designs, from very straightforward training on a single dataset, to complex nested cross-validation with external validation. Experimental design is defined using the experimental_design parameter and consists of basic workflow components and subsampling methods. The basic workflow components are:

  • fs: feature selection.

  • mb: modelling.

  • ev: external validation.

Each basic workflow component can only appear once in the experimental design. It is possible to form an experiment using just the basic workflow components, i.e. fs+mb or fs+mb+ev. In these experiments, feature selection is directly followed by modelling, with external validation of the model on one or more validation cohorts for fs+mb+ev. These options correspond to TRIPOD type 1a and 3, respectively. TRIPOD analysis types 1b and 2 require more complicated experimental designs, which are facilitated by subsampling.

Hyperparameter optimisation does not require explicit specification. Hyperparameter optimisation is conducted when required to determine variable importance and prior to building a model.

Subsampling methods are used to (randomly) sample the data that are not used for external validation, and divide these data into internal development and validation sets. Thus the dataset as a whole is at most divided into three parts: internal development, internal validation and external validation. Familiar implements the following subsampling methods:

The x argument of subsampling methods can contain one or more of the workflow components. Moreover, subsampling methods can be nested. For example, experimental_design="cv(bt(fs,50)+mb,5)+ev" would create a 5-fold cross-validation of the development dataset, with each set of training folds again subsampled for feature selection. After aggregating variable importance obtained over 50 bootstraps, a model is trained within each set of training folds, resulting in 5 models overall. The ensemble of these models is then evaluated on an external dataset.
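
The bookkeeping of this nested design can be sketched in base R. The code below only counts how often feature selection and model building would run. It does not call familiar, does not stratify, and is not familiar's implementation. The number of instances is a hypothetical value.

```r
set.seed(1)
n_instances <- 150   # hypothetical size of the development dataset
n_folds <- 5         # cv(..., 5)
n_bootstraps <- 50   # bt(fs, 50)

# Randomly assign every instance to one of the folds.
fold_id <- sample(rep(seq_len(n_folds), length.out = n_instances))

n_feature_selection_runs <- 0
n_models <- 0
for (fold in seq_len(n_folds)) {
  # Training folds: all instances outside the current fold.
  training_rows <- which(fold_id != fold)

  for (b in seq_len(n_bootstraps)) {
    # Bootstrap: sample the training rows with replacement.
    bootstrap_rows <- sample(training_rows, size = length(training_rows), replace = TRUE)
    # Feature selection would run on bootstrap_rows here.
    n_feature_selection_runs <- n_feature_selection_runs + 1
  }

  # Variable importance would be aggregated over the bootstraps,
  # after which one model is trained on training_rows.
  n_models <- n_models + 1
}

print(c(feature_selection_runs = n_feature_selection_runs, models = n_models))
```

Under these assumptions, feature selection runs 250 times (5 sets of training folds times 50 bootstraps), whereas only 5 models are built.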

Other designs, such as experimental_design="bs(fs+mb,400)+ev", allow for building large ensembles and capturing the posterior distribution of the model predictions.

As a final remark, though it is possible to encapsulate the external validation (ev) workflow component in a subsampler, this is unnecessary. Unlike the feature selection (fs) and modelling (mb) components, ev is passive and only indicates whether external validation should be performed.
