Olink® Analyze Vignette

Olink DS team

2022-06-10

Olink® Analyze is an R package that provides a versatile toolbox for fast and easy handling of Olink® NPX data in proteomics research. It includes functions for importing Olink® NPX datasets exported from NPX Manager, quality control (QC) plot functions, and functions for various statistical tests. The package is meant to provide a convenient pipeline for Olink NPX data analysis.

Installation

You can install Olink® Analyze from CRAN.

install.packages("OlinkAnalyze")

List of functions

Preprocessing

Statistical analysis

Visualization

Sample datasets

Usage

Load the library

# Load OlinkAnalyze
library(OlinkAnalyze)

# Load other libraries used in Vignette
library(dplyr)
library(ggplot2)
library(stringr)

Preprocessing

Read NPX data (read_NPX)

The read_NPX function imports a wide-format NPX file exported from Olink® NPX Manager and converts the data into long format. The wide format is the most common way Olink® delivers data for Olink® Target 96; for data analysis in R, however, a long format is preferred. The NPX Manager output must not be altered before import for this function to work as expected.

Function arguments

  • filename: Path to the NPX Manager output file.
data <- read_NPX("~/NPX_file_location.xlsx")
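To make the wide-to-long conversion concrete, the sketch below reshapes a toy wide table with base R. It does not show how read_NPX is implemented and does not parse a real NPX Manager file; the table `wide`, its assay columns and all values are hypothetical.

```r
# Toy wide table: one row per sample, one column per assay (hypothetical values)
wide <- data.frame(
  SampleID = c("S1", "S2", "S3"),
  IL6      = c(3.2, 4.1, 2.8),
  TNF      = c(5.0, 5.4, 4.9),
  stringsAsFactors = FALSE
)

# Every column except SampleID is treated as an assay
assay_cols <- setdiff(names(wide), "SampleID")

# Long format: one row per sample and assay combination
long <- data.frame(
  SampleID = rep(wide$SampleID, times = length(assay_cols)),
  Assay    = rep(assay_cols, each = nrow(wide)),
  NPX      = unlist(wide[assay_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

print(long)
```

In long format each measurement is its own row, which is the shape that grouping and modelling per assay rely on.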

Function output

A tibble in long format containing:

  • SampleID: Sample names or IDs.
  • Index: Unique number for each SampleID. It is used to distinguish non-unique sample IDs.
  • OlinkID: Unique ID for each assay assigned by Olink. If an assay is included in more than one panel, it has a different OlinkID in each one.
  • UniProt: UniProt ID.
  • Assay: Common gene name for the assay.
  • MissingFreq: Missing frequency for the OlinkID, i.e. frequency of samples with NPX value below limit of detection (LOD).
  • Panel: Olink Panel that the samples were run on. Read more about Olink Panels here: https://www.olink.com/products-services/.
  • Panel_Version: Version of the panel. A new panel version might include some different or improved assays.
  • PlateID: Name of the plate.
  • QC_Warning: Indication of whether the sample passed Olink QC. Read more here: https://www.olink.com/faq/how-is-quality-control-of-the-data-performed/.
  • LOD: Limit of detection (LOD) is the minimum level of an individual protein that can be measured. LOD is defined as 3 times the standard deviation over background.
  • NPX: Normalized Protein eXpression is Olink’s unit of protein expression level on a log2 scale. The majority of the functions in this package use NPX values for calculations. Read more about NPX here: https://www.olink.com/faq/what-is-npx/.
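Two of the columns above lend themselves to a small worked example: MissingFreq as the fraction of samples below LOD, and NPX as a log2-scale unit. The sketch below uses base R on a hypothetical data frame `toy_npx`; the OlinkIDs and values are invented for illustration and the computation is not taken from the package source.

```r
# Hypothetical long-format NPX values for two assays (not real Olink data)
toy_npx <- data.frame(
  OlinkID = rep(c("OID_A", "OID_B"), each = 4),
  NPX     = c(1.2, 0.4, 2.0, 0.3, 5.1, 4.8, 5.5, 6.0),
  LOD     = rep(c(0.5, 1.0), each = 4)
)

# Fraction of samples with NPX below LOD, computed per assay
missing_freq <- tapply(toy_npx$NPX < toy_npx$LOD, toy_npx$OlinkID, mean)
print(missing_freq)

# NPX is on a log2 scale, so a difference of d NPX corresponds to a
# 2^d fold change on the linear scale
npx_difference <- 1.5
fold_change <- 2^npx_difference
print(fold_change)
```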

Statistical analysis

Post-hoc ANOVA analysis (olink_anova_posthoc)

olink_anova_posthoc performs a post-hoc ANOVA test using the function emmeans from the R library emmeans with Tukey p-value adjustment per assay (by OlinkID) at confidence level 0.95.

The function handles both factor and numerical variables and/or covariates. The post-hoc test for a numerical variable compares the difference in means of the outcome variable (default: NPX) for 1 standard deviation (SD) difference in the numerical variable, e.g. mean NPX at mean (numerical variable) versus mean NPX at mean (numerical variable) + 1*SD (numerical variable).
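The 1 SD comparison for a numerical variable can be illustrated with an ordinary linear model in base R. This is a sketch under stated assumptions, not the emmeans-based computation the package performs: the variable `age`, the simulated values and the sign convention (value at mean + 1 SD minus value at the mean) are all hypothetical.

```r
set.seed(1)

# Simulated data: a numerical variable and an outcome that depends on it
age <- rnorm(50, mean = 60, sd = 10)
npx <- 2 + 0.05 * age + rnorm(50, sd = 0.5)

fit <- lm(npx ~ age)

# Predicted mean NPX at mean(age) and at mean(age) + 1 * SD(age)
at <- data.frame(age = c(mean(age), mean(age) + sd(age)))
pred <- predict(fit, newdata = at)

# Difference in predicted means for a 1 SD difference in age
diff_1sd <- unname(pred[2] - pred[1])

# For a linear model this equals the slope multiplied by the SD
slope_times_sd <- unname(coef(fit)["age"] * sd(age))
print(c(diff_1sd = diff_1sd, slope_times_sd = slope_times_sd))
```

Scaling by the SD puts the effect of a numerical variable on a comparable footing across variables measured in different units.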

Function arguments

  • df: NPX data frame in long format. It should minimally contain protein name (Assay), OlinkID, UniProt, Panel and an outcome factor with at least 3 levels.
  • olinkid_list: Character vector of OlinkIDs on which to perform the post-hoc analysis. If not specified, all assays in df are used.
  • variable: Single character value or character array. A single character value should name a column in df. If length > 1, the included variable names are used in crossed analyses. The notations ‘:’ and ‘*’ are also accepted.
  • covariates: Single character value or character array. Default: NULL. Confounding factors to include in the analysis. A single character value should name a column in df. The notations ‘:’ and ‘*’ are also accepted; crossed analysis is not inferred from main effects.
  • outcome: Name of the column from df that contains the dependent variable. Default: NPX.
  • effect: Term on which to perform the post-hoc analysis. Character vector. Must be a subset of or identical to variable.
  • mean_return: Logical. If TRUE, returns the mean of each factor level rather than the difference in means (default). Note that no p-value is returned for mean_return = TRUE and no adjustment is performed.
  • verbose: Logical. Default: TRUE. Whether information about removed samples, factor conversion and the final model formula should be printed to the console.
# calculate the p-value for the ANOVA
anova_results_oneway <- olink_anova(df = npx_data1, 
                                    variable = 'Site')
# extracting the significant proteins
anova_results_oneway_significant <- anova_results_oneway %>%
  filter(Threshold == 'Significant') %>%
  pull(OlinkID)
anova_posthoc_oneway_results <- olink_anova_posthoc(df = npx_data1,
                                                    olinkid_list = anova_results_oneway_significant,
                                                    variable = 'Site',
                                                    effect = 'Site')

Function output

A tibble with the following columns:

  • Assay <chr>: Assay name.
  • OlinkID <chr>: Unique Olink ID.
  • UniProt <chr>: UniProt ID.
  • Panel <chr>: Olink Panel.
  • term <chr>: Name of the variable that was used for the p-value calculation. The “:” between variables indicates interaction between variables.
  • contrast <chr>: Levels (of term) that are compared.
  • estimate <dbl>: Difference in mean NPX between the levels compared (from contrast).
  • conf.low <dbl>: Lower bound of the confidence interval for the estimate.
  • conf.high <dbl>: Upper bound of the confidence interval for the estimate.
  • Adjusted_pval <dbl>: Adjusted p-value for the test (Tukey).
  • Threshold <chr>: Text indication if assay is significant (adjusted p-value < 0.05).
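For a single assay, the shape of this output can be reproduced approximately with base R. The sketch below swaps in aov and TukeyHSD from the stats package instead of emmeans, so it is an analogue rather than the package's method; the simulated data, the renamed columns and the "Non-significant" label are assumptions made for illustration, and the sign of each difference may not match the package's contrasts.

```r
set.seed(42)

# Simulated NPX values for one assay measured at three sites
toy <- data.frame(
  Site = factor(rep(c("Site_A", "Site_B", "Site_C"), each = 10)),
  NPX  = c(rnorm(10, mean = 5), rnorm(10, mean = 5.2), rnorm(10, mean = 7))
)

# One-way ANOVA followed by Tukey-adjusted pairwise comparisons
fit <- aov(NPX ~ Site, data = toy)
tukey <- as.data.frame(TukeyHSD(fit, "Site", conf.level = 0.95)$Site)

# Rename the TukeyHSD columns (diff, lwr, upr, p adj) to resemble the
# columns described above
names(tukey)[1:4] <- c("estimate", "conf.low", "conf.high", "Adjusted_pval")
tukey$contrast <- rownames(tukey)
rownames(tukey) <- NULL
tukey$Threshold <- ifelse(tukey$Adjusted_pval < 0.05,
                          "Significant", "Non-significant")

print(tukey)
```

Three factor levels give three pairwise contrasts, each with its own estimate, interval and adjusted p-value.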

Post-hoc one way non-parametric analysis (olink_one_non_parametric_posthoc)

olink_one_non_parametric_posthoc performs a post-hoc Wilcoxon test using the function wilcox_test from the R library rstatix with Benjamini & Hochberg p-value adjustment per assay (by OlinkID) at confidence level 0.95. The function handles both factor and numerical variables and/or covariates.

Function arguments

  • df: NPX data frame in long format. It should minimally contain protein name (Assay), OlinkID, UniProt, Panel and an outcome factor with at least 3 levels.
  • olinkid_list: Character vector of OlinkIDs on which to perform the post-hoc analysis. If not specified, all assays in df are used.
  • variable: Single character value or character array. A single character value should name a column in df.
  • verbose: Logical. Default: TRUE. Whether information about removed samples, factor conversion and the final model formula should be printed to the console.
# npx_df is not defined elsewhere in this vignette; as an assumption, it is
# taken here to be npx_data1 with the control samples removed
npx_df <- npx_data1 %>%
  filter(!grepl('control', SampleID, ignore.case = TRUE))

# Friedman test
Friedman_results <- olink_one_non_parametric(npx_df, "Time", dependence = TRUE)

# Keep the OlinkIDs of the significant assays
significant_assays <- Friedman_results %>%
  filter(Threshold == 'Significant') %>%
  dplyr::select(OlinkID) %>%
  distinct() %>%
  pull()

# Post-hoc test for the results from the Friedman test
friedman_posthoc_results <- olink_one_non_parametric_posthoc(npx_df,
                                                             variable = c("Time"),
                                                             olinkid_list = significant_assays)

Function output

A tibble with the following columns:

  • Assay <chr>: Assay name.
  • OlinkID <chr>: Unique Olink ID.
  • UniProt <chr>: UniProt ID.
  • Panel <chr>: Olink Panel.
  • term <chr>: Name of the variable that was used for the p-value calculation.
  • contrast <chr>: Levels (of term) that are compared.
  • estimate <dbl>: Difference in mean NPX between the levels compared (from contrast).
  • conf.low <dbl>: Lower bound of the confidence interval for the location parameter.
  • conf.high <dbl>: Upper bound of the confidence interval for the location parameter.
  • Adjusted_pval <dbl>: Adjusted p-value for the test (Benjamini & Hochberg).
  • Threshold <chr>: Text indication if assay is significant (adjusted p-value < 0.05).
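The idea of pairwise Wilcoxon tests with Benjamini & Hochberg adjustment can be sketched for one assay with base R. The example uses pairwise.wilcox.test from the stats package rather than rstatix::wilcox_test, so it is an analogue, not the package's implementation; the time point labels and simulated values are hypothetical.

```r
set.seed(7)

# Simulated NPX values for one assay at three time points
time_levels <- c("Baseline", "Week.6", "Week.12")
toy <- data.frame(
  Time = factor(rep(time_levels, each = 8), levels = time_levels),
  NPX  = c(rnorm(8, mean = 4), rnorm(8, mean = 4.5), rnorm(8, mean = 6))
)

# All pairwise Wilcoxon rank-sum tests, with BH-adjusted p-values
pw <- pairwise.wilcox.test(toy$NPX, toy$Time, p.adjust.method = "BH")

# Lower-triangular matrix of adjusted p-values, one cell per pair of levels
print(pw$p.value)
```

For repeated measurements on the same subjects, pairwise.wilcox.test also accepts paired = TRUE, provided the observations are ordered consistently within each level.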

Post-hoc of regression models for ordinal data analysis (olink_ordinalRegression_posthoc)

olink_ordinalRegression_posthoc performs a post-hoc ANOVA test using the function emmeans from the R library emmeans with Tukey p-value adjustment per assay (by OlinkID) at confidence level 0.95. The function handles both factor and numerical variables and/or covariates.

Function arguments

  • df: NPX data frame in long format. It should minimally contain protein name (Assay), OlinkID, UniProt, Panel and an outcome factor with at least 3 levels.
  • olinkid_list: Character vector of OlinkIDs on which to perform the post-hoc analysis. If not specified, all assays in df are used.
  • variable: Single character value or character array. A single character value should name a column in df. If length > 1, the included variable names are used in crossed analyses. The notations ‘:’ and ‘*’ are also accepted.
  • covariates: Single character value or character array. Default: NULL. Confounding factors to include in the analysis. A single character value should name a column in df. The notations ‘:’ and ‘*’ are also accepted; crossed analysis is not inferred from main effects.
  • outcome: Name of the column from df that contains the dependent variable. Default: NPX.
  • effect: Term on which to perform the post-hoc analysis. Character vector. Must be a subset of or identical to variable.
  • mean_return: Logical. If TRUE, returns the mean of each factor level rather than the difference in means (default). Note that no p-value is returned for mean_return = TRUE and no adjustment is performed.
  • verbose: Logical. Default: TRUE. Whether information about removed samples, factor conversion and the final model formula should be printed to the console.
# Two-way Ordinal Regression
ordinalRegression_results <- olink_ordinalRegression(df = npx_data1,
                             variable="Treatment:Time")
# extracting the significant proteins
significant_assays <- ordinalRegression_results %>% 
  filter(Threshold == 'Significant' & term == 'Treatment:Time') %>%
  select(OlinkID) %>%
  distinct() %>%
  pull()
# Post-hoc test on the Treatment:Time interaction term, with Site as a covariate
ordinalRegression_posthoc_results <- olink_ordinalRegression_posthoc(npx_data1, 
                                                                     variable=c("Treatment:Time"),
                                                                     covariates="Site",
                                                                     olinkid_list = significant_assays,
                                                                     effect = "Treatment:Time")
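The ‘:’ and ‘*’ notations mentioned for the variable argument have a standard meaning in R model formulas, which the base R sketch below makes explicit. How the package turns its variable argument into a model formula internally is not shown here and should be treated as an assumption; the formulas are only inspected, not fitted.

```r
# Term labels implied by each notation in an R model formula
interaction_only <- attr(terms(NPX ~ Treatment:Time), "term.labels")
crossed          <- attr(terms(NPX ~ Treatment * Time), "term.labels")

# ':' requests the interaction term alone
print(interaction_only)

# '*' expands to both main effects plus their interaction
print(crossed)
```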

Function output

A tibble with the following columns:

  • Assay <chr>: Assay name.
  • OlinkID <chr>: Unique Olink ID.
  • UniProt <chr>: UniProt ID.
  • Panel <chr>: Olink Panel.
  • term <chr>: Name of the variable that was used for the p-value calculation. The “:” between variables indicates interaction between variables.
  • contrast <chr>: Levels (of term) that are compared.
  • estimate <dbl>: Difference in mean NPX between the levels compared (from contrast).
  • Adjusted_pval <dbl>: Adjusted p-value for the test (Tukey).
  • Threshold <chr>: Text indication if assay is significant (adjusted p-value < 0.05).

Post-hoc linear mixed effects model analysis (olink_lmer_posthoc)

The olink_lmer_posthoc function is similar to olink_lmer but performs a post-hoc analysis based on a linear mixed effects model, using the function lmer from the R library lmerTest and the function emmeans from the R library emmeans. The function handles both factor and numerical variables and/or covariates. Differences in estimated marginal means are calculated for all pairwise levels of a given output variable. Degrees of freedom are estimated using Satterthwaite’s approximation. The post-hoc test for a numerical variable compares the difference in means of the outcome variable (default: NPX) for 1 standard deviation (SD) difference in the numerical variable, e.g. mean NPX at mean(numerical variable) versus mean NPX at mean(numerical variable) + 1*SD(numerical variable). The output tibble is arranged by ascending adjusted p-values.

Function arguments

  • df: NPX data frame in long format. It should minimally contain protein name (Assay), OlinkID, UniProt, Panel, 1-2 variables with at least 2 levels, and subject ID.
  • variable: Single character value or character array. A single character value should name a column in df. If length > 1, the included variable names are used in crossed analyses. The notations ‘:’ and ‘*’ are also accepted.
  • olinkid_list: Character vector of OlinkIDs on which to perform the post-hoc analysis. If not specified, all assays in df are used.
  • effect: Term on which to perform the post-hoc analysis. Character vector. Must be a subset of or identical to variable.
  • outcome: Name of the column from df that contains the dependent variable. Default: NPX.
  • random: Single character value or character array with random effects.
  • covariates: Single character value or character array. Default: NULL. Confounding factors to include in the analysis. A single character value should name a column in df. The notations ‘:’ and ‘*’ are also accepted; crossed analysis is not inferred from main effects.
  • mean_return: Logical. If TRUE, returns the mean of each factor level rather than the difference in means (default). Note that no p-value is returned for mean_return = TRUE and no adjustment is performed.
  • verbose: Logical. Default: TRUE. Whether information about removed samples, factor conversion and the final model formula should be printed to the console.
# Linear mixed model with two variables.
lmer_results_twoway <- olink_lmer(df = npx_data1, 
                                  variable = c('Site', 'Treatment'),
                                  random = 'Subject')
# extracting the significant proteins
lmer_results_twoway_significant <- lmer_results_twoway %>%
  filter(Threshold == 'Significant', term == 'Treatment') %>%
  pull(OlinkID)
# performing post-hoc analysis
lmer_posthoc_twoway_results <- olink_lmer_posthoc(df = npx_data1,
                                                  olinkid_list = lmer_results_twoway_significant,
                                                  variable = c('Site', 'Treatment'),
                                                  random = 'Subject',
                                                  effect = 'Treatment') 

Function output

A tibble with the following columns:

  • Assay <chr>: Assay name.
  • OlinkID <chr>: Unique Olink ID.
  • UniProt <chr>: UniProt ID.
  • Panel <chr>: Olink Panel.
  • term <chr>: Name of the variable that was used for the p-value calculation. The “:” between variables indicates interaction between variables.
  • contrast <chr>: Levels (of term) that are compared.
  • estimate <dbl>: Difference in mean NPX between the levels compared (from contrast).
  • conf.low <dbl>: Lower bound of the confidence interval for the estimate.
  • conf.high <dbl>: Upper bound of the confidence interval for the estimate.
  • Adjusted_pval <dbl>: Adjusted p-value for the test (Benjamini & Hochberg).
  • Threshold <chr>: Text indication if assay is significant (adjusted p-value < 0.05).

Visualization

Theming function (set_plot_theme)

This function applies a coherent theme to a plot when it is added to a ggplot object. It is mainly used for aesthetic reasons.

npx_data1 %>% 
  filter(OlinkID == 'OID01216') %>% 
  ggplot(aes(x = Treatment, y = NPX, fill = Treatment)) +
  geom_boxplot() +
  set_plot_theme()