Biopeak R Package Implementation

David Lauenstein

2019-08-09

1 Overview

The Biopeak R package enables the user to systematically identify and visualize impulse-like gene expression changes within short genomic series experiments. On a systems level, this tool allows to characterize dynamic gene expression signatures underlying biological processes. In order to detect such activation peaks, the gene expression is treated as a signal that propagates along the experimental axis (time, temperature or other series conditions).

2 Peak Detection with the Biopeak package

To demonstrate the functionality of this package, we are using a dataset comprising transcriptional profiles of human epithelial cells in response to heat available on GEO (GSE7458) (J. M. Laramie 2008). In vitro cultured HEp2 cells were heated at 37 to 43 °C for 60 min and microarray gene expression profiles were acquired at 37, 40, 41, 42 and 43 °C.

In a first step, we load the heat-shock data:

library(Biopeak)
# load the heat-shock data
exprmat <- heat
# Show first rows and columns of the expression matrix
head(exprmat)
##            Hep2_37    Hep2_40    Hep2_41   Hep2_42    Hep2_43
## DAZL      25.64188   26.44370   48.65872  47.51223   17.54344
## RAD23A   776.89917  609.15550  602.97183 823.58900  502.01850
## RAP1B   1033.05450 1232.86550 1028.10150 900.49950 1074.90267
## ARHGAP6   35.05837   53.72763   26.60629  31.21045   29.21352
## GYPB     157.53467  167.29190  159.62667 157.61000  196.08050
## UBE2N    464.30233  517.65315  549.37597 682.46653  521.46451
# Transform data frame to numeric matrix
exprmat <- as.matrix(exprmat)

One of the strengths of the peak detection algorithm lays in its great flexibility to process data source-independently. While the main focus of this package is aimed at the characterization of genomic data, in theory, any type of series experiment that can be described as a signal could be analyzed. Since peaks are assessed for each gene relative to its expression over time or over the course of different stimuli strenghts, prior normalization steps are usualy not required. Therefore, we can directly subject the Rna-Seq or microarray count expression matrix to the peak detection algorithm: the peakDetection function.

The peakDetection function takes as input the expression matrix and the series description, a numeric vector which defines the conditions/time-points of sample acquisition. In order to identify biomarkers which are specific to a distinct phase of the observed biological process, we provide a set of tunable filters that define the characterstics of a peak. Peak characteristics that have to be defined are:

Furthermore, the function supports optional parameters for more complex operations:

In this case study, we are primarily interested in finding genes that peak at a single condition, in order to identify gene activation thresholds under heat-shock conditions. Thus, we set the prominence parameter to 1.3, to select for genes that feature a peak that is at least 30 % higher in magnitude than any other peak. Moreover, we want to filter out peaks with low heights relative to the mean expression. Therefore the actstrength parameter is set to 1.5, selecting for peaks that feature at least 50 % higher expression than the mean expression of that gene. Lastly, we selected only the top 50 % expressed genes, to reduce the computational load.

# define condition series
series <- c(37,40,41,42,43)
# Find the expression limit for the 50 % highest expressing genes and selec them
exprlim <- quantile(exprmat, probs <- seq(0,1,0.01))[51]
exprmat <- exprmat[which(rowMeans(exprmat) > exprlim),]
# run the peak detection algorithm
peakdet <- peakDetection(exprmat, series, type ='rnaseq', actstrength = 1.5, prominence = 1.3,
                         bgcorr = F)

The peakDetection function returns a set of vectors and matrices:

3 Discovering the temporal regulation of activated biomarkers

The output of the peakDetection function provides a format that facilitates queries on a gene-level and systems-level.

# number of genes detected by the peakDetection function
length(peakdet$peakgenes)
## [1] 205
# filter for genes which peak at 43 °C
peakdet$peakgenes[which(peakdet$peakloc == 43)]
##  [1] "CNNM2"                            "SFI1"                            
##  [3] "PNN"                              "KRT7"                            
##  [5] "BASP1"                            "RAF1"                            
##  [7] "GBA /// GBAP1 /// LOC100510710"   "GADD45B"                         
##  [9] "MKL1"                             "CDK13"                           
## [11] "NRG1"                             "XKRY /// XKRY2"                  
## [13] "H3F3A /// H3F3B /// MIR4738"      "RCAN2"                           
## [15] "LOC100507577 /// LONP2"           "SLC7A5"                          
## [17] "RPL23 /// RPL23P8 /// SNORA21"    "AKR7A2"                          
## [19] "ZNF787"                           "NUP188"                          
## [21] "ACTA2"                            "APOF"                            
## [23] "HOXD13"                           "BSN"                             
## [25] "PPAP2B"                           "CAPN1"                           
## [27] "CNKSR2"                           "USP6NL /// USP6NL-IT1"           
## [29] "FLRT2 /// LOC100506718"           "TNFRSF8"                         
## [31] "MAGEC1"                           "ARSF"                            
## [33] "BRS3"                             "C2CD2L"                          
## [35] "KRT18"                            "DLEC1"                           
## [37] "SYT17"                            "ANXA11"                          
## [39] "KIR3DL1 /// KIR3DL2"              "ARHGAP29"                        
## [41] "MXRA5"                            "IFI44L"                          
## [43] "HAAO"                             "CKMT2"                           
## [45] "DBI"                              "CUL3"                            
## [47] "DOPEY2"                           "CYTH2"                           
## [49] "TUBB2B"                           "TNFRSF14"                        
## [51] "C5orf45"                          "CAPRIN1"                         
## [53] "ITGA5"                            "GBF1"                            
## [55] "HEXIM1"                           "P2RX5"                           
## [57] "ID2"                              "CHD9"                            
## [59] "MT1F /// MT1G /// MT1M /// MT1P3" "GATA6"                           
## [61] "MSX2P1"                           "Fas"
# return the peakheight and location of the gene CDK13
c(peakdet$peakheight[which(peakdet$peakgenes == 'CDK13')],
  peakdet$peakloc[which(peakdet$peakgenes == 'CDK13')])
## [1] 4114.631   43.000
# assess the main peak neighborhood for sustained activation
peakdet$neighbors[which(peakdet$peakgenes == 'CDK13'),]
## [1] 0 4 0 0 1

The plotExpression function allows to plot the expression signal of individual genes:

3.1 Plot the expression signal of individual genes

\label{fig:plotexpression}Figure 1: Individual gene expression signal for CDK13. The dashed line marks the main peak location.

Figure 1: Individual gene expression signal for CDK13. The dashed line marks the main peak location.

3.2 Gene co-expression analysis

Gene co-expression analyses are a widely used tool in genomic studies. In this package we provide the getCormat function which calculates a gene-wise correlation matrix and plots a bi-clustered heatmap. The function returns both the heatmap object and the re-ordered correlation matrix:

## null device 
##           1
##                           TNNT1 ST3GAL5 CHD2
## RIOK3                      -0.6    -0.6 -0.6
## JMJD6                      -0.6    -0.6 -0.6
## IL23A /// TRBC2 /// TRBC2  -0.6    -0.6 -0.6
## IL1R1                      -0.6    -0.6 -0.6
## COPS5                      -0.6    -0.6 -0.6
##                           LOC101929087 /// SUMO2 /// SUMO3 HHLA1
## RIOK3                                                 -0.6  -0.7
## JMJD6                                                 -0.6  -0.7
## IL23A /// TRBC2 /// TRBC2                             -0.6  -0.7
## IL1R1                                                 -0.6  -0.7
## COPS5                                                 -0.6  -0.7

3.3 Identification of biomarker clusters with similar gene regulation

The peakDetection algorithm represents a powerful tool for the exhaustive identification of condition or time-point specific biomarkers. For the analysis of the temporal gene expression and given enough time-points, those time-points can be further grouped together to reflect waves of gene activation, such as an immediate, early, mid and late response. Since in most explorative studies the number of such waves underlying a given biological process is unknown, the Biopeak package provides a cluster exploration function: findClusters. The findClusters function estimates the number of genes with similar temporal regulation and supports three different clustering algorithms: kmeans, dbscan and hierarchical clustering. Clustering is based on a PCA projection of the input data and cluster assignment and parameters are returned by the function.

\label{fig:findclusters}Figure 3: Clusters of biomarkers with similar temporal regulation based on three different algorithms (hclust, dbscan and kmeans).

Figure 3: Clusters of biomarkers with similar temporal regulation based on three different algorithms (hclust, dbscan and kmeans).

3.4 Plot selected biomarkers in a heatmap

The plotHeatmap function represents a simple wrapper function that takes the output of the peakDetection function and normalizes the expression to log2 values for improved visualization before subjecting it to the heatmap.2 function. The clustermembers parameter is optional and allows to select for specific genes. Here, we take the output of the findClusters function and plot a heatmap for one of the clusters:

\label{fig:plotheatmap}Figure 4: Heatmap of the second cluster returned by the findClusters function. The second cluster reflects primarily genes involved in the immediate LPS-induced response.

Figure 4: Heatmap of the second cluster returned by the findClusters function. The second cluster reflects primarily genes involved in the immediate LPS-induced response.

3.5 Save the peak detection output to a text file

The function saveOutput saves the output of the peakDetetection funtion (peakgenes, peaklocation and peakheight) to a text file.

4 References

J. M. Laramie, B. Brownstein, T. P. Chung. 2008. Transcriptional Profiles of Human Epithelial Cells in Response to Heat. Shock.