Fiscore Package: Effective protein structural data visualisation and exploration

Austė Kanapeckaitė

8/10/2021

library(Fiscore)

1. Package Overview

Fiscore is an R package developed to quickly take advantage of protein topology assessment and perform various analyses or integrate derived scores into relational databases as well as machine learning pipelines. The package builds on protein structure and topology studies [1]⁠,⁠ which led to the derivation of Fi-score or mathematical equation that captures dihedral angle and B-factor parameter influence on amino acid residues in a protein. Target evaluation is paramount for rational therapeutics development [2,3]⁠. For example, to successfully engineer therapeutic antibodies, it is necessary to characterise potential binding or contact sites on target proteins [4,5]⁠. Moreover, translating structural data into parameters that can be used to classify targets, store target-ligand information, or perform machine learning tasks could significantly improve our ability to assess new targets and develop novel therapeutics [2,6,7]⁠. As a result, Fi-score, a first-of-its-kind in silico protein fingerprinting approach based on the dihedral angle and B-factor distribution, was used to develop an integrative package to probe binding sites and sites of structural importance [1]⁠. Fiscore package provides customisable options to explore targets of interest as well as interactive plots to assess individual amino acid parameters. Moreover, a user-friendly machine learning pipeline for Gaussian Mixture Models (GMM) allows a robust structural analysis.

2. Package functions and tutorial

Figure 1. Schematic representation of package functions and specific analyses. Curved arrow indicates that additional information might be supplied for density plots from the cluster identification function.

2.1. Preprocessing PDB files

Function PDB_process takes a PDB file name which can be expressed as ‘6KZ5.pdb’ or ‘path/to/the/file/6KZ5.pdb’. One of the functions dependencies is package Bio3D [8,9] that helps with PDB file preparation. In addition, another parameter that you can supply is ‘path’. Path can point to a directory where to split PDB into separate chain files (necessary for the downstream analyses) or you can leave a default option to create a folder in your working directory. Function PDB_process returns a list of split chain names. If you split multiple PDB files in a loop it will be continuously added to the same folder. Please note, for the downstream processing PDB files need to be split so that separate chains can be analysed independently.

pdb_path<- system.file("extdata", "6kz5.pdb", package="Fiscore")


pdb_df<-PDB_process(pdb_path)
##    PDB has ALT records, taking A only, rm.alt=TRUE
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |======================================================================| 100%
pdb_df
## [1] "6kz5_A.pdb" "6kz5_B.pdb"

2.2. Preparing PDB files

Function PDB_prepare prepares a PDB file after it was pre-processed to generate Fi-score and normalised B-factor values as well as secondary structure designations. The function takes a PDB file name that was split into chains, e.g. ‘6KZ5_A.pdb’. The file is cleaned and only the complete entries for amino acids are kept for the analysis, e.g. amino acids from the terminal residues that do not contain both dihedral angles are removed. The function returns a data frame with protein secondary structure designation ‘Type’, Fi- score values ‘Fi_score’, as well as normalised B-factor values ‘B_normalised’.

It is important to comment on the use of dihedral angles, B-factor, and the derivation of Fi-score. To capture protein structural and topological features that could help with the identification or characterisation of a binding pocket or any other relevant site, it is important to extract and combine data from structural files to allow such information integration [2,10,11]⁠. Protein dihedral angles were used as they contain information on the local and global protein conformation where protein backbone conformation can be highly accurately recreated based on the dihedral angles [1,12–14]⁠. Moreover, since Ramachandran plot, which plots φ (phi) and ψ (psi) angle distributions for individual residues, allows only a holistic description of conformation and cannot be integrated with traditional parametric or non-parametric density estimation methods, specific transformation was required to use this data. An additional parameter, namely the oscillation amplitudes of the atoms around their equilibrium positions (B-factors) in the crystal structures, was used. B-factors contain a lot of information on the overall structure, for example, these parameters depend on conformational disorder, thermal motion paths, associations with the rotameric state of amino acids side-chains, and they also show dependence on the three-dimensional structure as well as protein flexibility [1,13,15–19]⁠. Normalised dihedral angles (standard deviation scaling to account for variability) and scaled B-factors (min-max scaling) (Eq.1) were integrated into the Fi-score (Eq.2). It is important to highlight that B-factors need to be scaled so that different structural files can be compared [1]⁠.

Equation 1. Min-max normalisation and scaling of B-factor where Bi-norm is scaled B-factor, Bi - B-factor for Cα, Bmax - largest B-factor value for the total protein B-factors for all Cα, Bmin - smallest B-factor value for the total protein B-factors for all Cα. B-factor normalisation is based on the full length protein.

Equation2. Fi-score evaluation where N- is the total number of atoms for which dihedral angle information is available, φ and ψ values represent dihedral angles for an Cα atom, σφ and σψ represent corresponding standard deviations for the torsion angles and Bi-norm is a normalised B- factor value for the Cα atom. B-factor, σφ and σψ normalisation is based on the full length protein.

To illustrate the function’s output, a Nur77 protein (6KZ5 chain A ) was selected as a case study. FUnction PDB_prepare prepares Nur77 protein, 6KZ5 chain A, and the output includes the extracted data, including normalised B-factor score ‘B-normalised’, ‘Fi_score’ and secondary structure assignment ‘Type’.

pdb_path<-system.file("extdata", "6kz5_A.pdb", package="Fiscore")
pdb_df<-PDB_prepare(pdb_path)
## Warning in bio3d::clean.pdb(pdb_file_temp, consecutive = TRUE, force.renumber =
## FALSE, : PDB is still not clean. Try fix.chain=TRUE and/or fix.aa=TRUE
head(pdb_df)
##                 phi       psi       chi1      chi2 chi3 chi4 chi5 df_resno
##  31.A.ALA  54.94701  53.80667         NA        NA   NA   NA   NA       31
##  32.A.ASN -60.91976 -18.01379 -157.93838 -76.10748   NA   NA   NA       32
##  33.A.LEU -64.18792 -45.94048  176.74103  46.17107   NA   NA   NA       33
##  34.A.LEU -65.44501 -38.46630  -88.61643 158.49518   NA   NA   NA       34
##  35.A.THR -67.68541 -43.43184   75.18019        NA   NA   NA   NA       35
##  36.A.SER -76.88914 -17.40880  157.45159        NA   NA   NA   NA       36
##           df_res B_factor B_normalised   Fi_score                     Type
##  31.A.ALA    ALA    30.07   0.35506724 0.37663226 Right-handed alpha helix
##  32.A.ASN    ASN    15.68   0.18227666 0.07176640 Right-handed alpha helix
##  33.A.LEU    LEU     7.79   0.08753602 0.09261096 Right-handed alpha helix
##  34.A.LEU    LEU     9.85   0.11227185 0.10140389 Right-handed alpha helix
##  35.A.THR    THR    20.79   0.24363593 0.25696344 Right-handed alpha helix
##  36.A.SER    SER    23.69   0.27845821 0.13372748 Right-handed alpha helix

Please note that Bio3D package (dependency for this function) might issue a warning if your chain contains breakages. You can proceed with downstream analyses as this will be accounted for.

2.3 Exploratory analysis and applications

2.3.1. Dihedral angle distribution plot (Ramachandran plot with densities)

Function phi_psi_plot requires a PDB data frame generated by PDB_prepare. This function provides a binned density plot for dihedral angle distributions. Please note, you may get a geom_hex warning for removed lines if some of the bins clash with the boundaries. This is a geom_hex conflict and does not affect visualisation.

Nur77 protein, 6KZ5 chain A, a phi/psi value density plot.

phi_psi_plot(pdb_df)
## Warning: Removed 1 rows containing missing values (geom_hex).

2.3.2. Dihedral angle distribution plot

Function phi_psi_bar_plot plots dihedral angle distribution per amino acid using a bar plot. The supplied data frame must be generated by PDB_prepare. Some PDB files have breakages in their amino acid sequences; that is, some residues might be missing and it will be reflected in the plot via empty spaces. The plot allows the user to interactively take snapshots, zoom in to a specific region and check each bar’s data, i.e., residue number and the angle value.

Nur77 protein, 6KZ5 chain A, a phi/psi value bar plot.

phi_psi_bar_plot(pdb_df)

2.3.3. Dihedral angle distribution and secondary structure interactive plot

Function phi_psi_interactive plots a scatter plot with a secondary structure element visualisation based on the PDB file data; please note, NA refers to an unidentified region, e.g. a likely disordered region. The function requires a PDB data frame generated by PDB_prepare. You can interactively explore this plot by hovering your pointer on each data point.

Nur77 protein, 6KZ5 chain A, a phi/psi interactive scatter plot.

phi_psi_interactive(pdb_df)

2.3.4. Interactive scaled B-factor distribution

Function B_plot_normalised can be used to plot B-factor normalised values per amino acid using a bar plot; please note, some PDB files have breakages in their amino acid sequence, that is some residues might be missing and the gaps will be reflected in the plot. The function requires a PDB data frame generated by PDB_prepare.

Nur77 protein, 6KZ5 chain A, interactive scaled B-factor distribution.

B_plot_normalised(pdb_df)

2.3.5. Interactive 3D plot visualising dihedral angles and scaled B-factor distribution

Function phi_psi_3D plots a 3D scatter plot with a secondary structure element visualisation based on the PDB file data. The plot includes information, such as phi and psi dihedral angles as well as normalised B-factor values. Required parameter - a PDB data frame generated by PDB_prepare

Nur77 protein, 6KZ5 chain A, a phi/psi and normalised B-factor interactive 3D scatter plot.

phi_psi_3D(pdb_df)

2.3.6. Interactive Fi-score plot visualising score distribution

Function Fi_score_plot plots Fi-score values per amino acid using a bar plot; please note, some PDB files have breakages in their amino acid sequence, that is some residues might be missing and the gaps will be reflected in the plot. The function needs a data frame generated by PDB_prepare.

Nur77 protein, 6KZ5 chain A, a Fi-score interactive 3D scatter plot.

Fi_score_plot(pdb_df)

2.3.7. Calculate Fi-score for an individual region

Function Fi_score_region calculates a combined Fi-score for a selected region; please note, some PDB files have breakages in their amino acid sequences and those values cannot be assessed. Moreover, values can be calculated either inclusively or not; ‘include’ is set to FALSE by default. Inclusive or not refers to whether the Fi-score is calculated based on the peptide bonds (not inclusive) or consider amino acids (inclusive) (Eq. 2). This analysis can be helpful if you need to classify specific regions and compare across multiple targets or extract specific region data and store the scores in relational databases.

Fi_score_region(pdb_df,50,70)
## [1] -1.66764

2.3.8. Fi-score plot visualising secondary structure elements

Fiscore_secondary plots a bar plot with a secondary structure element visualisation based on the PDB file data; note, NA refers to an unidentified region, e.g. a likely disordered or unstructured region.

Nur77 protein, 6KZ5 chain A, a Fi-score secondary structure element visualisation.

Fiscore_secondary(pdb_df)

2.3.9. Hydrophobicity plot

Function hydrophobicity_plot provides a plot of amino acid sequence hydrophobicity profile using Kyte-Doolittle hydrophobicity scale [20,21]⁠. The Kyte-Doolittle scale is used for detecting hydrophobic regions in proteins. Regions with a positive value are hydrophobic and those with negative values are hydrophilic. This scale can be used to identify both surface-exposed regions as well as transmembrane regions, depending on the window size used. Function requires a PDB data frame generated by PDB_prepare. ‘Window’ parameter is needed to determine the size of a window for hydrophobicity calculations. The selection must be odd numbers between 3 and 21, default is 21.’ Weight’ parameter is needed to establish relative weight of the window edges compared to the window center (in %); default=100. ‘Model’ determines weather weights are calculated linearly (y=k * x+b) or exponentially (y=a * b^x); default=‘linear’. Please, note the terminal amino acids that cannot be included into the window are added without weighing; that is, their assigned hydrophobicity values by Kyte-Doolittle scale are used instead. The plot values are all scaled so that different proteins can be compared.

Nur77 protein, 6KZ5 chain A, a scaled hydrophobicity plots with different parameters.

hydrophobicity_plot(pdb_df,window = 9,weight = 25,model = "linear")

hydrophobicity_plot(pdb_df,window = 9,weight = 25,model = "exponential")

2.3.10. Gaussian Mixture Models for structural feature prediction and classification

Function cluster_ID can be used to group structural features using the Fi-score via Gaussian Mixture Models (GMM). This function also automatically selects an optimal number of clusters and a model to be fitted during the (expectation maximisation) EM phase of clustering for GMM. The function provides summaries and also helps to visualise clusters based on the Fi-score using scatter plots and dimension reduction plots. Please do not forget to set seed for reproducibility before using this function. Required parameters include a data frame containing a processed PDB file with Fi-score values, a number of clusters to consider during model selection; the default setting is 20 clusters (‘max_range’), a ‘secondary_structures’ parameter which is needed for the inclusion of the secondary structure element information from the PDB file when plotting; the default value is TRUE. Users also have an option to select a cluster number to test ‘clusters’ together with ‘modelNames’. Please note, both of the optional entries need to be selected and defined, if the user wants to test other clustering options that were not provided by the automated BIC output. This is an advanced option and the user should assess the BIC output to decide which model and what cluster number he or she wants to try out. The function will plot best Bayesian information criterion values together with the selected model, e.g. 6 clusters and VVI and the BIC plot. You can read more about these functions in mclust package documentation [22]⁠.

A dimension reduction analysis for visualizing the clustering or classification structure that was obtained from Gaussian densities is also reported. The method relies on reducing the dimensionality by identifying a set of linear combinations which are ordered by importance as quantified by the associated eigenvalues of the original features [22]⁠. This function helps to explore how well clustering captures the characteristics of the data.

df<-cluster_ID(pdb_df)

## Best BIC values:
##          VVI,6        VVE,7        VVE,6
## BIC      -2962 -2971.310677 -2971.756243
## BIC diff     0    -9.310508    -9.756073
## ----------------------------------------------------------------- 
## Dimension reduction for model-based clustering and classification 
## ----------------------------------------------------------------- 
## 
## Mixture model type: Mclust (VVI, 6) 
##         
## Clusters  n
##        1 15
##        2 16
##        3 41
##        4 87
##        5 39
##        6 34
## 
## Estimated basis vectors: 
##                      Dir1      Dir2
## Residue_number 0.00047634  0.056189
## Fi_score       0.99999989 -0.998420
## 
##                 Dir1      Dir2
## Eigenvalues  0.74861   0.55583
## Cum. %      57.38945 100.00000

df<-cluster_ID(pdb_df,clusters = 5, modelNames = "VVI")

## Best BIC values:
##          VVI,6        VVE,7        VVE,6
## BIC      -2962 -2971.310677 -2971.756243
## BIC diff     0    -9.310508    -9.756073
## ----------------------------------------------------------------- 
## Dimension reduction for model-based clustering and classification 
## ----------------------------------------------------------------- 
## 
## Mixture model type: Mclust (VVI, 5) 
##         
## Clusters  n
##        1 15
##        2 55
##        3 87
##        4 41
##        5 34
## 
## Estimated basis vectors: 
##                      Dir1     Dir2
## Residue_number -0.0063543 0.077003
## Fi_score        0.9999798 0.997031
## 
##               Dir1     Dir2
## Eigenvalues  1.016   0.4762
## Cum. %      68.088 100.0000

For GMM clustering, the user also has an option to select detailed cluster mapping (secondary_structures=TRUE; Fig.2 ) where both cluster and secondary structure elements will be visualised or one can opt out of cluster visualisation (secondary_structures=FALSE; Fig. 3). Cluster visualisation should help identify structural and functional regions. This analysis can also be done on a specified region in a protein, e.g. a specific domain. For regional analysis the data frame supplied for the function only needs to contain the data for the region of interest that was pre-filtered.

Figure 2. Nur77 protein, 6KZ5 chain A, GMM Fi-score cluster visualisation with secondary structure visualisation (interactive).

Figure 3. Nur77 protein, 6KZ5 chain A, GMM Fi-score cluster visualisation without secondary structure visualisation (interactive).

2.3.11. Density plots

Function density_plots provides a density plot set for phi/psi angle distributions, Fi-score, and normalised B-factor. 3D visualisation of angle distribution for every residue is also included. The plots can be used for a quick assessment of the overall parameters or to summarise the observations. Function requires a PDB data frame generated by PDB_prepare. You have an option to provide ‘model_report’ from cluster_ID function to incorporate analysis on how well clusters differentiate between secondary structure elements or how similar these elements are.

Nur77 protein, 6KZ5 chain A, summary plots.

density_plots(pdb_df)

Nur77 protein, 6KZ5 chain A, summary plots including data from cluster_ID.

density_plots(pdb_df, df)

3. Discussion

The described methodology and the introduced package can be integrated with other tools and machine learning models. This analysis could aid in the identification of new target families. The described fingerprinting is a novel way to capture protein data since it does not rely on the sequence information but actually measures physicochemical properties of sites of interest. Fiscore package provides both analytical and visualisation tools to interactively assess protein structural files and capture relevant amino acid interactions over a selected span of a protein sequence. This strategy can become especially relevant in the pharmaceutical industry and drug discovery as more complex targets and protein-protein interactions need to be assessed [1-5]. Fiscore package contains a number of useful functions that, for example, can be used to plot interactive Ramachandran plots or 3D visualisations with normalised B-factor values. Fiscore package also introduces an easy implementation of machine learning that does not require prior knowledge. Moreover, the described hydrophobicity evaluation and plotting functionality offers a new tool for protein structure assessment connecting the hydrophobic features with the secondary structure elements. This could be useful in deciding what region might be a good binding pocket for a drug. Finally, Fi-scores from multiple targets or selected protein domains can be aggregated and used in relational databases to associate a specific physicochemical protein feature with additional experimental readouts or bioassay data.

4. Package comments

The package has the following dependencies: Bio3D, plotly, ggplot2, mclust, stringr, lattice, stats, methods, dplyr,knitr, rmarkdown.

Please note that plotly is used to build interactive plots which will appear on the Viewer tab in RStudio.

Version: v0.1.3

Github page: https://github.com/AusteKan/Fiscore

Inquiries: auste.kan [at] algorithm379 [.] com

5. References

  1. Kanapeckaitė A, Beaurivage C, Hancock M & Verschueren E (2020) Fi-score: a novel approach to characterise protein topology and aid in drug discovery studies. J Biomol Struct Dyn, 1–11.

  2. Fauman EB, Rai BK & Huang ES (2011) Structure-based druggability assessment-identifying suitable targets for small molecule therapeutics. Curr Opin Chem Biol 15, 463–468.

  3. Schmidtke P & Barril X (2010) Understanding and predicting druggability. A high-throughput method for detection of drug binding sites. J Med Chem 53, 5858–5867.

  4. Shin S-M, Choi D-K, Jung K, Bae J, Kim J, Park S, Song K-H & Kim Y-S (2017) Antibody targeting intracellular oncogenic Ras mutants exerts anti-tumour effects after systemic administration. Nat Commun 8, 15090.

  5. Norman RA, Ambrosetti F, Bonvin AMJJ, Colwell LJ, Kelm S, Kumar S & Krawczyk K (2020) Computational approaches to therapeutic antibody design: Established methods and emerging trends. Brief Bioinform 21, 1549–1567.

  6. Obradovic Z, Peng K, Vucetic S, Radivojac P, Brown CJ & Dunker AK (2003) Predicting Intrinsic Disorder From Amino Acid Sequence. In Proteins: Structure, Function and Genetics pp. 566–572.

  7. Weisel M, Proschak E, Kriegl JM & Schneider G (2009) Form follows function: Shape analysis of protein cavities for receptor-based drug design. Proteomics 9, 451–459.

  8. Grant BJ, Rodrigues APC, Elsawy KM, Mccammon JA & Caves LSD (2006) Bio3d: an R package for the comparative analysis of protein structures. 22, 2695–2696.

  9. Bio3d: an R package for the comparative analysis of protein structures | Bioinformatics | Oxford Academic.

  10. Fuller JC, Burgoyne NJ & Jackson RM (2009) Predicting druggable binding sites at the protein-protein interface. Drug Discov Today 14, 155–161.

  11. Bentham Science Publisher BSP (2006) Methods for the Prediction of Protein-Ligand Binding Sites for Structure-Based Drug Design and Virtual Ligand Screening. Curr Protein Pept Sci 7, 395–406.

  12. Saravanan KM & Selvaraj S (2017) Dihedral angle preferences of amino acid residues forming various non-local interactions in proteins. J Biol Phys 43, 265.

  13. Zhou AQ, O’Hern CS & Regan L (2011) Revisiting the Ramachandran plot from a new angle. Protein Sci 20, 1166–1171.

  14. Faraggi E, Xue B & Zhou Y (2009) Improving the prediction accuracy of residue solvent accessibility and real-value backbone torsion angles of proteins by guided-learning through a two-layer neural network. Proteins Struct Funct Bioinforma 74, 847–856.

  15. Radivojac P (2004) Protein flexibility and intrinsic disorder. Protein Sci 13, 71–80.

  16. Yin H, Li Y-Z & Li M-L (2011) On the Relation Between Residue Flexibility and Residue Interactions in Proteins. Protein Pept Lett 18, 450–456.

  17. Lüdemann SK, Carugo O & Wade RC (1997) Substrate access to cytochrome P450cam: A comparison of a thermal motion pathway analysis with molecular dynamics simulation data. J Mol Model 3, 369–374.

  18. Hartmann H, Parak F, Steigemann W, Petsko GA, Ponzi DR & Frauenfelder H (1982) Conformational substates in a protein: structure and dynamics of metmyoglobin at 80 K. Proc Natl Acad Sci U S A 79, 4967–4971.

  19. Carugo O (2018) How large B-factors can be in protein crystal structures. BMC Bioinformatics 19.

  20. Zhao G & London E (2006) An amino acid “transmembrane tendency” scale that approaches the theoretical limit to accuracy for prediction of transmembrane helices: Relationship to biological hydrophobicity. Protein Sci 15, 1987–2001.

  21. Kyte J & Doolittle RF (1982) A simple method for displaying the hydropathic character of a protein. J Mol Biol 157, 105–132.

  22. Scrucca L (2013) Graphical tools for model-based mixture discriminant analysis. Adv Data Anal Classif 2013 82 8, 147–165.