2.B: JASPAR & rbioapi

Moosa Rezwani

2022-08-06


0.1 Introduction

Directly quoting from Fornes O, Castro-Mondragon JA, Khan A, et al.:

JASPAR (https://jaspar.genereg.net) is an open-access database of curated, non-redundant transcription factor (TF)-binding profiles stored as position frequency matrices (PFMs) for TFs across multiple species in six taxonomic groups. In this 8th release of JASPAR, the CORE collection has been expanded with 245 new PFMs (169 for vertebrates, 42 for plants, 17 for nematodes, 10 for insects, and 7 for fungi), and 156 PFMs were updated (125 for vertebrates, 28 for plants and 3 for insects). These new profiles represent an 18% expansion compared to the previous release.

source:
Fornes O, Castro-Mondragon JA, Khan A, et al. JASPAR 2020: update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2019; doi: 10.1093/nar/gkz1001


0.2 Data Organization in JASPAR

JASPAR is a database of transcription factor binding matrices with annotations and metadata. These entities are organized in a hierarchical fashion that we will explore next.

0.2.1 Releases

In addition to the latest JASPAR database release (2020), other active releases are also available. Most of the rbioapi JASPAR functions have a release argument that allows you to use other database releases.

## Call the function without any arguments to get a list of releases
releases <- rba_jaspar_releases()
## Supply a release number for details:
release_7_info <- rba_jaspar_releases(7)

0.2.2 Collections

Within a release, matrix profiles are organized into collections. You can use rba_jaspar_collections() to get a list of available collections, or read the “JASPAR Collections” section of the documentation page on the JASPAR website for a thorough review.

## To get a list of available collections in release 2020:
rba_jaspar_collections(release = 2020)
#>          name                                                        url
#> 1        CORE        https://jaspar.genereg.net/api/v1/collections/CORE/
#> 2 UNVALIDATED https://jaspar.genereg.net/api/v1/collections/UNVALIDATED/


## You can list information of all matrices available in a collection:
mat_in_core_2020 <- rba_jaspar_collections_matrices(collection = "CORE")

0.2.3 Taxonomic Groups

Within each collection, the matrix profiles are organized based on main taxonomic groups:

## To get a list of taxonomic groups in release 2020:
rba_jaspar_taxons(release = 2020)
#>             name                                                    url
#> 1         plants        https://jaspar.genereg.net/api/v1/taxon/plants/
#> 2    vertebrates   https://jaspar.genereg.net/api/v1/taxon/vertebrates/
#> 3        insects       https://jaspar.genereg.net/api/v1/taxon/insects/
#> 4   urochordates  https://jaspar.genereg.net/api/v1/taxon/urochordates/
#> 5      nematodes     https://jaspar.genereg.net/api/v1/taxon/nematodes/
#> 6          fungi         https://jaspar.genereg.net/api/v1/taxon/fungi/
#> 7     trematodes    https://jaspar.genereg.net/api/v1/taxon/trematodes/
#> 8  dictyostelium https://jaspar.genereg.net/api/v1/taxon/dictyostelium/
#> 9       cnidaria      https://jaspar.genereg.net/api/v1/taxon/cnidaria/
#> 10      oomycota      https://jaspar.genereg.net/api/v1/taxon/oomycota/


## You can list information of all matrices available in a taxonomic group:
mat_in_insects <- rba_jaspar_taxons_matrices(tax_group = "insects")

0.2.4 Species

Further down the data organization hierarchy, each taxonomic group consists of species:

## To get a list of species in release 2020:
species <- rba_jaspar_species(release = 2020)
head(species)
#>   tax_id                          species
#> 1   5037           Ajellomyces capsulatus
#> 2   4151                Antirrhinum majus
#> 3  81972 Arabidopsis lyrata subsp. lyrata
#> 4   3702             Arabidopsis thaliana
#> 5   9913                       Bos taurus
#> 6   6238          Caenorhabditis briggsae
#>                                                url
#> 1  https://jaspar.genereg.net/api/v1/species/5037/
#> 2  https://jaspar.genereg.net/api/v1/species/4151/
#> 3 https://jaspar.genereg.net/api/v1/species/81972/
#> 4  https://jaspar.genereg.net/api/v1/species/3702/
#> 5  https://jaspar.genereg.net/api/v1/species/9913/
#> 6  https://jaspar.genereg.net/api/v1/species/6238/
#>                                         matrix_url
#> 1  https://jaspar.genereg.net/api/v1/species/5037/
#> 2  https://jaspar.genereg.net/api/v1/species/4151/
#> 3 https://jaspar.genereg.net/api/v1/species/81972/
#> 4  https://jaspar.genereg.net/api/v1/species/3702/
#> 5  https://jaspar.genereg.net/api/v1/species/9913/
#> 6  https://jaspar.genereg.net/api/v1/species/6238/

## You can list information of all matrices available in a species:
mat_in_human <- rba_jaspar_species_matrices(tax_id = 9606)

0.3 Matrix Profiles

0.3.1 Search Matrix Profiles

Retrieving a list of every matrix available in a given category is not the only option. You can also build a search query using rba_jaspar_matrix_search. Because this is a search function, you are not required to fill in every argument; you may use any combination of arguments you see fit to build your query. You can even call the function without any arguments to get a list of all the matrix profiles. For instance:

## Get a list of all the available matrix profiles:
all_matrices <- rba_jaspar_matrix_search()

## Search FOX:
FOX_matrices <- rba_jaspar_matrix_search(term = "FOX")

## Transcription factors named FOXP3
FOXP3_matrices <- rba_jaspar_matrix_search(term = "FOXP3")

## Transcription factors of Zipper-Type Class
zipper_matrices <- rba_jaspar_matrix_search(tf_class = "Zipper-Type")

## Transcription factors of Zipper-Type Class in PBM collection
zipper_pbm_matrices <- rba_jaspar_matrix_search(tf_class = "Zipper-Type",
                                                collection = "PBM")

0.3.2 List Matrix Profiles Associated with a Base Identifier

Since JASPAR release 2010, the matrix profiles are versioned. A matrix profile identifier follows a “base_id.version” naming scheme; for example, “MA0600.2” corresponds to the second version of the matrix with base ID MA0600. You can use rba_jaspar_matrix_versions to get a list of matrix profiles with a given base ID. Also note that some functions, generally those used to list available matrices, have an argument called only_last_version.

## Get the matrix profile versions associated with a base ID
MA0600_versions <- rba_jaspar_matrix_versions("MA0600")
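Because identifiers follow the “base_id.version” pattern, it can be handy to split them into their two parts, for example to derive the base ID to pass to rba_jaspar_matrix_versions. The following is a minimal base-R sketch; split_matrix_id is a hypothetical helper written for this illustration and is not part of rbioapi.

```r
## Illustrative helper (not part of rbioapi): split versioned JASPAR IDs
## of the form "base_id.version" into their two components.
split_matrix_id <- function(ids) {
  # Expect an ID such as "MA0600.2": a dot-free prefix, a dot, then digits
  ok <- grepl("^[^.]+\\.[0-9]+$", ids)
  if (!all(ok)) {
    stop("Not in base_id.version format: ",
         paste(ids[!ok], collapse = ", "))
  }
  data.frame(
    matrix_id = ids,
    base_id = sub("\\.[0-9]+$", "", ids),          # drop ".version"
    version = as.integer(sub("^.*\\.", "", ids)),  # keep only the version
    stringsAsFactors = FALSE
  )
}

parsed <- split_matrix_id(c("MA0600.2", "MA0039.4"))
parsed
```

The base_id column can then be used wherever a base ID is expected, while an ID without a version suffix (such as "MA0600" alone) makes the helper stop with an error.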

0.3.3 Get a Matrix Profile

Now that you listed or searched for matrix profiles, you can use rba_jaspar_matrix to retrieve matrix profiles. There are two ways in which you can use this function:

0.3.3.1 Get Matrix and Annotations as an R Object

To do that, fill in only the matrix_id argument of rba_jaspar_matrix:

pfm_matrix <- rba_jaspar_matrix(matrix_id = "MA0600.2")

## you can find the matrix in the pfm element along with
## other elements which correspond to annotations and details
str(pfm_matrix)
#> List of 24
#>  $ collection   : chr "CORE"
#>  $ remap_tf_name: chr "RFX2"
#>  $ sites_url    : NULL
#>  $ source       : chr "23332764"
#>  $ versions_url : chr "https://jaspar.genereg.net/api/v1/matrix/MA0600/versions"
#>  $ matrix_id    : chr "MA0600.2"
#>  $ medline      : chr "8754849"
#>  $ tffm         :List of 7
#>   ..$ log_p_1st_order: num -6275
#>   ..$ experiment_name: chr "CistromeDB_58298"
#>   ..$ tffm_id        : chr "TFFM0576.1"
#>   ..$ base_id        : chr "TFFM0576"
#>   ..$ version        : int 1
#>   ..$ tffm_url       : chr "https://jaspar.genereg.net/api/v1/tffm/TFFM0576.1/"
#>   ..$ log_p_detailed : num -6660
#>  $ uniprot_ids  : chr "P48378"
#>  $ pazar_tf_ids : list()
#>  $ sequence_logo: chr "https://jaspar.genereg.net/static/logos/svg/MA0600.2.svg"
#>  $ name         : chr "RFX2"
#>  $ tfe_id       : list()
#>  $ tax_group    : chr "vertebrates"
#>  $ pubmed_ids   : chr "8754849"
#>  $ pazar_tf_id  : list()
#>  $ species      :'data.frame':   1 obs. of  2 variables:
#>   ..$ name  : chr "Homo sapiens"
#>   ..$ tax_id: int 9606
#>  $ class        : chr "Fork head/winged helix factors"
#>  $ type         : chr "HT-SELEX"
#>  $ tfe_ids      : list()
#>  $ base_id      : chr "MA0600"
#>  $ family       : chr "RFX-related factors"
#>  $ version      : int 2
#>  $ pfm          : num [1:4, 1:16] 1381 5653 4042 2336 270 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:4] "A" "C" "G" "T"
#>   .. ..$ : NULL
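The pfm element is an ordinary numeric matrix with rows A, C, G and T, so it can be processed with base R. As a sketch of a common next step, the code below converts counts to per-position probabilities and derives a simple consensus string. It uses a small made-up count matrix in place of pfm_matrix$pfm so that it runs without a network connection; the values are illustrative and are not taken from MA0600.2.

```r
## Toy 4 x 5 count matrix standing in for pfm_matrix$pfm (values are made up)
pfm <- matrix(
  c(10,  0,  0, 30,  5,
     0, 40,  0,  5,  5,
    25,  0,  0,  5, 25,
     5,  0, 40,  0,  5),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("A", "C", "G", "T"), NULL)
)

## Divide every column by its total so that each position sums to 1
ppm <- sweep(pfm, 2, colSums(pfm), "/")

## Most frequent base at each position (ties resolve to the first row)
consensus <- paste(rownames(pfm)[apply(pfm, 2, which.max)], collapse = "")

round(ppm, 2)
consensus
```

Replacing the toy matrix with pfm_matrix$pfm should apply the same two steps to a retrieved profile, assuming the element keeps the layout shown in the str() output.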

0.3.3.2 Save a Matrix as a File in a Specific Format

JASPAR provides position frequency matrices (PFMs) formatted as Raw PFM, JASPAR, TRANSFAC, YAML, and MEME. You can download a matrix profile as a file in any of these formats. To do that, use the file_format and save_to arguments of rba_jaspar_matrix. Two points are worth noting:

  1. In this case, the function saves your matrix as a file and returns the un-parsed content of the file as a character string.

  2. The save_to argument, in this function and throughout rbioapi in general, can be used in several ways:
    2.1. save_to = NA: rbioapi will automatically generate a file path under your working directory, save the file to that path, and inform you with a message.
    2.2. save_to = a file name without a path: rbioapi will save the file under your supplied name in your working directory.
    2.3. save_to = a directory path (without a file name): rbioapi will save the file with a proper name in that directory.
    2.4. save_to = a file path (i.e. ending with .extension): rbioapi will save the file exactly to this path. Make sure that the file extension of the path matches your requested file format. If it does not, rbioapi will still save the file with the extension supplied in the path, but will issue a warning.

    In any of the aforementioned cases, the file path can be absolute or relative.

## Different ways in which you can save the matrix file:
meme_matrix1 <- rba_jaspar_matrix(matrix_id = "MA0600.2",
                                  file_format = "meme")

meme_matrix2 <- rba_jaspar_matrix(matrix_id = "MA0600.2",
                                  file_format = "meme",
                                  save_to = "my_matrix.meme")

meme_matrix3 <- rba_jaspar_matrix(matrix_id = "MA0600.2",
                                  file_format = "meme",
                                  save_to = "c:/rbioapi")

meme_matrix4 <- rba_jaspar_matrix(matrix_id = "MA0600.2",
                                  file_format = "meme",
                                  save_to = "c:/rbioapi/my_matrix.meme")
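Case 2.4 asks that the extension of a full file path match the requested format. One way to check this yourself before calling rba_jaspar_matrix is sketched below. extension_matches_format is a hypothetical helper, not an rbioapi function, and it assumes that the expected extension equals the format name, as in the "my_matrix.meme" and file_format = "meme" pairing used in the examples; other formats may use different extensions.

```r
## Illustrative helper (not part of rbioapi): does the extension of a
## save_to path match the requested file_format?
## Assumption: the expected extension equals the format name.
extension_matches_format <- function(save_to, file_format) {
  # tools::file_ext() returns "" when the path has no extension
  ext <- tolower(tools::file_ext(save_to))
  nzchar(ext) && identical(ext, tolower(file_format))
}

full_path_ok <- extension_matches_format("c:/rbioapi/my_matrix.meme", "meme")
wrong_ext    <- extension_matches_format("my_matrix.txt", "meme")
directory    <- extension_matches_format("c:/rbioapi", "meme")

c(full_path_ok = full_path_ok, wrong_ext = wrong_ext, directory = directory)
```

A directory path has no extension, so the helper reports FALSE for it; that corresponds to case 2.3, where rbioapi chooses the file name itself.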

0.3.4 Get Binding Sites of a Matrix Profile

If available, you can retrieve information on the binding sites associated with a matrix profile. The information includes a data frame with the genomic coordinates of the binding sites and URLs to FASTA and BED files, along with other annotations.

## Get binding site of a matrix profile:
binding_sites <- rba_jaspar_sites(matrix_id = "MA0600.2")

0.4 TF flexible models (TFFMs)

JASPAR also stores and assigns identifiers to TF flexible models (TFFMs). Just like PFMs (position frequency matrices), you can search TFFMs or retrieve information and annotations using a TFFM identifier. TFFM IDs are versioned, meaning that they are in base_id.version format.

## Search TFFMs. This is a search function. Thus, what has been presented
## in `Search Matrix Profiles` section also applies here:

## Get a list of all the available TFFMs:
all_tffms <- rba_jaspar_tffm_search()

## Search FOX:
FOX_tffms <- rba_jaspar_tffm_search(term = "FOX")

## Transcription factors named FOXP3
FOXP3_tffms <- rba_jaspar_tffm_search(term = "FOXP3")

## Transcription factors of insects taxonomic group
insects_tffms <- rba_jaspar_tffm_search(tax_group = "insects")

## Now that you have a TFFM ID, you can retrieve it:
TFFM0056 <- rba_jaspar_tffm("TFFM0056.3")
str(TFFM0056)
#> List of 10
#>  $ matrix_id      : chr "MA0039.4"
#>  $ matrix_url     : chr "https://jaspar.genereg.net/api/v1/matrix/MA0039.4/"
#>  $ tffm_id        : chr "TFFM0056.3"
#>  $ matrix_base_id : chr "MA0039"
#>  $ base_id        : chr "TFFM0056"
#>  $ experiment_name: chr "CistromeDB_33718"
#>  $ version        : int 3
#>  $ matrix_version : int 4
#>  $ detailed       :List of 5
#>   ..$ log_p       : num -6854
#>   ..$ xml         : chr "https://jaspar.genereg.net/static/TFFM/TFFM0056.3/TFFM_detailed_trained.xml"
#>   ..$ dense_logo  : chr "https://jaspar.genereg.net/static/TFFM/TFFM0056.3/TFFM_detailed_trained_dense_logo.svg"
#>   ..$ hits        : chr "https://jaspar.genereg.net/static/TFFM/TFFM0056.3/TFFM_detailed_trained.hits.svg"
#>   ..$ summary_logo: chr "https://jaspar.genereg.net/static/TFFM/TFFM0056.3/TFFM_detailed_trained_summary_logo.svg"
#>  $ first_order    :List of 5
#>   ..$ log_p       : num -7420
#>   ..$ xml         : chr "https://jaspar.genereg.net/static/TFFM/TFFM0056.3/TFFM_first_order_trained.xml"
#>   ..$ dense_logo  : chr "https://jaspar.genereg.net/static/TFFM/TFFM0056.3/TFFM_first_order_trained_dense_logo.svg"
#>   ..$ hits        : chr "https://jaspar.genereg.net/static/TFFM/TFFM0056.3/TFFM_first_order_trained.hits.svg"
#>   ..$ summary_logo: chr "https://jaspar.genereg.net/static/TFFM/TFFM0056.3/TFFM_first_order_trained_summary_logo.svg"
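The returned object is a nested list, with one sub-list per model flavour (detailed and first_order). As a small sketch of how to pull the same field out of both sub-lists, the code below collects the two log_p values into one named vector. It uses a hand-built list that mimics only the relevant part of the structure, with the log_p values as displayed (rounded) in the str() output above, so that it runs offline.

```r
## Hand-built stand-in for the list returned by rba_jaspar_tffm(), keeping
## only the fields used below (log_p values as displayed by str() above)
tffm <- list(
  tffm_id = "TFFM0056.3",
  detailed = list(log_p = -6854),
  first_order = list(log_p = -7420)
)

## Collect the log_p of both model flavours in one named numeric vector
log_p <- vapply(tffm[c("detailed", "first_order")],
                function(model) model$log_p,
                numeric(1))
log_p
```

The same vapply pattern should work for the other shared fields, such as xml or summary_logo, by returning character(1) instead of numeric(1).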

0.5 How to Cite?

To cite JASPAR, please see https://jaspar.genereg.net/faq/.

To cite rbioapi: (Free access link to the article)


2 Session info

#> R version 4.2.1 (2022-06-23 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19044)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=C                          
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] rbioapi_0.7.7
#> 
#> loaded via a namespace (and not attached):
#>  [1] digest_0.6.29     R6_2.5.1          jsonlite_1.8.0    magrittr_2.0.3   
#>  [5] evaluate_0.15     httr_1.4.3        stringi_1.7.8     cachem_1.0.6     
#>  [9] rlang_1.0.4       cli_3.3.0         curl_4.3.2        rstudioapi_0.13  
#> [13] jquerylib_0.1.4   DT_0.23           bslib_0.4.0       rmarkdown_2.14   
#> [17] tools_4.2.1       stringr_1.4.0     htmlwidgets_1.5.4 crosstalk_1.2.0  
#> [21] xfun_0.31         yaml_2.3.5        fastmap_1.1.0     compiler_4.2.1   
#> [25] htmltools_0.5.3   knitr_1.39        sass_0.4.2