Getting started

This vignette explains how to use the methods available in this package.

The TidySet class

This basic example shows how to create a TidySet object to store associations between genes and sets:

library("BaseSet")
gene_lists <- list(
    geneset1 = c("A", "B"),
    geneset2 = c("B", "C", "D")
)
tidy_set <- tidySet(gene_lists)
tidy_set
#>   elements     sets fuzzy
#> 1        A geneset1     1
#> 2        B geneset1     1
#> 3        B geneset2     1
#> 4        C geneset2     1
#> 5        D geneset2     1

This is then stored internally in three slots: relations, elements, and sets.
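
The mapping from the named list to the long table of relations printed above can be mimicked in base R. This is only a sketch of the idea, not the package's internal code; the column names mirror the printed output, and `relations_df` is a name made up for this illustration.

```r
# Base R sketch: turn a named list of sets into a long table of relations.
gene_lists <- list(
    geneset1 = c("A", "B"),
    geneset2 = c("B", "C", "D")
)
relations_df <- data.frame(
    elements = unlist(gene_lists, use.names = FALSE),
    # Repeat each set name once per element it contains.
    sets     = rep(names(gene_lists), lengths(gene_lists)),
    fuzzy    = 1,  # every relation has full membership in this example
    stringsAsFactors = FALSE
)
relations_df
```

Each row is one element-set pair, which is why B appears twice: once for each set it belongs to.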

If you have more information about each element or set, it can be added:

gene_data <- data.frame(
    stat1     = c( 1,   2,   3,   4 ),
    info1     = c("a", "b", "c", "d")
)

tidy_set <- add_column(tidy_set, "elements", gene_data)
set_data <- data.frame(
    Group     = c(      100,        200 ),
    Colum     = c(     "abc",      "def")
)
tidy_set <- add_column(tidy_set, "sets", set_data)
tidy_set
#>   elements     sets fuzzy Group Colum stat1 info1
#> 1        A geneset1     1   100   abc     1     a
#> 2        B geneset1     1   100   abc     2     b
#> 3        B geneset2     1   200   def     2     b
#> 4        C geneset2     1   200   def     3     c
#> 5        D geneset2     1   200   def     4     d

Each piece of data is stored in one of the three slots, which can be accessed directly with their getter methods:

relations(tidy_set)
#>   elements     sets fuzzy
#> 1        A geneset1     1
#> 2        B geneset1     1
#> 3        B geneset2     1
#> 4        C geneset2     1
#> 5        D geneset2     1
elements(tidy_set)
#>   elements stat1 info1
#> 1        A     1     a
#> 2        B     2     b
#> 3        C     3     c
#> 4        D     4     d
sets(tidy_set)
#>       sets Group Colum
#> 1 geneset1   100   abc
#> 2 geneset2   200   def
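
The combined view printed for the whole object can be thought of as a join of these three tables. The following base R sketch rebuilds it with `merge`; it assumes the tables are linked by the `elements` and `sets` columns, as the printed output suggests, and the `*_df` names are made up for this illustration.

```r
# Base R sketch: rebuild the combined view from three separate tables.
relations_df <- data.frame(
    elements = c("A", "B", "B", "C", "D"),
    sets     = c("geneset1", "geneset1", "geneset2", "geneset2", "geneset2"),
    fuzzy    = 1,
    stringsAsFactors = FALSE
)
elements_df <- data.frame(
    elements = c("A", "B", "C", "D"),
    stat1    = c(1, 2, 3, 4),
    info1    = c("a", "b", "c", "d"),
    stringsAsFactors = FALSE
)
sets_df <- data.frame(
    sets  = c("geneset1", "geneset2"),
    Group = c(100, 200),
    Colum = c("abc", "def"),
    stringsAsFactors = FALSE
)
# Join the set-level data by "sets", then the element-level data by "elements".
combined <- merge(merge(relations_df, sets_df, by = "sets"),
                  elements_df, by = "elements")
combined
```

Storing element and set information once per element or set, rather than once per relation, avoids repeating the same values on every row.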

You can add as much information as you want; the only restriction concerns the “fuzzy” column of the relations. See the Fuzzy sets vignette.

Creating a TidySet

As you can see, it is possible to create a TidySet from a list or a data.frame, but it is also possible to create one from a matrix:

m <- matrix(c(0, 0, 1, 1, 1, 1, 0, 1, 0), ncol = 3, nrow = 3,
            dimnames = list(letters[1:3], LETTERS[1:3]))
m
#>   A B C
#> a 0 1 0
#> b 0 1 1
#> c 1 1 0
tidy_set <- tidySet(m)
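
To see which relations such a matrix encodes, it can be unfolded by hand in base R. This sketch assumes that rows are elements and columns are sets, which is consistent with the outputs shown later in this vignette, and that the cell value plays the role of the fuzzy value; it does not show how the package performs the conversion.

```r
# Base R sketch: list the relations encoded in an incidence-style matrix.
m <- matrix(c(0, 0, 1, 1, 1, 1, 0, 1, 0), ncol = 3, nrow = 3,
            dimnames = list(letters[1:3], LETTERS[1:3]))
# Positions of the nonzero cells, as (row, col) index pairs.
idx <- which(m != 0, arr.ind = TRUE)
relations_df <- data.frame(
    elements = rownames(m)[idx[, "row"]],
    sets     = colnames(m)[idx[, "col"]],
    fuzzy    = m[idx],  # assumption: the cell value is the membership
    stringsAsFactors = FALSE
)
relations_df
```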

TidySets can also be created from GeneSet and GeneSetCollection objects. Additionally, the package has several functions to read files related to sets, such as OBO files (getOBO) and GAF files (getGAF).

Converting to other formats

It is possible to extract the gene sets as a list, for use with functions such as lapply.

as.list(tidy_set)
#> $A
#> c 
#> 1 
#> 
#> $B
#> a b c 
#> 1 1 1 
#> 
#> $C
#> b 
#> 1

Or, if you need a matrix, for instance to apply network methods, you can create it with incidence:

incidence(tidy_set)
#>   A B C
#> c 1 1 0
#> a 0 1 0
#> b 0 1 1
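
As one possible example of a network-style computation, a 0/1 incidence matrix can be multiplied by its transpose to count shared memberships. The sketch below uses plain base R on a matrix with the same content as the one above; it only holds as a count for non-fuzzy sets, and the names `inc`, `shared` and `co_membership` are made up for this illustration.

```r
# Base R sketch: co-occurrence counts from a 0/1 incidence matrix
# (rows are elements, columns are sets).
inc <- matrix(c(0, 0, 1, 1, 1, 1, 0, 1, 0), ncol = 3, nrow = 3,
              dimnames = list(letters[1:3], LETTERS[1:3]))
# t(inc) %*% inc: entry [i, j] is the number of elements shared by sets i and j.
shared <- crossprod(inc)
shared
# inc %*% t(inc): entry [i, j] is the number of sets shared by elements i and j.
co_membership <- tcrossprod(inc)
co_membership
```

The diagonal of `shared` holds the size of each set, and the diagonal of `co_membership` holds the number of sets each element belongs to.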

Operations with sets

Several methods are provided to work with sets. In general you can provide a name for the set resulting from the operation; if you don’t, one is generated automatically with naming. All methods work with both fuzzy and non-fuzzy sets.

Union

You can make a union of two sets present in the same object.

BaseSet::union(tidy_set, sets = c("C", "B"), name = "D")
#>   elements sets fuzzy
#> 1        a    D     1
#> 2        b    D     1
#> 3        c    D     1

Intersection

intersection(tidy_set, sets = c("A", "B"), name = "D", keep = TRUE)
#>   elements sets fuzzy
#> 1        c    A     1
#> 2        a    B     1
#> 3        b    B     1
#> 4        c    B     1
#> 5        b    C     1
#> 6        c    D     1

The keep argument, set to TRUE above, retains all the previously existing sets alongside the new one. With keep = FALSE only the resulting set is returned:

intersection(tidy_set, sets = c("A", "B"), name = "D", keep = FALSE)
#>   elements sets fuzzy
#> 1        c    D     1

Complement

We can look for the complement of one or several sets:

complement_set(tidy_set, sets = c("A", "B"))
#>   elements sets fuzzy
#> 1        c    A     1
#> 2        a    B     1
#> 3        b    B     1
#> 4        c    B     1
#> 5        b    C     1
#> 6        c ∁A∪B     0
#> 7        a ∁A∪B     0
#> 8        b ∁A∪B     0

Observe that we haven’t provided a name for the resulting set, so one was generated automatically, but we can provide one if we prefer:

complement_set(tidy_set, sets = c("A", "B"), name = "F")
#>   elements sets fuzzy
#> 1        c    A     1
#> 2        a    B     1
#> 3        b    B     1
#> 4        c    B     1
#> 5        b    C     1
#> 6        c    F     0
#> 7        a    F     0
#> 8        b    F     0
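
The fuzzy value of 0 in the new set is what the textbook fuzzy-set definitions give: the membership in a union is the largest membership across the sets, and the membership in a complement is one minus that value. The sketch below applies these standard definitions in base R; that the package computes the complement exactly this way is an assumption, not something this vignette states.

```r
# Base R sketch of the standard fuzzy union and complement.
# Membership of each element in sets A and B (0 means "not in the set").
membership <- rbind(
    A = c(a = 0, b = 0, c = 1),
    B = c(a = 1, b = 1, c = 1)
)
# Standard fuzzy union: the largest membership across the sets.
union_AB <- apply(membership, 2, max)
# Standard fuzzy complement: one minus the membership.
complement_AB <- 1 - union_AB
complement_AB
```

Since every element belongs fully to the union of A and B, every element has a membership of 0 in its complement.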

Subtract

This is the equivalent of setdiff, but clearer:

out <- subtract(tidy_set, set_in = "A", not_in = "B", name = "A-B")
out
#>   elements sets fuzzy
#> 1        c    A     1
#> 2        a    B     1
#> 3        b    B     1
#> 4        c    B     1
#> 5        b    C     1
name_sets(out)
#> [1] "A"   "B"   "C"   "A-B"
subtract(tidy_set, set_in = "B", not_in = "A", keep = FALSE)
#>   elements sets fuzzy
#> 1        a  B∖A     1
#> 2        b  B∖A     1

Note that in the first case no element of A is absent from B, so the new set A-B is empty, but it is still stored. In the second case we keep only the elements that are present in B but not in A.
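
Both results can be cross-checked with base R's `setdiff` on a plain list holding the same sets. This is just a sanity check on the set contents, with a hypothetical `sets` list typed in by hand, and it ignores fuzzy values.

```r
# Base R check of the two subtractions, using the same set contents.
sets <- list(A = "c", B = c("a", "b", "c"), C = "b")
a_minus_b <- setdiff(sets$A, sets$B)  # elements of A that are not in B
b_minus_a <- setdiff(sets$B, sets$A)  # elements of B that are not in A
length(a_minus_b)
b_minus_a
```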

Additional information

The number of unique elements, sets and relations can be obtained using the nElements, nSets and nRelations methods.

nElements(tidy_set)
#> [1] 3
nSets(tidy_set)
#> [1] 3
nRelations(tidy_set)
#> [1] 5

The size of each gene set can be obtained using the set_size method.

set_size(tidy_set, "A")
#>   sets size probability
#> 1    A    1           1

Conversely, the number of sets associated with each gene is returned by the element_size function.

element_size(tidy_set)
#>   elements size probability
#> 1        c    2           1
#> 2        a    1           1
#> 3        b    2           1
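
For non-fuzzy sets, the sizes reported above match simple row and column sums of the incidence matrix. The following base R sketch illustrates that relationship on a matrix with the same content; it does not reproduce the probability column, and the variable names are made up for this illustration.

```r
# Base R sketch: sizes as margins of a 0/1 incidence matrix
# (rows are elements, columns are sets).
inc <- matrix(c(0, 0, 1, 1, 1, 1, 0, 1, 0), ncol = 3, nrow = 3,
              dimnames = list(letters[1:3], LETTERS[1:3]))
set_sizes <- colSums(inc)      # number of elements in each set
element_sizes <- rowSums(inc)  # number of sets each element belongs to
set_sizes
element_sizes
```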

The identifiers of elements and sets can be inspected and renamed using name_elements and name_sets:

name_elements(tidy_set)
#> [1] "c" "a" "b"
name_elements(tidy_set) <- paste0("Gene", seq_len(nElements(tidy_set)))
name_elements(tidy_set)
#> [1] "Gene1" "Gene2" "Gene3"
name_sets(tidy_set)
#> [1] "A" "B" "C"
name_sets(tidy_set) <- paste0("Geneset", seq_len(nSets(tidy_set)))
name_sets(tidy_set)
#> [1] "Geneset1" "Geneset2" "Geneset3"

Using dplyr verbs

You can also use mutate, filter and other dplyr verbs with TidySets (the only exception being group_by), but you usually need to choose which of the three slots you want to affect with activate:

library("dplyr")
m_TS <- tidy_set %>% 
  activate("relations") %>% 
  mutate(Important = runif(nRelations(tidy_set)))
m_TS
#>   elements     sets fuzzy  Important
#> 1    Gene1 Geneset1     1 0.03576740
#> 2    Gene2 Geneset2     1 0.00581606
#> 3    Gene3 Geneset2     1 0.96527650
#> 4    Gene1 Geneset2     1 0.91654958
#> 5    Gene3 Geneset3     1 0.66068543

You can use activate to select which slot the verbs modify:

set_modified <- m_TS %>% 
  activate("elements") %>% 
  mutate(Pathway = if_else(elements %in% c("Gene1", "Gene2"), 
                           "pathway1", 
                           "pathway2"))
set_modified
#>   elements     sets fuzzy  Important  Pathway
#> 1    Gene1 Geneset1     1 0.03576740 pathway1
#> 2    Gene2 Geneset2     1 0.00581606 pathway1
#> 3    Gene3 Geneset2     1 0.96527650 pathway2
#> 4    Gene1 Geneset2     1 0.91654958 pathway1
#> 5    Gene3 Geneset3     1 0.66068543 pathway2
set_modified %>% 
  deactivate() %>% # To apply a filter independently of where it is
  filter(Pathway == "pathway1")
#>   elements     sets fuzzy  Important  Pathway
#> 1    Gene1 Geneset1     1 0.03576740 pathway1
#> 2    Gene2 Geneset2     1 0.00581606 pathway1
#> 3    Gene1 Geneset2     1 0.91654958 pathway1

If you think you need group_by, this usually means that you need a new set. You can create one with group:

# A new set with the elements in pathway1
set_modified %>% 
  deactivate() %>% 
  group(name = "new", Pathway == "pathway1")
#>   elements     sets fuzzy  Important  Pathway
#> 1    Gene1 Geneset1     1 0.03576740 pathway1
#> 2    Gene2 Geneset2     1 0.00581606 pathway1
#> 3    Gene3 Geneset2     1 0.96527650 pathway2
#> 4    Gene1 Geneset2     1 0.91654958 pathway1
#> 5    Gene3 Geneset3     1 0.66068543 pathway2
#> 6    Gene1      new     1         NA pathway1
#> 7    Gene2      new     1         NA pathway1
set_modified %>% 
  group("pathway1", elements %in% c("Gene1", "Gene2"))
#>   elements     sets fuzzy  Important  Pathway
#> 1    Gene1 Geneset1     1 0.03576740 pathway1
#> 2    Gene2 Geneset2     1 0.00581606 pathway1
#> 3    Gene3 Geneset2     1 0.96527650 pathway2
#> 4    Gene1 Geneset2     1 0.91654958 pathway1
#> 5    Gene3 Geneset3     1 0.66068543 pathway2
#> 6    Gene1 pathway1     1         NA pathway1
#> 7    Gene2 pathway1     1         NA pathway1

After grouping or mutating, we might want to move a column to a different slot. We can do this with move_to:

elements(set_modified)
#>   elements  Pathway
#> 1    Gene1 pathway1
#> 2    Gene2 pathway1
#> 3    Gene3 pathway2
out <- move_to(set_modified, "elements", "relations", "Pathway")
relations(out)
#>   elements     sets fuzzy  Important  Pathway
#> 1    Gene1 Geneset1     1 0.03576740 pathway1
#> 2    Gene2 Geneset2     1 0.00581606 pathway1
#> 3    Gene3 Geneset2     1 0.96527650 pathway2
#> 4    Gene1 Geneset2     1 0.91654958 pathway1
#> 5    Gene3 Geneset3     1 0.66068543 pathway2
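
Conceptually, moving a column from the elements to the relations amounts to a lookup by element name. The base R sketch below shows that idea on plain data frames typed in by hand; dropping the column from the original table afterwards is an assumption about what "moving" means here, and none of this is the package's implementation.

```r
# Base R sketch: copy an element-level column onto each relation.
elements_df <- data.frame(
    elements = c("Gene1", "Gene2", "Gene3"),
    Pathway  = c("pathway1", "pathway1", "pathway2"),
    stringsAsFactors = FALSE
)
relations_df <- data.frame(
    elements = c("Gene1", "Gene2", "Gene3", "Gene1", "Gene3"),
    sets     = c("Geneset1", "Geneset2", "Geneset2", "Geneset2", "Geneset3"),
    fuzzy    = 1,
    stringsAsFactors = FALSE
)
# Find each relation's element in the elements table and copy its value.
idx <- match(relations_df$elements, elements_df$elements)
relations_df$Pathway <- elements_df$Pathway[idx]
# Assumption: after the move the column no longer lives in the elements table.
elements_df$Pathway <- NULL
relations_df
```

Moving in this direction always works, because each relation has exactly one element. Moving a column from relations to elements would only be unambiguous if its value were the same for every relation of a given element.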

Session info

#> R version 4.0.1 (2020-06-06)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] reactome.db_1.74.0   forcats_0.5.1        ggplot2_3.3.5       
#>  [4] GO.db_3.12.1         org.Hs.eg.db_3.12.0  AnnotationDbi_1.52.0
#>  [7] IRanges_2.24.1       S4Vectors_0.28.1     Biobase_2.50.0      
#> [10] BiocGenerics_0.36.1  dplyr_1.0.6          BaseSet_0.0.17      
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.1.1  xfun_0.24         bslib_0.2.5       purrr_0.3.4      
#>  [5] colorspace_2.0-1  vctrs_0.3.8       generics_0.1.0    htmltools_0.5.1.1
#>  [9] yaml_2.2.1        utf8_1.2.1        blob_1.2.1        XML_3.99-0.6     
#> [13] rlang_0.4.11      jquerylib_0.1.4   pillar_1.6.1      withr_2.4.2      
#> [17] glue_1.4.2        DBI_1.1.1         bit64_4.0.5       lifecycle_1.0.0  
#> [21] stringr_1.4.0     munsell_0.5.0     gtable_0.3.0      memoise_2.0.0    
#> [25] evaluate_0.14     labeling_0.4.2    knitr_1.33        fastmap_1.1.0    
#> [29] fansi_0.5.0       highr_0.9         GSEABase_1.52.1   Rcpp_1.0.6       
#> [33] xtable_1.8-4      scales_1.1.1      cachem_1.0.5      graph_1.68.0     
#> [37] annotate_1.68.0   jsonlite_1.7.2    farver_2.1.0      bit_4.0.4        
#> [41] digest_0.6.27     stringi_1.6.2     grid_4.0.1        tools_4.0.1      
#> [45] magrittr_2.0.1    sass_0.4.0        RSQLite_2.2.7     tibble_3.1.2     
#> [49] crayon_1.4.1      pkgconfig_2.0.3   ellipsis_0.3.2    assertthat_0.2.1 
#> [53] rmarkdown_2.9     httr_1.4.2        R6_2.5.0          compiler_4.0.1