Cleaning and exploring NCBI data with the bioseq package

The purpose of this vignette is to illustrate the use of the bioseq package with a practical case. For a general introduction to the package please see the vignette Introduction to the bioseq package. Here we show how the functions of bioseq can be used to clean and explore real biological data. We take as example a set of rbcL sequences of Fragilaria species retrieved from NCBI (these data ared distributed with the package, see ?fragilaria). Fragilaria is a genus of freshwater and saltwater diatoms. Diatoms are a major group of microscopic algae.

Data obtained from generic repository like NCBI or BOLD often need to be filtered and cleaned before they can be used in a project. Here we show how this can be accomplished using bioseq and more particularly how to perform some basic cleaning, extract a specific barcode region from raw sequences, and use functions from other packages. Some of the code chunks presented here can be adapted to reconstruct a phylogeny for a particular taxonomic group or to prepare a reference database against which metabarcoding sequencing reads will be matched for taxonomic identification.

Fragilaria pararumpens. Photo by Dr Maria Kahlert

Loading packages

We start by loading the package bioseq. It is assumed that you already intalled the package either from CRAN using the function install.packages("bioseq) or from GitHub using remotes::install_github("fkeck/bioseq").

library(bioseq)

We additionaly load the tidyverse package. This is a meta-package which loads several useful packages that we will need for later. In particular we will use tibble and dplyr to work with tibbles. This allows to work with other types of data and to centralize everything in one place. To learn more about the tidyverse and these packages you can visit.

library(tidyverse)

Loading data

Genetic data are often delivered in FASTA format. In this example we will use rbcL sequences of Fragilaria species retrieved from NCBI and directly available as unparsed FASTA in the package. We can use the function read_fasta to directly read a FASTA file into R. Here we read a vector of character already available in R, but most of the time the argument file will be used to pass a path to a file available locally or via the network.

data(fragilaria, package = "bioseq")
fra_data <- read_fasta(fragilaria)

fra_data
#> DNA vector of 41 sequences
#> MK576039.1 Fragil...  TTCATCGCTTATGAATGTGATTTATTTGAAGAAGGTTCTTTAGCTAACTTAACAGCTTCT... + 575 bases
#> MK672908.1 Fragil...  GCTTTCATCGCTTACGAATGTGATTTATTTGAAGAAGGTTCTTTAGCTAACTTAACAGCT... + 1107 bases
#> MK672907.1 Fragil...  ATGTCTCAATCTGTATCAGAACGGACTCGAATCAAAAGTGACCGTTACGAATCTGGTGTA... + 1386 bases
#> MK672906.1 Fragil...  ATGTCTCAATCTGTATCAGAACGGACTCGAATCAAAAGTGACCGTTACGAATCTGGTGTA... + 1386 bases
#> KT072939.1 Fragil...  TTAGCATTATTCCGTATTACTCCACAACCAGGTGTAGATCCAGTAGAAGCAGCAGCAGCA... + 744 bases
#> KT072928.1 Fragil...  TTAGCATTATTCCGTATTACTCCACAACCAGGTGTAGATCCAGTAGAAGCAGCAGCAGCA... + 744 bases
#> KT072903.1 Fragil...  TTAGCATTATTCCGTATTACTCCACAACCAGGTGTAGATCCAGTAGAAGCAGCAGCAGCA... + 744 bases
#> MF092945.1 Fragil...  AGTGACCGTTACGAATCTGGTGTAATCCCTTACGCTAAAATGGGTTACTGGGATGCTTCA... + 1263 bases
#> MF092943.1 Fragil...  AGTGACCGTTACGAATCTGGTGTAATCCCTTACGCTAAAATGGGTTACTGGGATGCTTCA... + 1263 bases
#> JQ003574.1 Synedr...  AAAAANTATGGGCGTGTAGTATACGAAGGTTTAAAAGGTGGATTAGACTTCTTAAAAGAT... + 390 bases
#> JQ003570.1 Fragil...  ATTCCACACTCATACTTAAAAACATTCCAAGGTCCAGCAACTGGTATCATCGTAGAACGT... + 516 bases
#> JQ003569.1 Synedr...  AAAAACTACGGTCGTGTAGTATATGAAGGTTTAAAAGGTGGTTTAGACTTCTTAAAAGAT... + 390 bases
#> ... with 29 more sequences.

According to the output, we have a DNA vector with 41 sequences. Those sequences were retrieved directly from NCBI. Therefore, they are not aligned and they have different lenght. It is easy to get the range of lengths with the following command:

seq_nchar(fra_data) %>% range()
#> [1]  450 1510

As explained in the introduction, we want to analyze the data in a dataframe as it present several advantages. The bioseq package has been designed to work smoothly with tibbles, an extension of the standard dataframe. Below, we use the as_tibble function to turn the DNA vector into a tibble with two columns (one for the names and one for the sequences). We will also clean the names and split them into two columns. Finally we create an extra column with the length (number of characters) of each sequence.

fra_data <- tibble(label = names(fra_data), sequence = fra_data)
fra_data <- fra_data %>% 
  mutate(genbank_id = str_extract(label, "([^\\s]+)"),
         taxa = str_extract(label, "(?<= ).*")) %>% 
  select(genbank_id, taxa, sequence)

fra_data <- fra_data %>% 
  mutate(n_base = seq_nchar(sequence))

fra_data
#> # A tibble: 41 x 4
#>    genbank_id taxa           sequence                                     n_base
#>    <chr>      <chr>          <DNA>                                         <int>
#>  1 MK576039.1 Fragilaria MK… TTCATCGCTTATGAATGTGATTTATTTGAAGAAGGTTCTTTAG…    635
#>  2 MK672908.1 Fragilaria jo… GCTTTCATCGCTTACGAATGTGATTTATTTGAAGAAGGTTCTT…   1167
#>  3 MK672907.1 Fragilaria he… ATGTCTCAATCTGTATCAGAACGGACTCGAATCAAAAGTGACC…   1446
#>  4 MK672906.1 Fragilaria ag… ATGTCTCAATCTGTATCAGAACGGACTCGAATCAAAAGTGACC…   1446
#>  5 KT072939.1 Fragilaria ru… TTAGCATTATTCCGTATTACTCCACAACCAGGTGTAGATCCAG…    804
#>  6 KT072928.1 Fragilaria ca… TTAGCATTATTCCGTATTACTCCACAACCAGGTGTAGATCCAG…    804
#>  7 KT072903.1 Fragilaria cr… TTAGCATTATTCCGTATTACTCCACAACCAGGTGTAGATCCAG…    804
#>  8 MF092945.1 Fragilaria MF… AGTGACCGTTACGAATCTGGTGTAATCCCTTACGCTAAAATGG…   1323
#>  9 MF092943.1 Fragilaria MF… AGTGACCGTTACGAATCTGGTGTAATCCCTTACGCTAAAATGG…   1323
#> 10 JQ003574.1 Synedra radia… AAAAANTATGGGCGTGTAGTATACGAAGGTTTAAAAGGTGGAT…    450
#> # … with 31 more rows

Cropping sequences

In this example we are only interested in a particular region of the sequences, a short fragment used in many diatom metabarcoding studies. To extract the region of interest we can provide two nucleotide position but this solution require the sequences to be aligned. The alternative solution is to crop using nucleotide patterns. Here, we will use the forward and reverse primers delimiting the barcode region to crop the sequences. First, we create two DNA vectors one for the forward and one for the reverse primers:

FWD <- dna("AGGTGAAGTAAAAGGTTCWTACTTAAA",
           "AGGTGAAGTTAAAGGTTCWTAYTTAAA",
           "AGGTGAAACTAAAGGTTCWTACTTAAA")

REV <- dna("CAGTWGTWGGTAAATTAGAAGG",
           "CTGTTGTWGGTAAATTAGAAGG")

You may have noticed that some characters of the primers are ambiguous which means they can represent several nucleotides. To get all the combinations represented by those ambiguous primer sequences we can use the seq_disambiguate_IUPAC() function. For example for the forward primers:

seq_disambiguate_IUPAC(FWD)
#> [[1]]
#> DNA vector of 2 sequences
#> >   AGGTGAAGTAAAAGGTTCATACTTAAA
#> >   AGGTGAAGTAAAAGGTTCTTACTTAAA
#> 
#> [[2]]
#> DNA vector of 4 sequences
#> >   AGGTGAAGTTAAAGGTTCATACTTAAA
#> >   AGGTGAAGTTAAAGGTTCTTACTTAAA
#> >   AGGTGAAGTTAAAGGTTCATATTTAAA
#> >   AGGTGAAGTTAAAGGTTCTTATTTAAA
#> 
#> [[3]]
#> DNA vector of 2 sequences
#> >   AGGTGAAACTAAAGGTTCATACTTAAA
#> >   AGGTGAAACTAAAGGTTCTTACTTAAA

We can see that the three ambiguous sequences used as forward primers actually represent eight different sequences. Fortunately, the functions of bioseq can automatically recognize ambiguous nucleotides and interpret them as such. This is why it is always recommended to construct DNA/RNA/AA vectors instead of simple character vector when it is possible.

Now we crop the sequences using the forward and reverse primers as delimiters. We can expect that the returned sequences will be aligned because they correspond to a very specific region with a fixed length.

fra_data <- fra_data %>% 
  mutate(barcode = seq_crop_pattern(sequence,
                                    pattern_in = list(FWD),
                                    pattern_out = list(REV)))

fra_data
#> # A tibble: 41 x 5
#>    genbank_id taxa       sequence                 n_base barcode                
#>    <chr>      <chr>      <DNA>                     <int> <DNA>                  
#>  1 MK576039.1 Fragilari… TTCATCGCTTATGAATGTGATTT…    635 NA                     
#>  2 MK672908.1 Fragilari… GCTTTCATCGCTTACGAATGTGA…   1167 AGGTGAAGTTAAAGGTTCTTAC…
#>  3 MK672907.1 Fragilari… ATGTCTCAATCTGTATCAGAACG…   1446 AGGTGAAGTTAAAGGTTCTTAC…
#>  4 MK672906.1 Fragilari… ATGTCTCAATCTGTATCAGAACG…   1446 AGGTGAAGTTAAAGGTTCTTAC…
#>  5 KT072939.1 Fragilari… TTAGCATTATTCCGTATTACTCC…    804 NA                     
#>  6 KT072928.1 Fragilari… TTAGCATTATTCCGTATTACTCC…    804 NA                     
#>  7 KT072903.1 Fragilari… TTAGCATTATTCCGTATTACTCC…    804 NA                     
#>  8 MF092945.1 Fragilari… AGTGACCGTTACGAATCTGGTGT…   1323 AGGTGAAGTTAAAGGTTCTTAC…
#>  9 MF092943.1 Fragilari… AGTGACCGTTACGAATCTGGTGT…   1323 AGGTGAAGTTAAAGGTTCTTAC…
#> 10 JQ003574.1 Synedra r… AAAAANTATGGGCGTGTAGTATA…    450 NA                     
#> # … with 31 more rows

As expected, the barcode sequences seem to be aligned. We can also see that this operation has produced several NAs which correspond to sequences where the primers were not detected, probably because the barcode region was not included in the sequence. We could simply exclude the NA values, but it is safer to filter only the sequences whose length match exactly the expected length of the barcode (i.e. 312 bp).

fra_data <- fra_data %>% 
  filter(seq_nchar(barcode) == 312)

fra_data
#> # A tibble: 23 x 5
#>    genbank_id taxa      sequence                  n_base barcode                
#>    <chr>      <chr>     <DNA>                      <int> <DNA>                  
#>  1 MK672908.1 Fragilar… GCTTTCATCGCTTACGAATGTGAT…   1167 AGGTGAAGTTAAAGGTTCTTAC…
#>  2 MK672907.1 Fragilar… ATGTCTCAATCTGTATCAGAACGG…   1446 AGGTGAAGTTAAAGGTTCTTAC…
#>  3 MK672906.1 Fragilar… ATGTCTCAATCTGTATCAGAACGG…   1446 AGGTGAAGTTAAAGGTTCTTAC…
#>  4 MF092945.1 Fragilar… AGTGACCGTTACGAATCTGGTGTA…   1323 AGGTGAAGTTAAAGGTTCTTAC…
#>  5 MF092943.1 Fragilar… AGTGACCGTTACGAATCTGGTGTA…   1323 AGGTGAAGTTAAAGGTTCTTAC…
#>  6 EU090046.1 Fragilar… TGACCGTTACGAATCTGGTGTAAT…   1330 AGGTGAAACTAAAGGTTCTTAC…
#>  7 EU090045.1 Fragilar… TGACCGTTACGAATCTGGTGTAAT…   1330 AGGTGAAACTAAAGGTTCTTAC…
#>  8 EU090044.1 Fragilar… TGACCGTTACGAATCTGGTGTAAT…   1330 AGGTGAAACTAAAGGTTCTTAC…
#>  9 EU090043.1 Fragilar… TGACCGTTACGAATCTGGTGTAAT…   1328 AGGTGAAACTAAAGGTTCTTAC…
#> 10 EU090042.1 Fragilar… ACCGTTACGAATCTGGTGTAATCC…   1329 AGGTGAAACTAAAGGTTCTTAC…
#> # … with 13 more rows

Consensus sequences and phylogeny

We now want to reconstruct a phylogenetic tree for the sequences. We can include all the sequences in the tree but for clarity we want keep only one sequence per taxa. Instead of selecting one sequence randomly we compute a consensus sequence for each taxa using the function seq_consensus(). There are different methods to compute a consensus sequence implemented in the function, you can check the manual for details. The default method used below uses the majority rule.

fra_consensus <- fra_data %>% 
  group_by(taxa) %>% 
  summarise(consensus_barcode = seq_consensus(barcode))
#> `summarise()` ungrouping output (override with `.groups` argument)

fra_consensus
#> # A tibble: 20 x 2
#>    taxa              consensus_barcode                                          
#>    <chr>             <DNA>                                                      
#>  1 Fragilaria EU090… AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTATGGAAGAAGTTTAC…
#>  2 Fragilaria EU090… AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTATGGAAGAAGTTTAC…
#>  3 Fragilaria EU090… AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTATGGAAGAAGTTTAC…
#>  4 Fragilaria EU090… AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTATGGAAGAAGTTTAC…
#>  5 Fragilaria EU090… AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTATGGAAGAAGTTTAC…
#>  6 Fragilaria EU090… AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTATGGAAGAAGTTTAC…
#>  7 Fragilaria EU090… AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTATGGAAGAAGTTTAC…
#>  8 Fragilaria KC969… AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTATGGAAGAAGTTTAC…
#>  9 Fragilaria KC969… AGGTGAAACTAAAGGTTCTTACTTAAACGTTACAGCTGCTACTATGGAAGATGTATAC…
#> 10 Fragilaria MF092… AGGTGAAGTTAAAGGTTCTTACTTAAACATTACTGCTGCTACTATGGATGAACTTTAC…
#> 11 Fragilaria MF092… AGGTGAAGTTAAAGGTTCTTACTTAAACATCACAGCTGCTACTATGGAAGAACTTTAC…
#> 12 Fragilaria agnes… AGGTGAAGTTAAAGGTTCTTACTTAAACATCACTGCTGCTACTATGGAAGAACTTTAC…
#> 13 Fragilaria bidens AGGTGAAGTTAAAGGTTCTTACTTAAACATCACTGCTGCTACTATGGAAGAACTTTAC…
#> 14 Fragilaria croto… AGGTGAAGTTAAAGGTTCTTACTTAAACATCACTGCTGCTACTATGGAAGAACTTTAC…
#> 15 Fragilaria famel… AGGTGAAGTTAAAGGTTCTTACTTAAACATCACTGCTGCTACTATGGAAGAAGTTTAC…
#> 16 Fragilaria heath… AGGTGAAGTTAAAGGTTCTTACTTAAACATCACTGCTGCTACTATGGAAGAGCTTTAC…
#> 17 Fragilaria joach… AGGTGAAGTTAAAGGTTCTTACTTAAACATCACTGCTGCTACTATGGAAGAACTTTAC…
#> 18 Fragilaria permi… AGGTGAAGTTAAAGGTTCTTACTTAAACATTACTGCTGCTACTATGGAAGAACTTTAC…
#> 19 Fragilaria rumpe… AGGTGAAGTTAAAGGTTCTTACTTAAACATCACGGCTGCTACTATGGAAGAACTTTAC…
#> 20 Fragilaria stria… AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTATGGAAGAAGTTTAC…

Next, we will use ape to do a quick phylogenetic reconstruction of these sequences. We can use the function as_DNAbin to convert a tibble into a DNAbin object and then use the functions of ape to reconstruct and plot a phylogenetic tree with the distance-based BIONJ algorithm.

fra_consensus %>% 
  as_DNAbin(consensus_barcode, taxa) %>% 
  ape::dist.dna() %>% 
  ape::bionj() %>% 
  plot()

Clustering sequences

The tree above suggests that several sequences are exactly the same. This can be quickly assessed:

duplicated(fra_consensus$consensus_barcode)
#>  [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
#> [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

Indeed, several sequences are duplicates and we may want to keep only unique sequences. We could select only unique sequence but here we rather use the seq_cluster function to cluster sequences based on their similarity.

fra_consensus <- 
  fra_consensus %>% 
  mutate(cluster = seq_cluster(consensus_barcode,
                               threshold = 0.001))
fra_consensus
#> # A tibble: 20 x 3
#>    taxa             consensus_barcode                                    cluster
#>    <chr>            <DNA>                                                  <int>
#>  1 Fragilaria EU09… AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTATGGAAGA…       1
#>  2 Fragilaria EU09… AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTATGGAAGA…       2
#>  3 Fragilaria EU09… AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTATGGAAGA…       2
#>  4 Fragilaria EU09… AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTATGGAAGA…       2
#>  5 Fragilaria EU09… AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTATGGAAGA…       2
#>  6 Fragilaria EU09… AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTATGGAAGA…       2
#>  7 Fragilaria EU09… AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTATGGAAGA…       2
#>  8 Fragilaria KC96… AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTATGGAAGA…       1
#>  9 Fragilaria KC96… AGGTGAAACTAAAGGTTCTTACTTAAACGTTACAGCTGCTACTATGGAAGA…       3
#> 10 Fragilaria MF09… AGGTGAAGTTAAAGGTTCTTACTTAAACATTACTGCTGCTACTATGGATGA…       4
#> 11 Fragilaria MF09… AGGTGAAGTTAAAGGTTCTTACTTAAACATCACAGCTGCTACTATGGAAGA…       5
#> 12 Fragilaria agne… AGGTGAAGTTAAAGGTTCTTACTTAAACATCACTGCTGCTACTATGGAAGA…       6
#> 13 Fragilaria bide… AGGTGAAGTTAAAGGTTCTTACTTAAACATCACTGCTGCTACTATGGAAGA…       7
#> 14 Fragilaria crot… AGGTGAAGTTAAAGGTTCTTACTTAAACATCACTGCTGCTACTATGGAAGA…       8
#> 15 Fragilaria fame… AGGTGAAGTTAAAGGTTCTTACTTAAACATCACTGCTGCTACTATGGAAGA…       9
#> 16 Fragilaria heat… AGGTGAAGTTAAAGGTTCTTACTTAAACATCACTGCTGCTACTATGGAAGA…      10
#> 17 Fragilaria joac… AGGTGAAGTTAAAGGTTCTTACTTAAACATCACTGCTGCTACTATGGAAGA…      11
#> 18 Fragilaria perm… AGGTGAAGTTAAAGGTTCTTACTTAAACATTACTGCTGCTACTATGGAAGA…      12
#> 19 Fragilaria rump… AGGTGAAGTTAAAGGTTCTTACTTAAACATCACGGCTGCTACTATGGAAGA…      13
#> 20 Fragilaria stri… AGGTGAAACTAAAGGTTCTTACTTAAACATCACAGCTGCTACTATGGAAGA…       1

Finally we recompute a consensus sequence for each cluster and we reconstruct the phylogenetic tree.

fra_consensus <-
  fra_consensus %>% 
  group_by(cluster) %>% 
  summarise(taxa_group = paste(taxa, collapse = "/"),
            consensus_barcode = seq_consensus(consensus_barcode))
#> `summarise()` ungrouping output (override with `.groups` argument)

fra_consensus %>% 
  as_DNAbin(consensus_barcode, taxa_group) %>% 
  ape::dist.dna() %>% 
  ape::bionj() %>% 
  plot()

It is important to note that this tree is not an accurate phylogenetic reconstruction but it can help to visualise the hierarchical relationships between the sequences. This is particularly useful with larger datasets to detect polyphyletic clades unresolved by the barcode region.

Exporting data

When we are happy with the results, we usually want to export the data. Here, we could export directly the dataframe using write_csv(). However, it might be more convenient to export sequences in FASTA to open them in an other software later. This can be achieved using the function write_fasta().

fra_consensus %>% 
  select(taxa_group, consensus_barcode) %>% 
  deframe() %>% 
  write_fasta("my_sequences.fasta")