Backends for taxadb

Carl Boettiger, Kari Norman

2022-05-03

taxadb is designed to work with a variety of “backends”: software that works under the hood to store and retrieve the requested data. taxadb has a default method selector that attempts to use the best backend available on your system, so you can use taxadb without having to worry about these details. However, becoming familiar with these backends can yield significant improvements in performance.

RSQLite

RSQLite is the default database backend if no suggested backend is detected. RSQLite has no external software dependencies and is installed automatically with taxadb (it is a hard dependency: an imported rather than a suggested package). The term “Lite” indicates that SQLite does not require the separate “server” and “client” software model found in traditional databases such as MySQL. SQLite is widely used in consumer software. RSQLite packages SQLite for R. It enables persistent local storage for R applications but is slower than the alternatives, and for certain operations it can be significantly slower.
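The persistence described above can be illustrated with a minimal sketch that uses DBI and RSQLite directly, independent of taxadb. The file location, the table name taxa_demo, and the toy rows are assumptions made for the example; this is not how taxadb lays out its own database.

```r
library(DBI)

# A file-backed SQLite database (hypothetical location, for illustration only)
db_file <- tempfile(fileext = ".sqlite")

# First connection: write a small toy table, then disconnect
con <- dbConnect(RSQLite::SQLite(), db_file)
dbWriteTable(con, "taxa_demo",
             data.frame(id = 1:2, name = c("Homo sapiens", "Pan troglodytes")))
dbDisconnect(con)

# Second connection: the table is still there because it lives on disk
con <- dbConnect(RSQLite::SQLite(), db_file)
stored <- dbReadTable(con, "taxa_demo")
dbDisconnect(con)

print(stored)
```

Because the data live in an ordinary file, no server process has to be running between the two connections.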

MonetDBLite & duckdb

MonetDBLite is a modern alternative to RSQLite. MonetDBLite is more powerful than SQLite (it supports a greater array of operations) and can run much faster. Filtering joins in particular can be much faster than even the in-memory operations of dplyr. Because filtering joins lie at the heart of many taxadb functions, this can yield substantial improvements in performance. Unfortunately, the R interface, MonetDBLite, was removed from CRAN in April 2019. The package can still be installed from GitHub by running devtools::install_github("hannesmuehleisen/MonetDBLite-R"), though this requires the appropriate compilers. The developer plans to replace MonetDBLite with duckdb (see https://github.com/duckdb/duckdb), but this is not yet feature complete and thus not yet fully compatible with taxadb. Because installation is more difficult, MonetDBLite is not a required dependency, but it will be used by default if taxadb detects an existing installation. duckdb support will be switched on as the first priority in the method waterfall.
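For readers unfamiliar with the term, a filtering join keeps or drops rows of one table according to whether they have a match in another, without adding columns. The sketch below shows the idea with dplyr on in-memory data frames. The table contents and the EX: identifiers are made up for illustration, and this is not taxadb's internal code.

```r
library(dplyr)

# Toy stand-in for a taxonomic table; the identifiers are invented
taxa <- data.frame(
  taxonID = c("EX:1", "EX:2", "EX:3", "EX:4"),
  scientificName = c("Homo sapiens", "Pan troglodytes",
                     "Gorilla gorilla", "Pongo abelii")
)

# Names a user wants to look up, including one with no match
wanted <- data.frame(
  scientificName = c("Homo sapiens", "Pongo abelii", "Not a taxon")
)

# semi_join: rows of `taxa` that have a match in `wanted`
matched <- semi_join(taxa, wanted, by = "scientificName")

# anti_join: rows of `wanted` that have no match in `taxa`
unmatched <- anti_join(wanted, taxa, by = "scientificName")

print(matched)
print(unmatched)
```

When the first table is a database table rather than a data frame, dplyr passes the same operation to the backend, which is where the choice of engine affects speed.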

in-memory

taxadb can also be set to use in-memory storage only, without a backend. (Note that this is distinct from using RSQLite or MonetDBLite in in-memory mode, because it uses only native R data.frames to store data.) This will tend to be faster than RSQLite but slower than MonetDBLite or duckdb. In this mode, data will persist over a single session but not between sessions (since memory is cleared when the user quits R). Note that many taxonomic tables are quite large when uncompressed, and users with less than 8-16GB of free RAM may find that their machine becomes slow or unresponsive in this mode.

Manual control of the backend engine

Users can override the automatic preferences of taxadb by setting the environment variable TAXADB_DRIVER. For example, running Sys.setenv(TAXADB_DRIVER="RSQLite") will make RSQLite the default driver, even if MonetDBLite is installed.
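The "override first, then fall back by priority" behaviour can be sketched in a few lines of base R. The function choose_driver() and its arguments are hypothetical and are not taxadb's actual implementation. The candidate order assumes duckdb is tried first, then MonetDBLite, with RSQLite as the fallback that is always installed.

```r
# Hypothetical helper, NOT taxadb's internal code: illustrates a selection
# "waterfall" in which an environment variable overrides auto-detection.
choose_driver <- function(
    candidates = c("duckdb", "MonetDBLite"),
    is_installed = function(pkg) requireNamespace(pkg, quietly = TRUE)) {
  # 1. An explicit user preference wins
  override <- Sys.getenv("TAXADB_DRIVER", unset = "")
  if (nzchar(override)) return(override)

  # 2. Otherwise take the first optional backend that is installed
  for (pkg in candidates) {
    if (is_installed(pkg)) return(pkg)
  }

  # 3. RSQLite is a hard dependency, so it is always available
  "RSQLite"
}

Sys.setenv(TAXADB_DRIVER = "RSQLite")
forced <- choose_driver()       # the override is returned regardless of what is installed
Sys.unsetenv("TAXADB_DRIVER")
detected <- choose_driver()     # the result depends on the packages on this machine

print(c(forced = forced, detected = detected))
```

Setting the variable in .Renviron rather than with Sys.setenv() would make the preference apply to every session.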

Local storage

The first time taxadb accesses a data source, it will download and store the full dataset from that provider. Users can trigger a download ahead of time by running td_create(), e.g. td_create("fb") will create a local copy of the FishBase taxonomy. If a user does not call td_create() first, taxadb simply downloads the data the first time that provider is queried, e.g. filter_name("Homo sapiens", "gbif") will first download and install GBIF if that has not been done already. These download and install operations may be slow depending on your internet connection, but need to be performed only once. Downloaded data is stored on your local hard disk and will persist between R sessions. The default location depends on the default set by your operating system (see the rappdirs package). Users can configure this location by setting the environment variable TAXADB_HOME. For example, all unit tests in the package use temporary storage by setting Sys.setenv(TAXADB_HOME=tempdir()), which is cleared out after the R session ends.
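A minimal sketch of redirecting storage for a single session follows. The directory name taxadb-scratch is an arbitrary choice for the example, and the download step is guarded so that it runs only in an interactive session with taxadb installed, since it needs a network connection and may take a while.

```r
# Point taxadb at a throwaway directory (hypothetical name) for this session
scratch <- file.path(tempdir(), "taxadb-scratch")
dir.create(scratch, showWarnings = FALSE, recursive = TRUE)
Sys.setenv(TAXADB_HOME = scratch)

# Confirm the setting that taxadb will see
print(Sys.getenv("TAXADB_HOME"))

# Trigger the FishBase download ahead of time (slow; needs internet access)
if (interactive() && requireNamespace("taxadb", quietly = TRUE)) {
  taxadb::td_create("fb")
}
```

Because the directory sits under tempdir(), anything downloaded here disappears when the R session ends.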

A user can install all available name providers up front with td_create("all"). An overview of the available scientific name providers is found in the providers vignette.

Other backends

taxadb will work just as well with any DBI-compatible database backend (Postgres, MariaDB, etc.). All taxadb functions take an argument taxadb_db, which is just a DBI connection used by dplyr. For example, we can create an in-memory RSQLite connection and use that to store data for a single session:

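# Open an in-memory SQLite database; its contents vanish when the session ends
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
taxadb::get_ids("Homo sapiens", taxadb_db = con)
DBI::dbDisconnect(con)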

Users can also call the td_connect() function to connect to taxadb’s default databases. Running td_connect() with no arguments will return the current default connection. This is a convenient way to confirm that your system is using the database engine you intended it to use. You can also use that connection to interact directly with the taxadb databases (e.g. using dplyr or DBI functions).
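One way to check which engine is in use is to inspect the class of the returned connection and list its tables. The helper describe_connection() below is hypothetical and is not part of taxadb. The sketch assumes td_connect() returns a DBI connection, which may not hold in in-memory mode, so it checks before calling DBI functions.

```r
library(DBI)

# Hypothetical helper (not part of taxadb): summarise any DBI connection
describe_connection <- function(con) {
  list(
    engine = as.character(class(con))[1],  # e.g. the driver's connection class
    tables = dbListTables(con)             # tables currently stored
  )
}

if (requireNamespace("taxadb", quietly = TRUE)) {
  con <- taxadb::td_connect()
  # Only database-backed drivers return something DBI can inspect
  if (methods::is(con, "DBIConnection")) {
    print(describe_connection(con))
  }
}
```

The same helper works on any connection you pass to taxadb yourself, such as the in-memory RSQLite connection created above.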