dm

Are you using multiple data frames or database tables in R? Organize them with dm.

Overview

dm bridges the gap in the data pipeline between individual data frames and relational databases. It’s a grammar of joined tables that provides a consistent set of verbs for consuming, creating, and deploying relational data models. For individual researchers, it broadens the scope of datasets they can work with and how they work with them. For organizations, it enables teams to quickly and efficiently create and share large, complex datasets.

dm objects encapsulate relational data models constructed from local data frames or lazy tables connected to an RDBMS. dm objects support the full suite of dplyr data manipulation verbs along with additional methods for constructing and verifying relational data models, including key selection, key creation, and rigorous constraint checking. Once a data model is complete, dm provides methods for deploying it to an RDBMS. This allows it to scale from datasets that fit in memory to databases with billions of rows.
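As a rough illustration of the construction and verification steps described above, the sketch below builds a dm object from two local data frames, declares keys, and checks the constraints. The `customers` and `orders` tables and the `shop` object are hypothetical names made up for this example; only `dm()`, `dm_add_pk()`, `dm_add_fk()`, and `dm_examine_constraints()` come from the package.

```r
library(dm)

# Two small, hypothetical data frames: each order refers to a customer.
customers <- data.frame(
  customer_id = c(1L, 2L, 3L),
  name = c("Ada", "Grace", "Linus")
)
orders <- data.frame(
  order_id = c(10L, 11L, 12L),
  customer_id = c(1L, 1L, 4L) # customer 4 is deliberately missing from `customers`
)

# Wrap the data frames in a dm object; argument names become table names.
shop <- dm(customers = customers, orders = orders)

# Key selection: declare one primary key per table.
shop <- dm_add_pk(shop, customers, customer_id)
shop <- dm_add_pk(shop, orders, order_id)

# Key creation: orders$customer_id should point at the primary key of customers.
shop <- dm_add_fk(shop, orders, customer_id, customers)

# Constraint checking: returns one row per declared key.
constraints <- dm_examine_constraints(shop)
constraints
```

Because one order refers to a customer that does not exist, the foreign key should be reported as unsatisfied while both primary keys pass.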

Features

dm makes it easy to bring an existing relational data model into your R session. Because a dm object behaves like a named list of tables, incorporating it into existing workflows requires little change. The dm interface and behavior are modeled after dplyr, so you may already be familiar with many of its verbs. dm also offers visualization of table relationships, simple joins that follow the keys defined in the model, and consistency checks on key constraints.

That’s just the tip of the iceberg. See Getting started to hit the ground running and explore all the features.

Installation

The latest stable version of the {dm} package can be installed from CRAN:

install.packages("dm")

The latest development version of {dm} can be installed from R-universe:

# Enable repository from cynkra
options(
  repos = c(
    cynkra = "https://cynkra.r-universe.dev",
    CRAN = "https://cloud.r-project.org"
  )
)
# Download and install dm in R
install.packages("dm")

or from GitHub:

# install.packages("devtools")
devtools::install_github("cynkra/dm")

Usage

Create a dm object (see Getting started for details).

library(dm)
library(dplyr) # provides count(), used on a single table further below
dm <- dm_nycflights13()
dm
#> ── Metadata ────────────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 4
#> Foreign keys: 4

A dm object behaves like a named list of tables:

names(dm)
#> [1] "airlines" "airports" "flights"  "planes"   "weather"
nrow(dm$airports)
#> [1] 86
dm$flights %>%
  count(origin)
#> # A tibble: 3 × 2
#>   origin     n
#>   <chr>  <int>
#> 1 EWR      641
#> 2 JFK      602
#> 3 LGA      518

Visualize relationships at any time:

dm %>%
  dm_draw()

Simple joins:

dm %>%
  dm_flatten_to_tbl(flights)
#> Renaming ambiguous columns: %>%
#>   dm_rename(flights, flights.year = year) %>%
#>   dm_rename(flights, flights.month = month) %>%
#>   dm_rename(flights, flights.day = day) %>%
#>   dm_rename(flights, flights.hour = hour) %>%
#>   dm_rename(airlines, airlines.name = name) %>%
#>   dm_rename(airports, airports.name = name) %>%
#>   dm_rename(planes, planes.year = year) %>%
#>   dm_rename(weather, weather.year = year) %>%
#>   dm_rename(weather, weather.month = month) %>%
#>   dm_rename(weather, weather.day = day) %>%
#>   dm_rename(weather, weather.hour = hour)
#> # A tibble: 1,761 × 48
#>    flight… fligh… fligh… dep_t… sched… dep_d… arr_t… sched… arr_d… carri… flight
#>      <int>  <int>  <int>  <int>  <int>  <dbl>  <int>  <int>  <dbl> <chr>   <int>
#>  1    2013      1     10      3   2359      4    426    437    -11 B6        727
#>  2    2013      1     10     16   2359     17    447    444      3 B6        739
#>  3    2013      1     10    450    500    -10    634    648    -14 US       1117
#>  4    2013      1     10    520    525     -5    813    820     -7 UA       1018
#>  5    2013      1     10    530    530      0    824    829     -5 UA        404
#>  6    2013      1     10    531    540     -9    832    850    -18 AA       1141
#>  7    2013      1     10    535    540     -5   1015   1017     -2 B6        725
#>  8    2013      1     10    546    600    -14    645    709    -24 B6        380
#>  9    2013      1     10    549    600    -11    652    724    -32 EV       6055
#> 10    2013      1     10    550    600    -10    649    703    -14 US       2114
#> # … with 1,751 more rows, and 37 more variables: tailnum <chr>, origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, flights.hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, airlines.name <chr>, airports.name <chr>,
#> #   lat <dbl>, lon <dbl>, alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>,
#> #   planes.year <int>, type <chr>, manufacturer <chr>, model <chr>,
#> #   engines <int>, seats <int>, speed <int>, engine <chr>, weather.year <int>,
#> #   weather.month <int>, weather.day <int>, weather.hour <int>, temp <dbl>, …
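To make clearer what the flattening step saves you, the sketch below compares a hand-written dplyr join of two tables with the dm version restricted to the same two tables. The names `flights_dm`, `manual`, and `via_dm` are made up for this example, and passing `airlines` after `flights` to limit the join to that one parent table is an assumption about how you would call it in your installed version.

```r
library(dm)
library(dplyr)

flights_dm <- dm_nycflights13()

# Manual join: the key column has to be spelled out by hand.
manual <- left_join(flights_dm$flights, flights_dm$airlines, by = "carrier")

# dm join: the key column is taken from the foreign key stored in the dm object.
via_dm <- dm_flatten_to_tbl(flights_dm, flights, airlines)

# Both are left joins starting from flights, so the row counts should agree.
nrow(manual) == nrow(via_dm)
```

The benefit grows with the number of tables: the manual version needs one `by` specification per join, whereas the dm version reads every relationship from the model.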

Check consistency:

dm %>%
  dm_examine_constraints()
#> ! Unsatisfied constraints:
#>  Table `flights`: foreign key `tailnum` into table `planes`: values of `flights$tailnum` not in `planes$tailnum`: N725MQ (6), N537MQ (5), N722MQ (5), N730MQ (5), N736MQ (5), …
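One way to inspect the rows behind a report like this is a plain dplyr anti-join on the affected tables. This is only a sketch: `flights_dm` and `orphans` are names chosen for this example, and dropping missing `tailnum` values first reflects an assumption that they should not count as violations.

```r
library(dm)
library(dplyr)

flights_dm <- dm_nycflights13()

# Flights whose tailnum has no matching row in planes.
orphans <- flights_dm$flights %>%
  filter(!is.na(tailnum)) %>%
  anti_join(flights_dm$planes, by = "tailnum")

# Most frequent offending values first.
count(orphans, tailnum, sort = TRUE)
```

From here you can decide whether to add the missing rows to `planes`, drop the orphaned flights, or leave the model as it is.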

Learn more in the Getting started article.

Getting help

If you encounter a clear bug, please file an issue with a minimal reproducible example on GitHub. For questions and other discussion, please use community.rstudio.com.


License: MIT © cynkra GmbH.

Funded by:

energie360° cynkra


Please note that the ‘dm’ project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.