prt

Building on data.frame serialization provided by fst, prt offers an interface for working with partitioned data.frames, saved as individual fst files.

Installation

You can install the development version of prt from GitHub by running

source("https://install-github.me/nbenn/prt")

Alternatively, if you have the remotes package available, you can install the latest release by calling install_github() as follows:

# install.packages("remotes")
remotes::install_github("nbenn/prt@*release")

Short demo

A prt object can be created either by calling new_prt() on a list of previously created fst files or by coercing a data.frame object to prt using as_prt().

tmp <- tempfile()
dir.create(tmp)

flights <- as_prt(nycflights13::flights, n_chunks = 2L, dir = tmp)
#> fstcore package v0.9.12
#> (OpenMP was not detected, using single threaded mode)

print(flights)
#> # A prt:        336,776 × 19
#> # Partitioning: [168,388, 168,388] rows
#>          year month   day dep_time sched_dep_time dep_delay arr_time
#>         <int> <int> <int>    <int>          <int>     <dbl>    <int>
#>       1  2013     1     1      517            515         2      830
#>       2  2013     1     1      533            529         4      850
#>       3  2013     1     1      542            540         2      923
#>       4  2013     1     1      544            545        -1     1004
#>       5  2013     1     1      554            600        -6      812
#>       …
#> 336,772  2013     9    30       NA           1455        NA       NA
#> 336,773  2013     9    30       NA           2200        NA       NA
#> 336,774  2013     9    30       NA           1210        NA       NA
#> 336,775  2013     9    30       NA           1159        NA       NA
#> 336,776  2013     9    30       NA            840        NA       NA
#> # … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>

When a prt object is created from a data.frame, the specified number of files is written to the directory of choice (by default, a newly created directory within tempdir()).

list.files(tmp)
#> [1] "1.fst" "2.fst"
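
The other route mentioned above, calling new_prt() on fst files that already exist, might look like the following minimal sketch. It assumes that new_prt() accepts a character vector of file paths and that nrow() works on the resulting object. The mtcars data and the two-way split are chosen purely for illustration.

```r
library(prt)

# Illustrative data only: split mtcars (32 rows) into two equal parts
dir <- tempfile()
dir.create(dir)
parts <- split(mtcars, rep(1:2, each = 16))

# Write each part to its own fst file
files <- file.path(dir, paste0(seq_along(parts), ".fst"))
for (i in seq_along(parts)) {
  fst::write_fst(parts[[i]], files[i])
}

# Assumption: new_prt() takes the vector of file paths
cars <- new_prt(files)
n_rows <- nrow(cars)

unlink(dir, recursive = TRUE)
```

Writing the files by hand like this is mainly useful when the partitions are produced separately, for example by an upstream job that emits one file per batch.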

Subsetting and printing are closely modeled after tibble, and behavior that deviates from that of tibble will most likely be considered a bug (please report it). Design choices that do set a prt object apart from a tibble include returning a data.table as the result of any subsetting operation and the complete disregard for row.names.

In addition to standard subsetting operations involving the functions `[`(), `[[`() and `$`(), the base generic function subset() is implemented for the prt class, enabling subsetting operations using non-standard evaluation. Combined with random access to tables stored as fst files, this can make data access more efficient in cases where only a subset of the data is of interest.

jan <- flights[flights$month == 1, ]
identical(jan, subset(flights, month == 1))
#> [1] TRUE
print(jan)
#>        year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>     1: 2013     1   1      517            515         2      830            819
#>     2: 2013     1   1      533            529         4      850            830
#>     3: 2013     1   1      542            540         2      923            850
#>     4: 2013     1   1      544            545        -1     1004           1022
#>     5: 2013     1   1      554            600        -6      812            837
#>    ---                                                                         
#> 27000: 2013     1  31       NA           1325        NA       NA           1505
#> 27001: 2013     1  31       NA           1200        NA       NA           1430
#> 27002: 2013     1  31       NA           1410        NA       NA           1555
#> 27003: 2013     1  31       NA           1446        NA       NA           1757
#> 27004: 2013     1  31       NA            625        NA       NA            934
#>        arr_delay carrier flight tailnum origin dest air_time distance hour
#>     1:        11      UA   1545  N14228    EWR  IAH      227     1400    5
#>     2:        20      UA   1714  N24211    LGA  IAH      227     1416    5
#>     3:        33      AA   1141  N619AA    JFK  MIA      160     1089    5
#>     4:       -18      B6    725  N804JB    JFK  BQN      183     1576    5
#>     5:       -25      DL    461  N668DN    LGA  ATL      116      762    6
#>    ---                                                                    
#> 27000:        NA      MQ   4475  N730MQ    LGA  RDU       NA      431   13
#> 27001:        NA      MQ   4658  N505MQ    LGA  ATL       NA      762   12
#> 27002:        NA      MQ   4491  N734MQ    LGA  CLE       NA      419   14
#> 27003:        NA      UA    337    <NA>    LGA  IAH       NA     1416   14
#> 27004:        NA      UA   1497    <NA>    LGA  IAH       NA     1416    6
#>        minute           time_hour
#>     1:     15 2013-01-01 05:00:00
#>     2:     29 2013-01-01 05:00:00
#>     3:     40 2013-01-01 05:00:00
#>     4:     45 2013-01-01 05:00:00
#>     5:      0 2013-01-01 06:00:00
#>    ---                           
#> 27000:     25 2013-01-31 13:00:00
#> 27001:      0 2013-01-31 12:00:00
#> 27002:     10 2013-01-31 14:00:00
#> 27003:     46 2013-01-31 14:00:00
#> 27004:     25 2013-01-31 06:00:00
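
The column accessors mentioned above can be tried on a smaller table as well. The following is a minimal sketch, assuming that `[[`() and `$`() return a plain column vector as they would for a tibble. The mtcars data stands in for a real data set.

```r
library(prt)

# Illustrative data only: mtcars stored as two fst partitions
dir <- tempfile()
dir.create(dir)
cars <- as_prt(mtcars, n_chunks = 2L, dir = dir)

# Single-column extraction, by name
mpg_a <- cars[["mpg"]]
mpg_b <- cars$mpg

# Row filtering through non-standard evaluation
four <- subset(cars, cyl == 4)
four_n <- nrow(four)
is_dt <- inherits(four, "data.table")

unlink(dir, recursive = TRUE)
```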

A subsetting operation on a prt object yields a data.table. If the full table is of interest, a prt-specific implementation of the as.data.table() generic is available.
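
As a minimal sketch of that conversion, again using mtcars purely for illustration, the full table can be read back into memory in one call. This is only advisable when the data fits in memory.

```r
library(prt)

# Illustrative data only: mtcars stored as two fst partitions
dir <- tempfile()
dir.create(dir)
cars <- as_prt(mtcars, n_chunks = 2L, dir = dir)

# Read all partitions and combine them into a single data.table
full <- data.table::as.data.table(cars)
full_dim <- dim(full)

unlink(dir, recursive = TRUE)
```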

unlink(tmp, recursive = TRUE)