downsize

William Michael Landau

2017-04-03

With the downsize package, you can toggle the test and production versions of your workflow with the flip of a TRUE/FALSE global option. This is helpful when your workflow takes a long time to run, you want to test it quickly, and unit testing is too reductionist to cover everything.

Basic downsizing

Say you want to analyze a large dataset.

big_data <- data.frame(x = rnorm(1e4), y = rnorm(1e4))

But for the sake of time, you want to test and debug your code on a smaller dataset. In your code, select your dataset with a call to downsize().

my_data <- downsize(big_data) # downsize(big = big_data)

Above, my_data becomes big_data if getOption("downsize") is FALSE or NULL (default). If getOption("downsize") is TRUE, my_data becomes head(big_data). You can toggle the global option downsize with calls to production_mode() and test_mode(), and you can override the option for a single call with downsize(..., downsize = L), where L is TRUE or FALSE. Check whether the workflow is in test or production mode with the my_mode() function.
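The switching logic described above can be pictured with a few lines of base R. This is only a minimal sketch: choose_data() is a hypothetical stand-in written for illustration, not the package's source code, and setting the option directly with options() is assumed here to have the same effect as test_mode() and production_mode().

```r
# Hypothetical stand-in for illustration only; not the downsize package itself.
choose_data <- function(big, downsize = getOption("downsize")) {
  # isTRUE() treats an unset option (NULL) like FALSE, i.e. production mode.
  if (isTRUE(downsize)) head(big) else big
}

big_data <- data.frame(x = rnorm(1e4), y = rnorm(1e4))

options(downsize = TRUE)                 # assumed equivalent of test mode
n_test <- nrow(choose_data(big_data))    # 6

options(downsize = FALSE)                # assumed equivalent of production mode
n_prod <- nrow(choose_data(big_data))    # 10000

options(downsize = TRUE)
# A per-call argument takes precedence over the global option.
n_override <- nrow(choose_data(big_data, downsize = FALSE))  # 10000

options(downsize = NULL)                 # restore the default (unset) state
```

Reading the option as the default value of an argument is what lets a single global switch drive every call while still allowing a per-call override.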

Example with test and production modes

Here is an example script in test mode.

library(downsize)
test_mode() # scales the workflow appropriately
my_mode() # shows if the workflow is in test or production mode
big_data <- data.frame(x = rnorm(1e4), y = rnorm(1e4)) # always large
my_data <- downsize(big_data) # either large or small
nrow(my_data) # responds to test_mode() and production_mode()
# ...more code, time-consuming if my_data is large...

To scale the workflow up to production mode, replace test_mode() with production_mode() and leave everything else exactly the same.

library(downsize)
production_mode() # scales the workflow appropriately
my_mode() # shows if the workflow is in test or production mode
big_data <- data.frame(x = rnorm(1e4), y = rnorm(1e4)) # always large
my_data <- downsize(big_data) # either large or small
nrow(my_data) # responds to test_mode() and production_mode()
# ...more code, time-consuming if my_data is large...

An ideal workflow has multiple calls to downsize(), all configured at once by a single call to test_mode() or production_mode() at the very beginning. This avoids tedium and human error, and it keeps the test a close approximation of the original task.

Provide your own test data

You can provide a replacement for big_data with the small argument of downsize().

library(downsize)
big_data <- data.frame(x = rnorm(1e4), y = rnorm(1e4))
small_data <- data.frame(x = runif(16), y = runif(16))
test_mode()
my_mode() # getOption("downsize") is TRUE
## [1] "test mode"
my_data <- downsize(big_data, small_data) # downsize(big = big_data, small = small_data)
identical(my_data, small_data)
## [1] TRUE

If you set small yourself, be sure that subsequent code can accept both small and big. For example, if small is a data frame and big is a matrix, your code may work fine in test mode and break in production mode. In addition, downsize() will warn you if small is identical to or bigger in memory than big (disable with downsize(..., warn = FALSE)). To be safer, use the subsetting capabilities of the downsize() function.
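One way to guard against such mismatches is to compare the two objects before handing them to downsize(). The sketch below uses only base R. check_interchangeable() is a hypothetical helper, not part of the package, and its checks (same class, same column names, smaller memory footprint) are assumptions about what "interchangeable" should mean for a data frame workflow.

```r
# Hypothetical helper for illustration; not part of the downsize package.
check_interchangeable <- function(big, small) {
  stopifnot(
    identical(class(big), class(small)),   # e.g. both are data frames
    identical(names(big), names(small))    # same column names
  )
  # TRUE when small really is smaller in memory than big.
  as.numeric(object.size(small)) < as.numeric(object.size(big))
}

big_data <- data.frame(x = rnorm(1e4), y = rnorm(1e4))
small_data <- data.frame(x = runif(16), y = runif(16))
ok <- check_interchangeable(big_data, small_data)   # TRUE

# A matrix is not a drop-in replacement for a data frame.
mismatch <- tryCatch(
  check_interchangeable(big_data, as.matrix(small_data)),
  error = function(e) "mismatch"
)
mismatch   # "mismatch"
```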

Subsetting

The command my_data <- downsize(big = big_data) is equivalent to my_data <- downsize(big = big_data, nrow = 6). There are multiple ways to subset argument big in downsize() when it is time to scale down to test mode. As in the following examples, be sure that small is set to NULL (default). Otherwise, subsetter arguments such as dim, length, nrow, and ncol will be ignored.

test_mode()
downsize(1:10, length = 2)
## [1] 1 2
m <- matrix(1:36, ncol = 6)
downsize(m, ncol = 2)
##      [,1] [,2]
## [1,]    1    7
## [2,]    2    8
## [3,]    3    9
## [4,]    4   10
## [5,]    5   11
## [6,]    6   12
downsize(m, nrow = 2)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    7   13   19   25   31
## [2,]    2    8   14   20   26   32
downsize(m, dim = c(2, 2))
##      [,1] [,2]
## [1,]    1    7
## [2,]    2    8
downsize(data.frame(x = 1:10, y = 1:10), nrow = 5)
##   x y
## 1 1 1
## 2 2 2
## 3 3 3
## 4 4 4
## 5 5 5
x <- array(0, dim = c(10, 100, 2, 300, 12))
dim(x)
## [1]  10 100   2 300  12
my_array <- downsize(x, dim = rep(3, 5))
dim(my_array)
## [1] 3 3 2 3 3
my_array <- downsize(x, dim = c(1, 4))
dim(my_array)
## [1]   1   4   2 300  12
my_array <- downsize(x, ncol = 1)
dim(my_array)
## [1]  10   1   2 300  12
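The array results above suggest two rules: a requested extent is capped at the real extent of that dimension, and dimensions beyond the length of dim are left whole. The following base R sketch reproduces that behavior under those assumptions. head_array() and its sizes argument are hypothetical names, and the package's own implementation may differ.

```r
# Hypothetical helper for illustration; not the package's implementation.
head_array <- function(x, sizes) {
  full <- dim(x)
  stopifnot(length(sizes) <= length(full))
  keep <- full
  # Cap each requested extent at the real one; later dimensions stay whole.
  keep[seq_along(sizes)] <- pmin(sizes, full[seq_along(sizes)])
  idx <- lapply(keep, seq_len)
  # Equivalent to x[idx[[1]], idx[[2]], ..., drop = FALSE].
  do.call(`[`, c(list(x), idx, list(drop = FALSE)))
}

x_small <- array(0, dim = c(10, 20, 2, 30, 12))
d1 <- dim(head_array(x_small, sizes = rep(3, 5)))   # 3 3 2 3 3
d2 <- dim(head_array(x_small, sizes = c(1, 4)))     # 1 4 2 30 12
```

Passing drop = FALSE keeps all five dimensions even when one of them shrinks to length 1.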

Set random to TRUE to take a random subset of your data rather than just the first few rows or columns.

set.seed(6)
downsize(m, ncol = 2, random = TRUE)
##      [,1] [,2]
## [1,]   19   25
## [2,]   20   26
## [3,]   21   27
## [4,]   22   28
## [5,]   23   29
## [6,]   24   30
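For intuition, a random column subset can be drawn by hand in base R. This sketch assumes that columns are sampled without replacement and kept in their original order. It is not the package's code, and it is not guaranteed to select the same columns as downsize() for a given seed.

```r
set.seed(6)
m <- matrix(1:36, ncol = 6)

# Assumption: sample two distinct column indices, then restore their order.
cols <- sort(sample(ncol(m), size = 2))

# drop = FALSE keeps the result a matrix even for a single column.
random_cols <- m[, cols, drop = FALSE]
dim(random_cols)
```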

Interchange code blocks

You can interchange entire blocks of code based on the scaling/mode of the workload.

test_mode()
downsize(big = {a = 1; a + 10}, small = {a = 1; a + 1})
## [1] 2
production_mode()
downsize(big = {a = 1; a + 10}, small = {a = 1; a + 1})
## [1] 11

Variables set in code blocks are available after calls to downsize().

test_mode()
tmp <- downsize(
  big = {
    x = "long code"
    y = 1000
  }, 
  small = {
    x = "short code"
    y = 3.14
  })
x == "short code" && y == 3.14
## [1] TRUE
production_mode()
tmp <- downsize(
  big = {
    x = "long code"
    y = 1000
  }, 
  small = {
    x = "short code"
    y = 3.14
  })
x == "long code" && y == 1000
## [1] TRUE
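Behavior like this is consistent with R's lazy evaluation: a function argument is an unevaluated promise until the function uses it, and it is evaluated in the caller's environment. The base R sketch below illustrates the general mechanism with a hypothetical pick() function. It is not a claim about how downsize() itself is implemented.

```r
# Hypothetical stand-in; the package's actual implementation may differ.
pick <- function(big, small, use_small) {
  # Only the chosen argument is forced; the other block is never run.
  if (use_small) small else big
}

x <- "unset"
result <- pick(
  big = {
    x <- "long code"
    1000
  },
  small = {
    x <- "short code"
    3.14
  },
  use_small = TRUE
)
x        # "short code": the assignment happened in the calling environment
result   # 3.14: the value of the last expression in the chosen block
```

Because the unused block is never evaluated, an expensive production-only computation placed in big costs nothing in test mode.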

Help and troubleshooting

Use the help_downsize() function to obtain a collection of helpful links. For troubleshooting, please refer to TROUBLESHOOTING.md on the GitHub page for instructions.