Data Step Operations

Normally, R processes data column-by-column. The data step allows you to process data row-by-row. Row-by-row processing of data is useful when you have related columns, and wish to perform conditional logic on those columns. The datastep() function allows you to realize this style of data processing. It is particularly advantageous when you wish to perform deeply nested conditional logic. It is also very useful for by-group processing.

Example 1: Simple Data Step

Here is an example of a simple data step:

library(libr)

# Add some columns to mtcars using data step logic
df <- datastep(mtcars[1:10, 1:3], {

    if (mpg >= 20) 
      mpgcat <- "High"
    else 
      mpgcat <- "Low"
      
    recdt <- as.Date("1974-06-10")
    
    if (cyl == 8)
      is8cyl <- TRUE

  })
  
# View results  
df
#                    mpg cyl  disp      recdt mpgcat is8cyl
# Mazda RX4         21.0   6 160.0 1974-06-10   High     NA
# Mazda RX4 Wag     21.0   6 160.0 1974-06-10   High     NA
# Datsun 710        22.8   4 108.0 1974-06-10   High     NA
# Hornet 4 Drive    21.4   6 258.0 1974-06-10   High     NA
# Hornet Sportabout 18.7   8 360.0 1974-06-10    Low   TRUE
# Valiant           18.1   6 225.0 1974-06-10    Low     NA
# Duster 360        14.3   8 360.0 1974-06-10    Low   TRUE
# Merc 240D         24.4   4 146.7 1974-06-10   High     NA
# Merc 230          22.8   4 140.8 1974-06-10   High     NA
# Merc 280          19.2   6 167.6 1974-06-10    Low     NA

Keep, Drop, and Rename

The data step has parameters to perform basic shaping of the resulting data frame. These parameters are ‘keep’, ‘drop’, and ‘rename’. For example, the above data step could have been performed by sending all columns into the data step, and keeping only the desired columns. Using the keep parameter also allows you to order the resulting columns.

Example 2: Keeping Data Step Variables

library(libr)

# Keep and order output columns 
df <- datastep(mtcars[1:10,], 
  keep = c("mpg", "cyl", "disp", "mpgcat", "recdt"), {

    if (mpg >= 20) 
      mpgcat <- "High"
    else 
      mpgcat <- "Low"
      
    recdt <- as.Date("1974-06-10")
    
    if (cyl == 8)
      is8cyl <- TRUE

  })
  
df
#                    mpg cyl  disp mpgcat      recdt
# Mazda RX4         21.0   6 160.0   High 1974-06-10
# Mazda RX4 Wag     21.0   6 160.0   High 1974-06-10
# Datsun 710        22.8   4 108.0   High 1974-06-10
# Hornet 4 Drive    21.4   6 258.0   High 1974-06-10
# Hornet Sportabout 18.7   8 360.0    Low 1974-06-10
# Valiant           18.1   6 225.0    Low 1974-06-10
# Duster 360        14.3   8 360.0    Low 1974-06-10
# Merc 240D         24.4   4 146.7   High 1974-06-10
# Merc 230          22.8   4 140.8   High 1974-06-10
# Merc 280          19.2   6 167.6    Low 1974-06-10

The Retain Parameter

The retain parameter allows you to define variables that will be seeded with the value from the previous step. The retain option is useful for creating cumulative values or for performing conditions based on the value of the previous row.

Example 3: Drop, Retain, and Rename Parameters

library(libr)

df <- datastep(mtcars[1:10, ],
               drop = c("disp", "hp", "drat", "qsec",
                        "vs", "am", "gear", "carb"),
               retain = list(cumwt = 0 ),
               rename = c(mpg = "MPG", cyl = "Cylinders", wt = "Wgt",
                          cumwt = "Cumulative Wgt"), {

  cumwt <- cumwt + wt

 })

df
#                    MPG Cylinders   Wgt Cumulative Wgt
# Mazda RX4         21.0         6 2.620          2.620
# Mazda RX4 Wag     21.0         6 2.875          5.495
# Datsun 710        22.8         4 2.320          7.815
# Hornet 4 Drive    21.4         6 3.215         11.030
# Hornet Sportabout 18.7         8 3.440         14.470
# Valiant           18.1         6 3.460         17.930
# Duster 360        14.3         8 3.570         21.500
# Merc 240D         24.4         4 3.190         24.690
# Merc 230          22.8         4 3.150         27.840
# Merc 280          19.2         6 3.440         31.280

By Group Processing

The datastep() function also has the capabilities of performing by-group processing. A by-group is accomplished using the by parameter, and passing a vector of column names that define the group. Once a by-group is defined, the first. and last. automatic variables become active, which allow you to identify the boundaries between groups. Note that, by default, your data must be sorted properly before sending it into the data step. To turn the sort check off, set the sort_check parameter to FALSE.

Example 4: By Groups

library(libr)

# Identify start and end of by-groups
df <- datastep(mtcars[1:10,], 
  keep = c("mpg", "cyl", "gear", "grp"), 
  by = c("gear"), sort_check = FALSE, {

    if (first. & last.)
      grp <- "Start - End"
    else if (first.)
      grp <- "Start"
    else if (last.)
      grp <- "End"
    else 
      grp <- "-"

  })
  
df
#                    mpg cyl gear   grp
# Mazda RX4         21.0   6    4 Start
# Mazda RX4 Wag     21.0   6    4     -
# Datsun 710        22.8   4    4   End
# Hornet 4 Drive    21.4   6    3 Start
# Hornet Sportabout 18.7   8    3     -
# Valiant           18.1   6    3     -
# Duster 360        14.3   8    3   End
# Merc 240D         24.4   4    4 Start
# Merc 230          22.8   4    4     -
# Merc 280          19.2   6    4   End

Using Summary Functions

There may be times when you want to combine row-by-row conditional processing with column-by-column vector operations. For example, let’s say you want to calculate a mean and then perform conditional processing on that mean. This situation can be handled using the calculate parameter on the datastep() function. The function will execute the calculate block first, add any assigned variables to the data frame, and then execute the data step. Below is an example of such a scenario:

Example 5: Calculate Block

library(libr)

# Categorize mpg as above or below the mean
df <- datastep(mtcars, 
  keep = c("mpg", "cyl", "mean_mpg", "mpgcat"), 
  calculate = { mean_mpg = mean(mpg) },
  {

    if (mpg >= mean_mpg)
      mpgcat <- "High"
    else 
      mpgcat <- "Low"

  })
  
df[1:10,]
#                    mpg cyl mean_mpg mpgcat
# Mazda RX4         21.0   6 20.09062   High
# Mazda RX4 Wag     21.0   6 20.09062   High
# Datsun 710        22.8   4 20.09062   High
# Hornet 4 Drive    21.4   6 20.09062   High
# Hornet Sportabout 18.7   8 20.09062    Low
# Valiant           18.1   6 20.09062    Low
# Duster 360        14.3   8 20.09062    Low
# Merc 240D         24.4   4 20.09062   High
# Merc 230          22.8   4 20.09062   High
# Merc 280          19.2   6 20.09062    Low

Data Steps with dplyr

Note that the datastep() function is pipe-friendly, and can be combined with dplyr functions in a data pipeline. Also note that the datastep() function will recognize any group attributes added by the group_by() function. Therefore, within a dplyr pipeline, it is not necessary to use any datastep parameters. The following example recreates the above data frame from Example 5, but with a dplyr pipeline.

Example 6: Data Pipeline

library(libr)
library(dplyr)
library(magrittr)

# Add datastep to dplyr pipeline
df <- mtcars %>% 
  select(mpg, cyl, gear) %>% 
  mutate(mean_mpg = mean(mpg)) %>% 
  datastep({

    if (mpg >= mean_mpg)
      mpgcat <- "High"
    else 
      mpgcat <- "Low"

  }) %>% 
  filter(row_number() <= 10)
  
df
#     mpg cyl gear mean_mpg mpgcat
# 1  21.0   6    4 20.09062   High
# 2  21.0   6    4 20.09062   High
# 3  22.8   4    4 20.09062   High
# 4  21.4   6    3 20.09062   High
# 5  18.7   8    3 20.09062    Low
# 6  18.1   6    3 20.09062    Low
# 7  14.3   8    3 20.09062    Low
# 8  24.4   4    4 20.09062   High
# 9  22.8   4    4 20.09062   High
# 10 19.2   6    4 20.09062    Low

Data Attributes

The libr package recognizes several useful data attributes that are not normally recognized by other R functions. For example, it is very convenient to assign label and description attributes to your columns, so other people can understand what data the columns contain.

For this reason, the datastep() function provides an attrib parameter that allows you to supply such attributes as part of a data step. Attributes are assigned with a named list and the dsattr() object.

Example 7: Attributes

library(libr)

# Assign label attributes to all columns
df <- datastep(mtcars[1:10, ], 
               keep = c("mpg", "cyl", "mpgcat"),
               calculate = { mean_mpg = mean(mpg) },
               attrib = list(mpg = dsattr(label = "Miles Per Gallon"),
                             cyl = dsattr(label = "Cylinders"),
                             mpgcat = dsattr(label = "Mileage Category")), {

    if (mpg >= mean_mpg)
      mpgcat <- "High"
    else 
      mpgcat <- "Low"

  }) 

# View attributes in dictionary
dictionary(df)
# # A tibble: 3 x 10
#   Name  Column Class     Label            Description Format Width Justify  Rows   NAs
#   <chr> <chr>  <chr>     <chr>            <chr>       <lgl>  <int> <chr>   <int> <int>
# 1 df    mpg    numeric   Miles Per Gallon NA          NA        NA NA         10     0
# 2 df    cyl    numeric   Cylinders        NA          NA        NA NA         10     0
# 3 df    mpgcat character Mileage Category NA          NA         4 NA         10     0

Data Step Array

As mentioned previously, R typically operates in a column-wise manner. That is, R processes data column-by-column. But what if you need to get a sum or mean across a row?

This situation is what led to the development of the data step array. The data step array allows you to define a list of columns and
iterate over the list inside a data step. Data step arrays are defined with the arrays parameter, which accepts a named list of dsarray() objects.

To see the array in action, we’ll use the AirPassengers sample data. This data shows international airline passengers by month between 1949 and 1960. The data looks like this:

AirPassengers
#      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
# 1949 112 118 132 129 121 135 148 148 136 119 104 118
# 1950 115 126 141 135 125 149 170 170 158 133 114 140
# 1951 145 150 178 163 172 178 199 199 184 162 146 166
# 1952 171 180 193 181 183 218 230 242 209 191 172 194
# 1953 196 196 236 235 229 243 264 272 237 211 180 201
# 1954 204 188 235 227 234 264 302 293 259 229 203 229
# 1955 242 233 267 269 270 315 364 347 312 274 237 278
# 1956 284 277 317 313 318 374 413 405 355 306 271 306
# 1957 315 301 356 348 355 422 465 467 404 347 305 336
# 1958 340 318 362 348 363 435 491 505 404 359 310 337
# 1959 360 342 406 396 420 472 548 559 463 407 362 405
# 1960 417 391 419 461 472 535 622 606 508 461 390 432

This example illustrates how to create row totals, row means, and find the top month using a data step array. The array has an indexer to extract values. You can use the indexer to extract a single value or a subset of values. An empty indexer will return all the values in the array.

Example 8: Using a Data Step Array

library(libr)

# Create AirPassengers Data Frame
df <- as.data.frame(t(matrix(AirPassengers, 12,
                    dimnames = list(month.abb, seq(1949, 1960)))), 
                    stringsAsFactors = FALSE)

# Use datastep array to get year tot, mean, and top month
dat <- datastep(df,
                arrays = list(months = dsarray(names(df))),
                attrib = list(Tot = 0, Mean = 0, Top = ""),
                drop = "mth",
                {

                  Tot <- sum(months[])
                  Mean <- mean(months[])

                  for (mth in months) {
                    if (months[mth] == max(months[])) {
                      Top <- mth
                    }
                  }
                })


dat
#      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec  Tot     Mean Top
# 1949 112 118 132 129 121 135 148 148 136 119 104 118 1520 126.6667 Aug
# 1950 115 126 141 135 125 149 170 170 158 133 114 140 1676 139.6667 Aug
# 1951 145 150 178 163 172 178 199 199 184 162 146 166 2042 170.1667 Aug
# 1952 171 180 193 181 183 218 230 242 209 191 172 194 2364 197.0000 Aug
# 1953 196 196 236 235 229 243 264 272 237 211 180 201 2700 225.0000 Aug
# 1954 204 188 235 227 234 264 302 293 259 229 203 229 2867 238.9167 Jul
# 1955 242 233 267 269 270 315 364 347 312 274 237 278 3408 284.0000 Jul
# 1956 284 277 317 313 318 374 413 405 355 306 271 306 3939 328.2500 Jul
# 1957 315 301 356 348 355 422 465 467 404 347 305 336 4421 368.4167 Aug
# 1958 340 318 362 348 363 435 491 505 404 359 310 337 4572 381.0000 Aug
# 1959 360 342 406 396 420 472 548 559 463 407 362 405 5140 428.3333 Aug
# 1960 417 391 419 461 472 535 622 606 508 461 390 432 5714 476.1667 Jul

Next: Disclaimer