A common data manipulation need is: applying the same operation to a number of columns.
In the rquery
R
package we strongly recommend using value-oriented (or standard evaluation) for this task.
Here is what this looks like.
For our data lets build a simple data set similar to mtcars
:
library(wrapr)
library(rquery)
<- requireNamespace('rqdatatable', quietly = TRUE)
have_rqdatatable
<- wrapr::build_frame(
example_data "mpg" , "cyl", "disp", "car" |
21 , 6 , 160 , "Mazda RX4" |
21 , 6 , 160 , "Mazda RX4 Wag" |
22.8 , 4 , 108 , "Datsun 710" |
21.4 , 6 , 258 , "Hornet 4 Drive" |
18.7 , 8 , 360 , "Hornet Sportabout" |
18.1 , 6 , 225 , "Valiant" )
::kable(example_data) knitr
mpg | cyl | disp | car |
---|---|---|---|
21.0 | 6 | 160 | Mazda RX4 |
21.0 | 6 | 160 | Mazda RX4 Wag |
22.8 | 4 | 108 | Datsun 710 |
21.4 | 6 | 258 | Hornet 4 Drive |
18.7 | 8 | 360 | Hornet Sportabout |
18.1 | 6 | 225 | Valiant |
Now suppose for a number of columns we wish to perform a calculation, such centering it with respect to the grand average.
This is easily handled by first specifying the set of variables we wish to work with.
<- setdiff(colnames(example_data), 'car')
vars
print(vars)
## [1] "mpg" "cyl" "disp"
Now we build up what we want as a name-vector of strings using the :=
named map builder.
<- vars := paste0(vars, ' - mean(', vars, ')')
expressions
print(expressions)
## mpg cyl disp
## "mpg - mean(mpg)" "cyl - mean(cyl)" "disp - mean(disp)"
The idea is: the :=
operator fits into R
idiom by looking very much like a vectorized version of “names get assigned expressions”.
These expressions can then be used in an rquery
pipeline using the _se
-variant of extend()
: extend_se()
.
<- local_td(example_data) %.>%
ops extend_se(., expressions)
cat(format(ops))
## mk_td("example_data", c(
## "mpg",
## "cyl",
## "disp",
## "car")) %.>%
## extend(.,
## mpg := mpg - mean(mpg),
## cyl := cyl - mean(cyl),
## disp := disp - mean(disp))
And this operator pipeline is ready to use (assuming we have rqdatatable
available):
if(have_rqdatatable) {
%.>%
example_data %.>%
ops ::kable(.)
knitr }
mpg | cyl | disp | car |
---|---|---|---|
0.5 | 0 | -51.83333 | Mazda RX4 |
0.5 | 0 | -51.83333 | Mazda RX4 Wag |
2.3 | -2 | -103.83333 | Datsun 710 |
0.9 | 0 | 46.16667 | Hornet 4 Drive |
-1.8 | 2 | 148.16667 | Hornet Sportabout |
-2.4 | 0 | 13.16667 | Valiant |
The expression construction can also be done inside the extend_se()
operator.
<- local_td(example_data) %.>%
ops extend_se(., vars := paste0(vars, ' - mean(', vars, ')'))
cat(format(ops))
## mk_td("example_data", c(
## "mpg",
## "cyl",
## "disp",
## "car")) %.>%
## extend(.,
## mpg := mpg - mean(mpg),
## cyl := cyl - mean(cyl),
## disp := disp - mean(disp))
Note: the above is only a notional example to demonstrate the operations; for supervised machine learning we would probably use base::scale()
, which saves the learned centering for later re-use on new data.
The point is: we use standard R
tools to build up the lists of names and operations. We are not restricted to any single argument manipulation grammar.
For example we could build all interaction terms as follows.
<- t(combn(vars, 2))
combos <-
interactions paste0(combos[, 1], '_', combos[, 2]) :=
paste0(combos[, 1], ' * ', combos[, 2])
print(interactions)
## mpg_cyl mpg_disp cyl_disp
## "mpg * cyl" "mpg * disp" "cyl * disp"
<- local_td(example_data) %.>%
ops extend_se(., interactions)
cat(format(ops))
## mk_td("example_data", c(
## "mpg",
## "cyl",
## "disp",
## "car")) %.>%
## extend(.,
## mpg_cyl := mpg * cyl,
## mpg_disp := mpg * disp,
## cyl_disp := cyl * disp)
It is a critical advantage to work with sets of variables as standard values. In this case what we can do is limited only by the power of R
itself.
Note: we also supply an alias for :=
as %:=%
for those that don’t want to confuse this assignment with how the symbol is used in data.table
. Take care that :=
has the precedence-level of an assignment and %:=%
has the precedence-level of a user defined operator.
As, as always, our queries can be used on data.
if(have_rqdatatable) {
%.>%
example_data %.>%
ops ::kable(.)
knitr }
mpg | cyl | disp | car | mpg_cyl | mpg_disp | cyl_disp |
---|---|---|---|---|---|---|
21.0 | 6 | 160 | Mazda RX4 | 126.0 | 3360.0 | 960 |
21.0 | 6 | 160 | Mazda RX4 Wag | 126.0 | 3360.0 | 960 |
22.8 | 4 | 108 | Datsun 710 | 91.2 | 2462.4 | 432 |
21.4 | 6 | 258 | Hornet 4 Drive | 128.4 | 5521.2 | 1548 |
18.7 | 8 | 360 | Hornet Sportabout | 149.6 | 6732.0 | 2880 |
18.1 | 6 | 225 | Valiant | 108.6 | 4072.5 | 1350 |
Or even in a database.
<- requireNamespace("DBI", quietly = TRUE) &&
have_db requireNamespace("RSQLite", quietly = TRUE)
if(have_db) {
<- DBI::dbConnect(RSQLite::SQLite(),
raw_connection ":memory:")
::initExtension(raw_connection)
RSQLite<- rquery_db_info(
db connection = raw_connection,
is_dbi = TRUE,
connection_options = rq_connection_tests(raw_connection))
rq_copy_to(db, 'example_data',
example_data,temporary = TRUE,
overwrite = TRUE)
<- to_sql(ops, db)
sql
cat(format(sql))
}
## SELECT
## `mpg`,
## `cyl`,
## `disp`,
## `car`,
## `mpg` * `cyl` AS `mpg_cyl`,
## `mpg` * `disp` AS `mpg_disp`,
## `cyl` * `disp` AS `cyl_disp`
## FROM (
## SELECT
## `mpg`,
## `cyl`,
## `disp`,
## `car`
## FROM
## `example_data`
## ) tsql_73730651641561315786_0000000000
if(have_db) {
<- materialize(db, ops)
res_table ::dbReadTable(raw_connection, res_table$table_name) %.>%
DBI::kable(.)
knitr }
mpg | cyl | disp | car | mpg_cyl | mpg_disp | cyl_disp |
---|---|---|---|---|---|---|
21.0 | 6 | 160 | Mazda RX4 | 126.0 | 3360.0 | 960 |
21.0 | 6 | 160 | Mazda RX4 Wag | 126.0 | 3360.0 | 960 |
22.8 | 4 | 108 | Datsun 710 | 91.2 | 2462.4 | 432 |
21.4 | 6 | 258 | Hornet 4 Drive | 128.4 | 5521.2 | 1548 |
18.7 | 8 | 360 | Hornet Sportabout | 149.6 | 6732.0 | 2880 |
18.1 | 6 | 225 | Valiant | 108.6 | 4072.5 | 1350 |
if(have_db) {
::dbDisconnect(raw_connection)
DBI }