Coming from ‘tidyverse’

Introduction

{datawizard} package aims to make basic data wrangling easier than with base R. The data wrangling workflow it supports is similar to the one supported by the tidyverse package combination of {dplyr} and {tidyr}. However, one of its main features is that it has a very few dependencies: {stats} and {utils} (included in base R) and {insight}, which is the core package of the easystats ecosystem. This package grew organically to simultaneously satisfy the “0 non-base hard dependency” principle of easystats and the data wrangling needs of the constituent packages in this ecosystem.

One drawback of this genesis is that not all features of the {tidyverse} packages are supported since only features that were necessary for easystats ecosystem have been implemented. Some of these missing features (such as summarize or the pipe operator %>%) are made available in other dependency-free packages, such as {poorman}. It is also important to note that {datawizard} was designed to avoid namespace collisions with {tidyverse} packages.

In this article, we will see how to go through basic data wrangling steps with {datawizard}. We will also compare it to the {tidyverse} syntax for achieving the same. This way, if you decide to make the switch, you can easily find the translations here. This vignette is largely inspired from {dplyr}’s Getting started vignette.

library(dplyr)
library(tidyr)
library(datawizard)

Workhorses

Before we look at their tidyverse equivalents, we can first have a look at {datawizard}’s key functions for data wrangling:

Function Operation
data_filter() to select only certain observations
data_select() to select only a few variables
data_extract() to extract a single variable
data_rename() to rename variables
data_relocate() to reorder a data frame
data_to_long() to convert data from wide to long
data_to_wide() to convert data from long to wide
data_join() to join two data frames

Note that there are a few functions in {datawizard} that have no strict equivalent in {dplyr} or {tidyr} (e.g data_rotate()), and so we won’t discuss them in the next section.

Equivalence with {dplyr} / {tidyr}

Before we look at them individually, let’s first have a look at the summary table of this equivalence.

Function Tidyverse equivalent(s)
data_filter() dplyr::filter()
data_select() dplyr::select()
data_extract() dplyr::pull()
data_rename() dplyr::rename()
data_relocate() dplyr::relocate()
data_to_long() tidyr::pivot_longer()
data_to_wide() tidyr::pivot_wider()
data_join() dplyr::inner_join(), dplyr::left_join(), dplyr::right_join(),
dplyr::full_join(), dplyr::anti_join(), dplyr::semi_join()

Filtering

data_filter() is a wrapper around subset(). Therefore, if you want to have several filtering conditions, you need to use &. Separating the conditions with a comma (as in dplyr::filter()) will not work; it will only apply the first condition.

# ---------- datawizard -----------
starwars %>%
  data_filter(skin_color == "light" &
    eye_color == "brown")
# ---------- tidyverse -----------
starwars %>%
  filter(
    skin_color == "light",
    eye_color == "brown"
  )
## # A tibble: 7 × 14
##   name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
## 1 Leia Organa     150    49 brown   light   brown        19 fema… femin… Aldera…
## 2 Biggs Darkl…    183    84 black   light   brown        24 male  mascu… Tatooi…
## 3 Cordé           157    NA brown   light   brown        NA fema… femin… Naboo  
## 4 Dormé           165    NA brown   light   brown        NA fema… femin… Naboo  
## 5 Raymus Anti…    188    79 brown   light   brown        NA male  mascu… Aldera…
## 6 Poe Dameron      NA    NA brown   light   brown        NA male  mascu… <NA>   
## 7 Padmé Amida…    165    45 brown   light   brown        46 fema… femin… Naboo  
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,
## #   ³​eye_color, ⁴​birth_year, ⁵​homeworld
## # ℹ Use `colnames()` to see all variable names

Selecting

data_select() is the equivalent of dplyr::select(). The main difference between these two functions is that data_select() uses two arguments (select and exclude) and requires quoted column names if we want to select several variables, while dplyr::select() accepts any unquoted column names.

# ---------- datawizard -----------
starwars %>%
  data_select(select = c("hair_color", "skin_color", "eye_color"))
# ---------- tidyverse -----------
starwars %>%
  select(hair_color, skin_color, eye_color)
## # A tibble: 6 × 3
##   hair_color  skin_color  eye_color
## * <chr>       <chr>       <chr>    
## 1 blond       fair        blue     
## 2 <NA>        gold        yellow   
## 3 <NA>        white, blue red      
## 4 none        white       yellow   
## 5 brown       light       brown    
## 6 brown, grey light       blue
# ---------- datawizard -----------
starwars %>%
  data_select(select = -ends_with("color"))
# ---------- tidyverse -----------
starwars %>%
  select(-ends_with("color"))
## # A tibble: 6 × 11
##   name   height  mass birth…¹ sex   gender homew…² species films vehic…³ stars…⁴
## * <chr>   <int> <dbl>   <dbl> <chr> <chr>  <chr>   <chr>   <lis> <list>  <list> 
## 1 Luke …    172    77    19   male  mascu… Tatooi… Human   <chr> <chr>   <chr>  
## 2 C-3PO     167    75   112   none  mascu… Tatooi… Droid   <chr> <chr>   <chr>  
## 3 R2-D2      96    32    33   none  mascu… Naboo   Droid   <chr> <chr>   <chr>  
## 4 Darth…    202   136    41.9 male  mascu… Tatooi… Human   <chr> <chr>   <chr>  
## 5 Leia …    150    49    19   fema… femin… Aldera… Human   <chr> <chr>   <chr>  
## 6 Owen …    178   120    52   male  mascu… Tatooi… Human   <chr> <chr>   <chr>  
## # … with abbreviated variable names ¹​birth_year, ²​homeworld, ³​vehicles,
## #   ⁴​starships
# ---------- datawizard -----------
starwars %>%
  data_select(select = -hair_color:eye_color)
# ---------- tidyverse -----------
starwars %>%
  select(!(hair_color:eye_color))
## # A tibble: 6 × 11
##   name   height  mass birth…¹ sex   gender homew…² species films vehic…³ stars…⁴
## * <chr>   <int> <dbl>   <dbl> <chr> <chr>  <chr>   <chr>   <lis> <list>  <list> 
## 1 Luke …    172    77    19   male  mascu… Tatooi… Human   <chr> <chr>   <chr>  
## 2 C-3PO     167    75   112   none  mascu… Tatooi… Droid   <chr> <chr>   <chr>  
## 3 R2-D2      96    32    33   none  mascu… Naboo   Droid   <chr> <chr>   <chr>  
## 4 Darth…    202   136    41.9 male  mascu… Tatooi… Human   <chr> <chr>   <chr>  
## 5 Leia …    150    49    19   fema… femin… Aldera… Human   <chr> <chr>   <chr>  
## 6 Owen …    178   120    52   male  mascu… Tatooi… Human   <chr> <chr>   <chr>  
## # … with abbreviated variable names ¹​birth_year, ²​homeworld, ³​vehicles,
## #   ⁴​starships
# ---------- datawizard -----------
starwars %>%
  data_select(exclude = regex("color$"))
# ---------- tidyverse -----------
starwars %>%
  select(-contains("color$"))
## # A tibble: 6 × 11
##   name   height  mass birth…¹ sex   gender homew…² species films vehic…³ stars…⁴
## * <chr>   <int> <dbl>   <dbl> <chr> <chr>  <chr>   <chr>   <lis> <list>  <list> 
## 1 Luke …    172    77    19   male  mascu… Tatooi… Human   <chr> <chr>   <chr>  
## 2 C-3PO     167    75   112   none  mascu… Tatooi… Droid   <chr> <chr>   <chr>  
## 3 R2-D2      96    32    33   none  mascu… Naboo   Droid   <chr> <chr>   <chr>  
## 4 Darth…    202   136    41.9 male  mascu… Tatooi… Human   <chr> <chr>   <chr>  
## 5 Leia …    150    49    19   fema… femin… Aldera… Human   <chr> <chr>   <chr>  
## 6 Owen …    178   120    52   male  mascu… Tatooi… Human   <chr> <chr>   <chr>  
## # … with abbreviated variable names ¹​birth_year, ²​homeworld, ³​vehicles,
## #   ⁴​starships
# ---------- datawizard -----------
starwars %>%
  data_select(select = is.numeric)
# ---------- tidyverse -----------
starwars %>%
  select(where(is.numeric))
## # A tibble: 6 × 3
##   height  mass birth_year
## *  <int> <dbl>      <dbl>
## 1    172    77       19  
## 2    167    75      112  
## 3     96    32       33  
## 4    202   136       41.9
## 5    150    49       19  
## 6    178   120       52

You can find a list of all the select helpers with ?data_select.

Extracting

Although we mostly work on dataframes, it is sometimes useful to extract a single column as a vector. This can be done with data_extract(), which reproduces the behavior of dplyr::pull():

# ---------- datawizard -----------
starwars %>%
  data_extract(gender)
# ---------- tidyverse -----------
starwars %>%
  pull(gender)
## [1] "masculine" "masculine" "masculine" "masculine" "feminine"  "masculine"

We can also specify several variables in select. In this case, data_extract() is equivalent to data_select():

starwars %>%
  data_extract(select = contains("color"))
## # A tibble: 6 × 3
##   hair_color  skin_color  eye_color
##   <chr>       <chr>       <chr>    
## 1 blond       fair        blue     
## 2 <NA>        gold        yellow   
## 3 <NA>        white, blue red      
## 4 none        white       yellow   
## 5 brown       light       brown    
## 6 brown, grey light       blue

Renaming

data_rename() is the equivalent of dplyr::rename() but the syntax between the two is different. While dplyr::rename() takes new-old pairs of column names, data_rename() requires a vector of column names to rename, and then a vector of new names for these columns that must be of the same length.

# ---------- datawizard -----------
starwars %>%
  data_rename(
    pattern = c("sex", "hair_color"),
    replacement = c("Sex", "Hair Color")
  )
# ---------- tidyverse -----------
starwars %>%
  rename(
    Sex = sex,
    "Hair Color" = hair_color
  )
## # A tibble: 6 × 14
##   name         height  mass Hair …¹ skin_…² eye_c…³ birth…⁴ Sex   gender homew…⁵
##   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
## 1 Luke Skywal…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
## 2 C-3PO           167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
## 3 R2-D2            96    32 <NA>    white,… red        33   none  mascu… Naboo  
## 4 Darth Vader     202   136 none    white   yellow     41.9 male  mascu… Tatooi…
## 5 Leia Organa     150    49 brown   light   brown      19   fema… femin… Aldera…
## 6 Owen Lars       178   120 brown,… light   blue       52   male  mascu… Tatooi…
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>, and abbreviated variable names ¹​`Hair Color`,
## #   ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld
## # ℹ Use `colnames()` to see all variable names

The way data_rename() is designed makes it easy to apply the same modifications to a vector of column names. For example, we can remove underscores and use TitleCase with the following code:

to_rename <- names(starwars)

starwars %>%
  data_rename(
    pattern = to_rename,
    replacement = tools::toTitleCase(gsub("_", " ", to_rename))
  )
## # A tibble: 6 × 14
##   Name         Height  Mass Hair …¹ Skin …² Eye C…³ Birth…⁴ Sex   Gender Homew…⁵
##   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
## 1 Luke Skywal…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
## 2 C-3PO           167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
## 3 R2-D2            96    32 <NA>    white,… red        33   none  mascu… Naboo  
## 4 Darth Vader     202   136 none    white   yellow     41.9 male  mascu… Tatooi…
## 5 Leia Organa     150    49 brown   light   brown      19   fema… femin… Aldera…
## 6 Owen Lars       178   120 brown,… light   blue       52   male  mascu… Tatooi…
## # … with 4 more variables: Species <chr>, Films <list>, Vehicles <list>,
## #   Starships <list>, and abbreviated variable names ¹​`Hair Color`,
## #   ²​`Skin Color`, ³​`Eye Color`, ⁴​`Birth Year`, ⁵​Homeworld
## # ℹ Use `colnames()` to see all variable names

It is also possible to add a prefix or a suffix to all or a subset of variables with data_addprefix() and data_addsuffix(). The argument select accepts all select helpers that we saw above with data_select():

starwars %>%
  data_addprefix(
    pattern = "OLD.",
    select = contains("color")
  ) %>%
  data_addsuffix(
    pattern = ".NEW",
    select = -contains("color")
  )
## # A tibble: 6 × 14
##   name.NEW       heigh…¹ mass.…² OLD.h…³ OLD.s…⁴ OLD.e…⁵ birth…⁶ sex.NEW gende…⁷
##   <chr>            <int>   <dbl> <chr>   <chr>   <chr>     <dbl> <chr>   <chr>  
## 1 Luke Skywalker     172      77 blond   fair    blue       19   male    mascul…
## 2 C-3PO              167      75 <NA>    gold    yellow    112   none    mascul…
## 3 R2-D2               96      32 <NA>    white,… red        33   none    mascul…
## 4 Darth Vader        202     136 none    white   yellow     41.9 male    mascul…
## 5 Leia Organa        150      49 brown   light   brown      19   female  femini…
## 6 Owen Lars          178     120 brown,… light   blue       52   male    mascul…
## # … with 5 more variables: homeworld.NEW <chr>, species.NEW <chr>,
## #   films.NEW <list>, vehicles.NEW <list>, starships.NEW <list>, and
## #   abbreviated variable names ¹​height.NEW, ²​mass.NEW, ³​OLD.hair_color,
## #   ⁴​OLD.skin_color, ⁵​OLD.eye_color, ⁶​birth_year.NEW, ⁷​gender.NEW
## # ℹ Use `colnames()` to see all variable names

Relocating

Sometimes, we want to relocate one or a small subset of columns in the dataset. Rather than typing many names in data_select(), we can use data_relocate(), which is the equivalent of dplyr::relocate(). Just like data_select(), we can specify a list of variables we want to relocate with select and exclude. Then, the arguments before and after1 specify where the selected columns should be relocated:

# ---------- datawizard -----------
starwars %>%
  data_relocate(sex:homeworld, before = "height")
# ---------- tidyverse -----------
starwars %>%
  relocate(sex:homeworld, .before = height)
## # A tibble: 6 × 14
##   name         sex   gender homew…¹ height  mass hair_…² skin_…³ eye_c…⁴ birth…⁵
## * <chr>        <chr> <chr>  <chr>    <int> <dbl> <chr>   <chr>   <chr>     <dbl>
## 1 Luke Skywal… male  mascu… Tatooi…    172    77 blond   fair    blue       19  
## 2 C-3PO        none  mascu… Tatooi…    167    75 <NA>    gold    yellow    112  
## 3 R2-D2        none  mascu… Naboo       96    32 <NA>    white,… red        33  
## 4 Darth Vader  male  mascu… Tatooi…    202   136 none    white   yellow     41.9
## 5 Leia Organa  fema… femin… Aldera…    150    49 brown   light   brown      19  
## 6 Owen Lars    male  mascu… Tatooi…    178   120 brown,… light   blue       52  
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>, and abbreviated variable names ¹​homeworld, ²​hair_color,
## #   ³​skin_color, ⁴​eye_color, ⁵​birth_year
## # ℹ Use `colnames()` to see all variable names

In addition to column names, before and after accept column indices. Finally, one can use before = -1 to relocate the selected columns just before the last column, or after = -1 to relocate them after the last column.

# ---------- datawizard -----------
starwars %>%
  data_relocate(sex:homeworld, after = -1)
## # A tibble: 6 × 14
##   name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ species films vehic…⁵
## * <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr>   <lis> <list> 
## 1 Luke Skywa…    172    77 blond   fair    blue       19   Human   <chr> <chr>  
## 2 C-3PO          167    75 <NA>    gold    yellow    112   Droid   <chr> <chr>  
## 3 R2-D2           96    32 <NA>    white,… red        33   Droid   <chr> <chr>  
## 4 Darth Vader    202   136 none    white   yellow     41.9 Human   <chr> <chr>  
## 5 Leia Organa    150    49 brown   light   brown      19   Human   <chr> <chr>  
## 6 Owen Lars      178   120 brown,… light   blue       52   Human   <chr> <chr>  
## # … with 4 more variables: starships <list>, sex <chr>, gender <chr>,
## #   homeworld <chr>, and abbreviated variable names ¹​hair_color, ²​skin_color,
## #   ³​eye_color, ⁴​birth_year, ⁵​vehicles
## # ℹ Use `colnames()` to see all variable names

Reshaping

Longer

Reshaping data from wide to long or from long to wide format can be done with data_to_long() and data_to_wide(). These functions were designed to match tidyr::pivot_longer() and tidyr::pivot_wider() arguments, so that the only thing to do is to change the function name. However, not all of tidyr::pivot_longer() and tidyr::pivot_wider() features are available yet.

We will use the relig_income dataset, as in the {tidyr} vignette.

relig_income
## # A tibble: 18 × 11
##    religion      `<$10k` $10-2…¹ $20-3…² $30-4…³ $40-5…⁴ $50-7…⁵ $75-1…⁶ $100-…⁷
##    <chr>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 Agnostic           27      34      60      81      76     137     122     109
##  2 Atheist            12      27      37      52      35      70      73      59
##  3 Buddhist           27      21      30      34      33      58      62      39
##  4 Catholic          418     617     732     670     638    1116     949     792
##  5 Don’t know/r…      15      14      15      11      10      35      21      17
##  6 Evangelical …     575     869    1064     982     881    1486     949     723
##  7 Hindu               1       9       7       9      11      34      47      48
##  8 Historically…     228     244     236     238     197     223     131      81
##  9 Jehovah's Wi…      20      27      24      24      21      30      15      11
## 10 Jewish             19      19      25      25      30      95      69      87
## 11 Mainline Prot     289     495     619     655     651    1107     939     753
## 12 Mormon             29      40      48      51      56     112      85      49
## 13 Muslim              6       7       9      10       9      23      16       8
## 14 Orthodox           13      17      23      32      32      47      38      42
## 15 Other Christ…       9       7      11      13      13      14      18      14
## 16 Other Faiths       20      33      40      46      49      63      46      40
## 17 Other World …       5       2       3       4       2       7       3       4
## 18 Unaffiliated      217     299     374     365     341     528     407     321
## # … with 2 more variables: `>150k` <dbl>, `Don't know/refused` <dbl>, and
## #   abbreviated variable names ¹​`$10-20k`, ²​`$20-30k`, ³​`$30-40k`, ⁴​`$40-50k`,
## #   ⁵​`$50-75k`, ⁶​`$75-100k`, ⁷​`$100-150k`
## # ℹ Use `colnames()` to see all variable names

We would like to reshape this dataset to have 3 columns: religion, count, and income. The column “religion” doesn’t need to change, so we exclude it with -religion. Then, each remaining column corresponds to an income category. Therefore, we want to move all these column names to a single column called “income”. Finally, the values corresponding to each of these columns will be reshaped to be in a single new column, called “count”.

# ---------- datawizard -----------
relig_income %>%
  data_to_long(
    -religion,
    names_to = "income",
    values_to = "count"
  )
# ---------- tidyverse -----------
relig_income %>%
  pivot_longer(
    !religion,
    names_to = "income",
    values_to = "count"
  )
## # A tibble: 180 × 3
##    religion income             count
##  * <chr>    <chr>              <dbl>
##  1 Agnostic <$10k                 27
##  2 Agnostic $10-20k               34
##  3 Agnostic $20-30k               60
##  4 Agnostic $30-40k               81
##  5 Agnostic $40-50k               76
##  6 Agnostic $50-75k              137
##  7 Agnostic $75-100k             122
##  8 Agnostic $100-150k            109
##  9 Agnostic >150k                 84
## 10 Agnostic Don't know/refused    96
## # … with 170 more rows
## # ℹ Use `print(n = ...)` to see more rows

To explore a bit more the arguments of data_to_long(), we will use another dataset: the billboard dataset.

billboard
## # A tibble: 317 × 79
##    artist track date.ent…¹   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8   wk9
##    <chr>  <chr> <date>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 2 Pac  Baby… 2000-02-26    87    82    72    77    87    94    99    NA    NA
##  2 2Ge+h… The … 2000-09-02    91    87    92    NA    NA    NA    NA    NA    NA
##  3 3 Doo… Kryp… 2000-04-08    81    70    68    67    66    57    54    53    51
##  4 3 Doo… Loser 2000-10-21    76    76    72    69    67    65    55    59    62
##  5 504 B… Wobb… 2000-04-15    57    34    25    17    17    31    36    49    53
##  6 98^0   Give… 2000-08-19    51    39    34    26    26    19     2     2     3
##  7 A*Tee… Danc… 2000-07-08    97    97    96    95   100    NA    NA    NA    NA
##  8 Aaliy… I Do… 2000-01-29    84    62    51    41    38    35    35    38    38
##  9 Aaliy… Try … 2000-03-18    59    53    38    28    21    18    16    14    12
## 10 Adams… Open… 2000-08-26    76    76    74    69    68    67    61    58    57
## # … with 307 more rows, 67 more variables: wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## #   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## #   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## #   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## #   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## #   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
## #   wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
# ---------- datawizard -----------
billboard %>%
  data_to_long(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  )
# ---------- tidyverse -----------
billboard %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  )
## # A tibble: 5,307 × 5
##    artist  track                   date.entered week   rank
##  * <chr>   <chr>                   <date>       <chr> <dbl>
##  1 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk1      87
##  2 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk2      82
##  3 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk3      72
##  4 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk4      77
##  5 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk5      87
##  6 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk6      94
##  7 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk7      99
##  8 2Ge+her The Hardest Part Of ... 2000-09-02   wk1      91
##  9 2Ge+her The Hardest Part Of ... 2000-09-02   wk2      87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02   wk3      92
## # … with 5,297 more rows
## # ℹ Use `print(n = ...)` to see more rows

Wider

Once again, we use an example in the {tidyr} vignette to show how close data_to_wide() and pivot_wider() are:

fish_encounters
## # A tibble: 114 × 3
##    fish  station  seen
##    <fct> <fct>   <int>
##  1 4842  Release     1
##  2 4842  I80_1       1
##  3 4842  Lisbon      1
##  4 4842  Rstr        1
##  5 4842  Base_TD     1
##  6 4842  BCE         1
##  7 4842  BCW         1
##  8 4842  BCE2        1
##  9 4842  BCW2        1
## 10 4842  MAE         1
## # … with 104 more rows
## # ℹ Use `print(n = ...)` to see more rows
# ---------- datawizard -----------
fish_encounters %>%
  data_to_wide(
    names_from = "station",
    values_from = "seen",
    values_fill = 0
  )
# ---------- tidyverse -----------
fish_encounters %>%
  pivot_wider(
    names_from = station,
    values_from = seen,
    values_fill = 0
  )
## # A tibble: 19 × 12
##    fish  Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE   MAW
##  * <fct>   <dbl> <dbl>  <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 4842        1     1      1     1       1     1     1     1     1     1     1
##  2 4843        1     1      1     1       1     1     1     1     1     1     1
##  3 4844        1     1      1     1       1     1     1     1     1     1     1
##  4 4845        1     1      1     1       1     0     0     0     0     0     0
##  5 4847        1     1      1     0       0     0     0     0     0     0     0
##  6 4848        1     1      1     1       0     0     0     0     0     0     0
##  7 4849        1     1      0     0       0     0     0     0     0     0     0
##  8 4850        1     1      0     1       1     1     1     0     0     0     0
##  9 4851        1     1      0     0       0     0     0     0     0     0     0
## 10 4854        1     1      0     0       0     0     0     0     0     0     0
## 11 4855        1     1      1     1       1     0     0     0     0     0     0
## 12 4857        1     1      1     1       1     1     1     1     1     0     0
## 13 4858        1     1      1     1       1     1     1     1     1     1     1
## 14 4859        1     1      1     1       1     0     0     0     0     0     0
## 15 4861        1     1      1     1       1     1     1     1     1     1     1
## 16 4862        1     1      1     1       1     1     1     1     1     0     0
## 17 4863        1     1      0     0       0     0     0     0     0     0     0
## 18 4864        1     1      0     0       0     0     0     0     0     0     0
## 19 4865        1     1      1     0       0     0     0     0     0     0     0

Joining

In {datawizard}, joining datasets is done with data_join() (or its alias data_merge()). Contrary to {dplyr}, this unique function takes care of all types of join, which are then specified inside the function with the argument join (by default, join = "left").

Below, we show how to perform the four most common joins: full, left, right and inner. We will use the datasets band_membersand band_instruments provided by {dplyr}:

band_members
## # A tibble: 3 × 2
##   name  band   
##   <chr> <chr>  
## 1 Mick  Stones 
## 2 John  Beatles
## 3 Paul  Beatles
band_instruments
## # A tibble: 3 × 2
##   name  plays 
##   <chr> <chr> 
## 1 John  guitar
## 2 Paul  bass  
## 3 Keith guitar

Full join

# ---------- datawizard -----------
band_members %>%
  data_join(band_instruments, join = "full")
# ---------- tidyverse -----------
band_members %>%
  full_join(band_instruments)
## # A tibble: 4 × 3
##   name  band    plays 
## * <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass  
## 4 Keith <NA>    guitar

Left and right joins

# ---------- datawizard -----------
band_members %>%
  data_join(band_instruments, join = "left")
# ---------- tidyverse -----------
band_members %>%
  left_join(band_instruments)
## # A tibble: 3 × 3
##   name  band    plays 
## * <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass
# ---------- datawizard -----------
band_members %>%
  data_join(band_instruments, join = "right")
# ---------- tidyverse -----------
band_members %>%
  right_join(band_instruments)
## # A tibble: 3 × 3
##   name  band    plays 
## * <chr> <chr>   <chr> 
## 1 John  Beatles guitar
## 2 Paul  Beatles bass  
## 3 Keith <NA>    guitar

Inner join

# ---------- datawizard -----------
band_members %>%
  data_join(band_instruments, join = "inner")
# ---------- tidyverse -----------
band_members %>%
  inner_join(band_instruments)
## # A tibble: 2 × 3
##   name  band    plays 
## * <chr> <chr>   <chr> 
## 1 John  Beatles guitar
## 2 Paul  Beatles bass

Other useful functions

{datawizard} contains other functions that are not necessarily included in {dplyr} or {tidyr}. Some of them are inspired from the package janitor.

Work with rownames

We can convert a column in rownames and move rownames to a new column with rownames_as_column() and column_as_rownames():

mtcars <- head(mtcars)
mtcars
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
mtcars2 <- mtcars %>%
  rownames_as_column(var = "model")

mtcars2
##               model  mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 4    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
mtcars2 %>%
  column_as_rownames(var = "model")
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Work with column names

When dealing with messy data, it is sometimes useful to use a row as column names, and vice versa. This can be done with row_to_colnames() and colnames_to_row().

x <- data.frame(
  X_1 = c(NA, "Title", 1:3),
  X_2 = c(NA, "Title2", 4:6)
)
x
##     X_1    X_2
## 1  <NA>   <NA>
## 2 Title Title2
## 3     1      4
## 4     2      5
## 5     3      6
x2 <- x %>%
  row_to_colnames(row = 2)
x2
##   Title Title2
## 1  <NA>   <NA>
## 3     1      4
## 4     2      5
## 5     3      6
x2 %>%
  colnames_to_row()
##       x1     x2
## 1  Title Title2
## 11  <NA>   <NA>
## 3      1      4
## 4      2      5
## 5      3      6

  1. Note that we use before and after whereas dplyr::relocate() uses .before and .after.↩︎