Whenever I start working on a dataset, I first take a glance at it to see what the data look like and whether the format is the one I am expecting. I found myself writing similar lines of code over and over again for each dataset I put my hands on, so I decided to put those lines together in an R package that I and others can use. I called it glancedata.
There are already some cool R packages for summarizing information. Two of the best, in my opinion, are skimr and GGally. In this vignette, I provide examples of the functions in glancedata as well as some of the functions in these two packages.
Below is a table with the functions shown in this vignette.
Package | Function | Description |
---|---|---|
skimr | skim | Alternative to summary. Friendly with dplyr::group_by(). |
glancedata | glance_data | Alternative to summary. Emphasizes missing data and binary variables. |
glancedata | glance_data_in_workbook | Similar to glance_data. Creates list of dataframes instead and saves XLSX file. |
glancedata | plot_numerical_vars | Creates a plot per numerical variable. It might be histogram, density plot, qqplot, violin plot or scatterplot. |
glancedata | plot_discrete_vars | Creates a plot per variable with up to 20 different values. This limit can be adjusted. |
GGally | ggpairs | Creates plots for pairwise comparison of columns. |
USArrests
I am going to use a built-in dataset in R. I added some columns to it so we can see what happens with different types of columns. You may load your own dataset instead of sample_data.
The example we are going to use is USArrests, which contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas, according to the dataset description (type help("USArrests") in the console to see more details).
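library(dplyr)
library(tidyr)
library(knitr)

# Combine state names and regions with state.x77 and USArrests (all ship with R).
# bind_cols() matches rows by position, not by state name.
sample_data <-
  tibble(State = state.name,
         Region = state.region) %>%
  bind_cols(as_tibble(state.x77)) %>%
  # Both state.x77 and USArrests have a Murder column, so the second one
  # ends up under a deduplicated name (Murder1 in the output shown here)
  bind_cols(USArrests)

# Show the first rows as a table
kable(head(sample_data))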
State | Region | Population | Income | Illiteracy | Life Exp | Murder | HS Grad | Frost | Area | Murder1 | Assault | UrbanPop | Rape |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Alabama | South | 3615 | 3624 | 2.1 | 69.05 | 15.1 | 41.3 | 20 | 50708 | 13.2 | 236 | 58 | 21.2 |
Alaska | West | 365 | 6315 | 1.5 | 69.31 | 11.3 | 66.7 | 152 | 566432 | 10.0 | 263 | 48 | 44.5 |
Arizona | West | 2212 | 4530 | 1.8 | 70.55 | 7.8 | 58.1 | 15 | 113417 | 8.1 | 294 | 80 | 31.0 |
Arkansas | South | 2110 | 3378 | 1.9 | 70.66 | 10.1 | 39.9 | 65 | 51945 | 8.8 | 190 | 50 | 19.5 |
California | West | 21198 | 5114 | 1.1 | 71.71 | 10.3 | 62.6 | 20 | 156361 | 9.0 | 276 | 91 | 40.6 |
Colorado | West | 2541 | 4884 | 0.7 | 72.06 | 6.8 | 63.9 | 166 | 103766 | 7.9 | 204 | 78 | 38.7 |
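The Murder1 column above appears because state.x77 and USArrests both contain a column named Murder, so bind_cols() has to deduplicate the name (the exact replacement name depends on the dplyr version). Below is a minimal sketch of one way to keep the origin of each column explicit. The Arrests_ prefix and the name sample_data_prefixed are illustrative assumptions, not part of the original example.

```r
library(dplyr)

# bind_cols() matches rows by position, so confirm that both sources
# list the states in the same order before combining them
stopifnot(identical(rownames(USArrests), state.name))

# Hypothetical variant of sample_data: prefix the USArrests columns
# so that no column name has to be deduplicated
sample_data_prefixed <-
  tibble(State = state.name,
         Region = state.region) %>%
  bind_cols(as_tibble(state.x77)) %>%
  bind_cols(rename_with(USArrests, ~ paste0("Arrests_", .x)))

# Inspect the resulting column names
names(sample_data_prefixed)
```

The rest of this vignette keeps using sample_data as built above, so the outputs that follow still show Murder1.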
skimr
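There are many useful packages for summarizing data, and skimr is one of them. Below is the output of its skim function for sample_data.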
Name | sample_data |
Number of rows | 50 |
Number of columns | 14 |
_______________________ | |
Column type frequency: | |
character | 1 |
factor | 1 |
numeric | 12 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
State | 0 | 1 | 4 | 14 | 0 | 50 | 0 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Region | 0 | 1 | FALSE | 4 | Sou: 16, Wes: 13, Nor: 12, Nor: 9 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Population | 0 | 1 | 4246.42 | 4464.49 | 365.00 | 1079.50 | 2838.50 | 4968.50 | 21198.0 | ▇▂▁▁▁ |
Income | 0 | 1 | 4435.80 | 614.47 | 3098.00 | 3992.75 | 4519.00 | 4813.50 | 6315.0 | ▃▅▇▂▁ |
Illiteracy | 0 | 1 | 1.17 | 0.61 | 0.50 | 0.62 | 0.95 | 1.58 | 2.8 | ▇▃▂▂▁ |
Life Exp | 0 | 1 | 70.88 | 1.34 | 67.96 | 70.12 | 70.67 | 71.89 | 73.6 | ▃▃▇▅▅ |
Murder | 0 | 1 | 7.38 | 3.69 | 1.40 | 4.35 | 6.85 | 10.67 | 15.1 | ▆▇▃▇▂ |
HS Grad | 0 | 1 | 53.11 | 8.08 | 37.80 | 48.05 | 53.25 | 59.15 | 67.3 | ▅▂▇▆▃ |
Frost | 0 | 1 | 104.46 | 51.98 | 0.00 | 66.25 | 114.50 | 139.75 | 188.0 | ▅▃▅▇▆ |
Area | 0 | 1 | 70735.88 | 85327.30 | 1049.00 | 36985.25 | 54277.00 | 81162.50 | 566432.0 | ▇▁▁▁▁ |
Murder1 | 0 | 1 | 7.79 | 4.36 | 0.80 | 4.08 | 7.25 | 11.25 | 17.4 | ▇▇▅▅▃ |
Assault | 0 | 1 | 170.76 | 83.34 | 45.00 | 109.00 | 159.00 | 249.00 | 337.0 | ▆▇▃▅▃ |
UrbanPop | 0 | 1 | 65.54 | 14.47 | 32.00 | 54.50 | 66.00 | 77.75 | 91.0 | ▁▆▇▅▆ |
Rape | 0 | 1 | 21.23 | 9.37 | 7.30 | 15.08 | 20.10 | 26.17 | 46.0 | ▆▇▅▂▂ |
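Output like the tables above comes from calling skim on a data frame. The sketch below shows that call and the grouped variant mentioned in the function table, where skim is combined with dplyr::group_by(). To stay self-contained it uses a small stand-in called states (a name made up for this example) built from datasets that ship with R, rather than sample_data itself.

```r
library(dplyr)
library(skimr)

# Stand-in for sample_data with one grouping column and two numeric columns
states <- tibble(Region     = state.region,
                 Income     = unname(state.x77[, "Income"]),
                 Illiteracy = unname(state.x77[, "Illiteracy"]))

# Overall summary of every column
skim(states)

# Grouped summary: one row per selected variable and Region
by_region <- states %>%
  group_by(Region) %>%
  skim(Income, Illiteracy)

by_region
```

Grouping this way makes it easy to compare the distribution of a variable across regions without writing a separate summarise() call for each statistic.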
glance_data
library(glancedata)
glance_data(sample_data)
#> # A tibble: 14 x 11
#> name type distinct_values minimum median maximum mean sd
#> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 State cate~ 50 NA NA NA NA NA
#> 2 Regi~ fact~ 4 NA NA NA NA NA
#> 3 Popu~ nume~ 50 365 2838. 2.12e4 4.25e3 4.46e+3
#> 4 Inco~ nume~ 50 3098 4519 6.32e3 4.44e3 6.14e+2
#> 5 Illi~ nume~ 20 0.5 0.95 2.80e0 1.17e0 6.10e-1
#> 6 Life~ nume~ 47 68.0 70.7 7.36e1 7.09e1 1.34e+0
#> 7 Murd~ nume~ 44 1.4 6.85 1.51e1 7.38e0 3.69e+0
#> 8 HS G~ nume~ 47 37.8 53.2 6.73e1 5.31e1 8.08e+0
#> 9 Frost nume~ 43 0 114. 1.88e2 1.04e2 5.20e+1
#> 10 Area nume~ 50 1049 54277 5.66e5 7.07e4 8.53e+4
#> 11 Murd~ nume~ 43 0.8 7.25 1.74e1 7.79e0 4.36e+0
#> 12 Assa~ nume~ 45 45 159 3.37e2 1.71e2 8.33e+1
#> 13 Urba~ nume~ 36 32 66 9.10e1 6.55e1 1.45e+1
#> 14 Rape nume~ 48 7.3 20.1 4.60e1 2.12e1 9.37e+0
#> # ... with 3 more variables: na_proportion <dbl>, count <chr>,
#> # sample_values <chr>
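To make the distinct_values and na_proportion columns more concrete, here is a rough base R sketch of how such columns could be computed by hand. It is not the implementation used by glance_data. It assumes that missing values are not counted as a distinct value, and it uses airquality (which ships with R and has missing values) plus a hypothetical binary column Hot added only for illustration.

```r
# Assumption: airquality stands in for a dataset with missing values
df <- airquality
df$Hot <- df$Temp > 80   # hypothetical binary column

# One row per column of df, with its class, number of distinct
# non-missing values, and proportion of missing values
overview <- data.frame(
  name            = names(df),
  type            = vapply(df, function(x) class(x)[1], character(1)),
  distinct_values = vapply(df, function(x) length(unique(x[!is.na(x)])), integer(1)),
  na_proportion   = vapply(df, function(x) mean(is.na(x)), numeric(1))
)
rownames(overview) <- NULL

# Columns that contain missing values
overview[overview$na_proportion > 0, ]

# Columns with exactly two distinct values (binary variables)
overview[overview$distinct_values == 2, ]
```

Filtering on these two columns is the kind of check that glance_data is meant to make quick: which variables have gaps, and which ones are effectively flags.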
glance_data_in_workbook
library(glancedata)
glance_data_in_workbook(sample_data)
#> $all
#> # A tibble: 14 x 11
#> name type distinct_values minimum median maximum mean sd
#> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 State cate~ 50 NA NA NA NA NA
#> 2 Regi~ fact~ 4 NA NA NA NA NA
#> 3 Popu~ nume~ 50 365 2838. 2.12e4 4.25e3 4.46e+3
#> 4 Inco~ nume~ 50 3098 4519 6.32e3 4.44e3 6.14e+2
#> 5 Illi~ nume~ 20 0.5 0.95 2.80e0 1.17e0 6.10e-1
#> 6 Life~ nume~ 47 68.0 70.7 7.36e1 7.09e1 1.34e+0
#> 7 Murd~ nume~ 44 1.4 6.85 1.51e1 7.38e0 3.69e+0
#> 8 HS G~ nume~ 47 37.8 53.2 6.73e1 5.31e1 8.08e+0
#> 9 Frost nume~ 43 0 114. 1.88e2 1.04e2 5.20e+1
#> 10 Area nume~ 50 1049 54277 5.66e5 7.07e4 8.53e+4
#> 11 Murd~ nume~ 43 0.8 7.25 1.74e1 7.79e0 4.36e+0
#> 12 Assa~ nume~ 45 45 159 3.37e2 1.71e2 8.33e+1
#> 13 Urba~ nume~ 36 32 66 9.10e1 6.55e1 1.45e+1
#> 14 Rape nume~ 48 7.3 20.1 4.60e1 2.12e1 9.37e+0
#> # ... with 3 more variables: na_proportion <dbl>, count <chr>,
#> # sample_values <chr>
#>
#> $summary
#> # A tibble: 2 x 2
#> cat n
#> <chr> <int>
#> 1 categorical 2
#> 2 numerical 12
#>
#> $all_nas
#> # A tibble: 0 x 6
#> # ... with 6 variables: name <chr>, type <chr>, distinct_values <int>,
#> # na_proportion <dbl>, count <chr>, sample_values <chr>
#>
#> $single_value
#> # A tibble: 0 x 11
#> # ... with 11 variables: name <chr>, type <chr>, distinct_values <int>,
#> # minimum <dbl>, median <dbl>, maximum <dbl>, mean <dbl>, sd <dbl>,
#> # na_proportion <dbl>, count <chr>, sample_values <chr>
#>
#> $binary
#> # A tibble: 0 x 11
#> # ... with 11 variables: name <chr>, type <chr>, distinct_values <int>,
#> # minimum <dbl>, median <dbl>, maximum <dbl>, mean <dbl>, sd <dbl>,
#> # na_proportion <dbl>, count <chr>, sample_values <chr>
#>
#> $numerical
#> # A tibble: 12 x 10
#> name type distinct_values minimum median maximum mean sd
#> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Popu~ nume~ 50 365 2.84e+3 2.12e4 4.25e3 4.46e+3
#> 2 Inco~ nume~ 50 3098 4.52e+3 6.32e3 4.44e3 6.14e+2
#> 3 Illi~ nume~ 20 0.5 9.50e-1 2.80e0 1.17e0 6.10e-1
#> 4 Life~ nume~ 47 68.0 7.07e+1 7.36e1 7.09e1 1.34e+0
#> 5 Murd~ nume~ 44 1.4 6.85e+0 1.51e1 7.38e0 3.69e+0
#> 6 HS G~ nume~ 47 37.8 5.32e+1 6.73e1 5.31e1 8.08e+0
#> 7 Frost nume~ 43 0 1.14e+2 1.88e2 1.04e2 5.20e+1
#> 8 Area nume~ 50 1049 5.43e+4 5.66e5 7.07e4 8.53e+4
#> 9 Murd~ nume~ 43 0.8 7.25e+0 1.74e1 7.79e0 4.36e+0
#> 10 Assa~ nume~ 45 45 1.59e+2 3.37e2 1.71e2 8.33e+1
#> 11 Urba~ nume~ 36 32 6.60e+1 9.10e1 6.55e1 1.45e+1
#> 12 Rape nume~ 48 7.3 2.01e+1 4.60e1 2.12e1 9.37e+0
#> # ... with 2 more variables: na_proportion <dbl>, sample_values <chr>
#>
#> $categorical
#> # A tibble: 2 x 5
#> name distinct_values na_proportion count sample_values
#> <chr> <int> <dbl> <chr> <chr>
#> 1 State 50 0 Too many unique ~ Alabama, Alaska, Arizo~
#> 2 Region 4 0 North Central: 1~ South, West, Northeast~
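Because the result is a list of data frames, each piece can be pulled out by name. The sketch below relies only on the element names visible in the output above (all, summary, all_nas, single_value, binary, numerical, categorical). It uses mtcars, which ships with R, as an assumed stand-in so that the snippet does not depend on sample_data. As described in the function table, the call also saves an XLSX file; check the function's help page for where it is written.

```r
library(glancedata)

# Assumption: mtcars stands in for your own data frame
workbook <- glance_data_in_workbook(mtcars)

# Names of the data frames held in the list
names(workbook)

# Count of variables of each kind
workbook$summary

# Variables flagged as binary
workbook$binary
```

Working with the list directly is handy when only one of the tables is needed, for example to pass the numerical variables on to a plotting step.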