Complete Example

Guillermo Basulto-Elias

Whenever I start working on a dataset, I need to take a glance at the data to see what they look like and whether the format is the one I am expecting. I found myself coding similar lines over and over again with each dataset I put my hands on, so I decided to put those lines together in an R package that I and others can use. I called it glancedata.

There are already some great R packages for summarizing information. Two of the best, in my opinion, are skimr and GGally. In this vignette, I provide examples of the functions in glancedata as well as some of the functions in these two packages.

Below is a table with the functions shown in this vignette.

Package	Function	Description
`skimr`	`skim`	Alternative to `summary`. Friendly with `dplyr::group_by()`.
`glancedata`	`glance_data`	Alternative to `summary`. Emphasizes missing data and binary variables.
`glancedata`	`glance_data_in_workbook`	Similar to `glance_data`. Creates a list of data frames instead and saves an XLSX file.
`glancedata`	`plot_numerical_vars`	Creates a plot per numerical variable. It might be a histogram, density plot, Q-Q plot, violin plot, or scatterplot.
`glancedata`	`plot_discrete_vars`	Creates a plot per variable with up to 20 different values. This limit can be adjusted.
`GGally`	`ggpairs`	Creates plots for pairwise comparison of columns.


Example (`USArrests`)

I am going to use a built-in dataset in R. I added some columns to it so we can see what happens with different types of columns. You may load your own dataset instead of sample_data.

The example we are going to use is USArrests, which contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. It also gives the percentage of the population living in urban areas, according to the dataset description (type help("USArrests") in the console to see more details).

library(dplyr)
library(tidyr)
library(knitr)

sample_data <- 
  tibble(State = state.name,
         Region = state.region) %>%
  bind_cols(as_tibble(state.x77)) %>%
  bind_cols(USArrests)

kable(head(sample_data))

State	Region	Population	Income	Illiteracy	Life Exp	Murder	HS Grad	Frost	Area	Murder1	Assault	UrbanPop	Rape
Alabama	South	3615	3624	2.1	69.05	15.1	41.3	20	50708	13.2	236	58	21.2
Alaska	West	365	6315	1.5	69.31	11.3	66.7	152	566432	10.0	263	48	44.5
Arizona	West	2212	4530	1.8	70.55	7.8	58.1	15	113417	8.1	294	80	31.0
Arkansas	South	2110	3378	1.9	70.66	10.1	39.9	65	51945	8.8	190	50	19.5
California	West	21198	5114	1.1	71.71	10.3	62.6	20	156361	9.0	276	91	40.6
Colorado	West	2541	4884	0.7	72.06	6.8	63.9	166	103766	7.9	204	78	38.7
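The Murder1 column in the preview comes from a name clash: both state.x77 and USArrests carry a column called Murder, so binding them side by side forces one of the two to be renamed. The sketch below checks this with base R only. The variable names are illustrative, and the exact name given to the duplicate column depends on the installed dplyr version.

```r
# Column names contributed by each built-in source
x77_columns <- colnames(state.x77)
arrest_columns <- names(USArrests)

# Names present in both sources; these clash when the tables are bound side by side
shared_columns <- intersect(x77_columns, arrest_columns)
print(shared_columns)

# Number of columns once State and Region are added to both sources
total_columns <- 2 + length(x77_columns) + length(arrest_columns)
print(total_columns)
```

Checking for shared names before calling bind_cols() makes it easier to rename the columns deliberately instead of relying on automatic name repair.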

Use of `skimr`

There are many useful packages for summarizing data. Here is skimr applied to sample_data.

## Load package
library(skimr)

## Call main function
skim(sample_data)

Data summary
Name	sample_data
Number of rows	50
Number of columns	14
_______________________
Column type frequency:
character	1
factor	1
numeric	12
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
State	0	1	4	14	0	50	0

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
Region	0	1	FALSE	4	Sou: 16, Wes: 13, Nor: 12, Nor: 9

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Population	1	4246.42	4464.49	365.00	1079.50	2838.50	4968.50	21198.0	▇▂▁▁▁
Income	1	4435.80	614.47	3098.00	3992.75	4519.00	4813.50	6315.0	▃▅▇▂▁
Illiteracy	1	1.17	0.61	0.50	0.62	0.95	1.58	2.8	▇▃▂▂▁
Life Exp	1	70.88	1.34	67.96	70.12	70.67	71.89	73.6	▃▃▇▅▅
Murder	1	7.38	3.69	1.40	4.35	6.85	10.67	15.1	▆▇▃▇▂
HS Grad	1	53.11	8.08	37.80	48.05	53.25	59.15	67.3	▅▂▇▆▃
Frost	1	104.46	51.98	0.00	66.25	114.50	139.75	188.0	▅▃▅▇▆
Area	1	70735.88	85327.30	1049.00	36985.25	54277.00	81162.50	566432.0	▇▁▁▁▁
Murder1	1	7.79	4.36	0.80	4.08	7.25	11.25	17.4	▇▇▅▅▃
Assault	1	170.76	83.34	45.00	109.00	159.00	249.00	337.0	▆▇▃▅▃
UrbanPop	1	65.54	14.47	32.00	54.50	66.00	77.75	91.0	▁▆▇▅▆
Rape	1	21.23	9.37	7.30	15.08	20.10	26.17	46.0	▆▇▅▂▂
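The table at the top of this vignette notes that skim is friendly with dplyr::group_by(). As a minimal sketch of that, the code below summarizes the USArrests columns within each region. It assumes, as the construction of sample_data does, that the rows of USArrests follow the same state order as state.region. The object names are illustrative.

```r
library(dplyr)
library(skimr)

# Attach the census region to each state's arrest statistics
arrests_with_region <-
  tibble(Region = state.region) %>%
  bind_cols(as_tibble(USArrests))

# skim() respects the grouping and summarizes every column within each region
grouped_summary <-
  arrests_with_region %>%
  group_by(Region) %>%
  skim()

grouped_summary
```

The result keeps Region as a column, so it can be filtered or arranged like any other data frame.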

`glance_data`

library(glancedata)
glance_data(sample_data)
#> # A tibble: 14 x 11
#>    name  type  distinct_values minimum  median maximum   mean      sd
#>    <chr> <chr>           <int>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
#>  1 State cate~              50      NA      NA      NA     NA      NA
#>  2 Regi~ fact~               4      NA      NA      NA     NA      NA
#>  3 Popu~ nume~              50   365     2838.  2.12e4 4.25e3 4.46e+3
#>  4 Inco~ nume~              50  3098     4519   6.32e3 4.44e3 6.14e+2
#>  5 Illi~ nume~              20     0.5      0.95 2.80e0 1.17e0 6.10e-1
#>  6 Life~ nume~              47    68.0     70.7 7.36e1 7.09e1 1.34e+0
#>  7 Murd~ nume~              44     1.4      6.85 1.51e1 7.38e0 3.69e+0
#>  8 HS G~ nume~              47    37.8     53.2 6.73e1 5.31e1 8.08e+0
#>  9 Frost nume~              43     0      114.  1.88e2 1.04e2 5.20e+1
#> 10 Area  nume~              50  1049    54277   5.66e5 7.07e4 8.53e+4
#> 11 Murd~ nume~              43     0.8      7.25 1.74e1 7.79e0 4.36e+0
#> 12 Assa~ nume~              45    45      159   3.37e2 1.71e2 8.33e+1
#> 13 Urba~ nume~              36    32       66   9.10e1 6.55e1 1.45e+1
#> 14 Rape  nume~              48     7.3     20.1 4.60e1 2.12e1 9.37e+0
#> # ... with 3 more variables: na_proportion <dbl>, count <chr>,
#> #   sample_values <chr>
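sample_data has no missing values and no binary columns, so the output above does not show much of what glance_data emphasizes. The base R sketch below computes comparable per-column quantities on a small made-up data frame. It is not the implementation used by glancedata, only an illustration of what columns such as na_proportion and distinct_values describe. The data frame demo and the choice to ignore NA when counting distinct values are assumptions made for this example.

```r
# Small made-up data frame with missing values and one binary column
demo <- data.frame(
  id    = 1:6,
  flag  = c("yes", "no", "yes", NA, "no", "yes"),
  score = c(1.5, NA, NA, 4, 5, 6)
)

# Share of missing values in each column
na_proportion <- vapply(demo, function(col) mean(is.na(col)), numeric(1))

# Number of distinct non-missing values in each column (assumption: NA is not counted)
distinct_values <- vapply(
  demo,
  function(col) length(unique(col[!is.na(col)])),
  integer(1)
)

# A column is treated as binary here when it has exactly two distinct values
is_binary <- distinct_values == 2

data.frame(
  name = names(demo),
  na_proportion = na_proportion,
  distinct_values = distinct_values,
  is_binary = is_binary,
  row.names = NULL
)
```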

`glance_data_in_workbook`

library(glancedata)
glance_data_in_workbook(sample_data)
#> $all
#> # A tibble: 14 x 11
#>    name  type  distinct_values minimum  median maximum   mean      sd
#>    <chr> <chr>           <int>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
#>  1 State cate~              50      NA      NA      NA     NA      NA
#>  2 Regi~ fact~               4      NA      NA      NA     NA      NA
#>  3 Popu~ nume~              50   365     2838.  2.12e4 4.25e3 4.46e+3
#>  4 Inco~ nume~              50  3098     4519   6.32e3 4.44e3 6.14e+2
#>  5 Illi~ nume~              20     0.5      0.95 2.80e0 1.17e0 6.10e-1
#>  6 Life~ nume~              47    68.0     70.7 7.36e1 7.09e1 1.34e+0
#>  7 Murd~ nume~              44     1.4      6.85 1.51e1 7.38e0 3.69e+0
#>  8 HS G~ nume~              47    37.8     53.2 6.73e1 5.31e1 8.08e+0
#>  9 Frost nume~              43     0      114.  1.88e2 1.04e2 5.20e+1
#> 10 Area  nume~              50  1049    54277   5.66e5 7.07e4 8.53e+4
#> 11 Murd~ nume~              43     0.8      7.25 1.74e1 7.79e0 4.36e+0
#> 12 Assa~ nume~              45    45      159   3.37e2 1.71e2 8.33e+1
#> 13 Urba~ nume~              36    32       66   9.10e1 6.55e1 1.45e+1
#> 14 Rape  nume~              48     7.3     20.1 4.60e1 2.12e1 9.37e+0
#> # ... with 3 more variables: na_proportion <dbl>, count <chr>,
#> #   sample_values <chr>
#>
#> $summary
#> # A tibble: 2 x 2
#>   cat             n
#>   <chr>       <int>
#> 1 categorical     2
#> 2 numerical      12
#>
#> $all_nas
#> # A tibble: 0 x 6
#> # ... with 6 variables: name <chr>, type <chr>, distinct_values <int>,
#> #   na_proportion <dbl>, count <chr>, sample_values <chr>
#>
#> $single_value
#> # A tibble: 0 x 11
#> # ... with 11 variables: name <chr>, type <chr>, distinct_values <int>,
#> #   minimum <dbl>, median <dbl>, maximum <dbl>, mean <dbl>, sd <dbl>,
#> #   na_proportion <dbl>, count <chr>, sample_values <chr>
#>
#> $binary
#> # A tibble: 0 x 11
#> # ... with 11 variables: name <chr>, type <chr>, distinct_values <int>,
#> #   minimum <dbl>, median <dbl>, maximum <dbl>, mean <dbl>, sd <dbl>,
#> #   na_proportion <dbl>, count <chr>, sample_values <chr>
#>
#> $numerical
#> # A tibble: 12 x 10
#>    name  type  distinct_values minimum  median maximum   mean      sd
#>    <chr> <chr>           <int>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
#>  1 Popu~ nume~              50   365   2.84e+3  2.12e4 4.25e3 4.46e+3
#>  2 Inco~ nume~              50  3098   4.52e+3  6.32e3 4.44e3 6.14e+2
#>  3 Illi~ nume~              20     0.5 9.50e-1  2.80e0 1.17e0 6.10e-1
#>  4 Life~ nume~              47    68.0 7.07e+1  7.36e1 7.09e1 1.34e+0
#>  5 Murd~ nume~              44     1.4 6.85e+0  1.51e1 7.38e0 3.69e+0
#>  6 HS G~ nume~              47    37.8 5.32e+1  6.73e1 5.31e1 8.08e+0
#>  7 Frost nume~              43     0   1.14e+2  1.88e2 1.04e2 5.20e+1
#>  8 Area  nume~              50  1049   5.43e+4  5.66e5 7.07e4 8.53e+4
#>  9 Murd~ nume~              43     0.8 7.25e+0  1.74e1 7.79e0 4.36e+0
#> 10 Assa~ nume~              45    45   1.59e+2  3.37e2 1.71e2 8.33e+1
#> 11 Urba~ nume~              36    32   6.60e+1  9.10e1 6.55e1 1.45e+1
#> 12 Rape  nume~              48     7.3 2.01e+1  4.60e1 2.12e1 9.37e+0
#> # ... with 2 more variables: na_proportion <dbl>, sample_values <chr>
#>
#> $categorical
#> # A tibble: 2 x 5
#>   name   distinct_values na_proportion count             sample_values
#>   <chr>            <int>         <dbl> <chr>             <chr>
#> 1 State               50             0 Too many unique ~ Alabama, Alaska, Arizo~
#> 2 Region               4             0 North Central: 1~ South, West, Northeast~
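Because the result is a named list of tibbles, individual pieces can be pulled out with `$`. The sketch below stores the list and looks at a few of the elements shown in the output above. Since the function also saves an XLSX file, the call is run from a temporary directory; that the file is written relative to the working directory is an assumption made here, not something this vignette states.

```r
library(dplyr)
library(glancedata)

# Rebuild the sample data used in this vignette
sample_data <-
  tibble(State = state.name,
         Region = state.region) %>%
  bind_cols(as_tibble(state.x77)) %>%
  bind_cols(USArrests)

# Assumption: the XLSX file is written relative to the working directory,
# so switch to a temporary directory for the duration of the call
old_wd <- setwd(tempdir())
workbook <- glance_data_in_workbook(sample_data)
setwd(old_wd)

# Each element is a regular tibble and can be inspected on its own
names(workbook)
workbook$summary
workbook$categorical
```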
