The explore package simplifies Exploratory Data Analysis (EDA). Get faster insights with less code! We will use fewer than 10 lines of code and just 6 functions to explore the penguins dataset:
function | package | description |
---|---|---|
library() | {base} | load a package |
filter() | {dplyr} | subset rows using column values |
describe() | {explore} | describe variables of the table |
explore() | {explore} | explore a variable graphically |
explore_all() | {explore} | explore all variables of the table |
explain_tree() | {explore} | explain a target using a decision tree |
The penguins dataset comes with the palmerpenguins package. It has 344 observations and 8 variables. (https://github.com/allisonhorst/palmerpenguins)
So we first load the palmerpenguins package. We also use {dplyr} for filter() and %>%, and {explore} for the data exploration.
library(palmerpenguins)
library(dplyr)
library(explore)
penguins %>% describe()
#> # A tibble: 8 x 8
#> variable type na na_pct unique min mean max
#> <chr> <chr> <int> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 species fct 0 0 3 NA NA NA
#> 2 island fct 0 0 3 NA NA NA
#> 3 bill_length_mm dbl 2 0.6 165 32.1 43.9 59.6
#> 4 bill_depth_mm dbl 2 0.6 81 13.1 17.2 21.5
#> 5 flipper_length_mm int 2 0.6 56 172 201. 231
#> 6 body_mass_g int 2 0.6 95 2700 4202. 6300
#> 7 sex fct 11 3.2 3 NA NA NA
#> 8 year int 0 0 3 2007 2008. 2009
There are some NA values (unknown values) in the data. The variable with the most NAs is sex (11 observations, 3.2%). The four measurement variables, including flipper_length_mm, each contain only 2 observations with NAs.
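The NA counts reported by describe() can be cross-checked with base R. This is a minimal sketch, assuming the palmerpenguins package is installed; `na_counts` is a name introduced here for illustration only.

```r
library(palmerpenguins)

# is.na() returns a logical matrix for a data frame;
# colSums() adds up the TRUE values in each column
na_counts <- colSums(is.na(penguins))

# Show the columns with the most missing values first
sort(na_counts, decreasing = TRUE)
```

The counts for sex and flipper_length_mm should match the na column in the describe() output above.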
We use only penguins with known flipper length for the data exploration!
data <- penguins %>%
  filter(flipper_length_mm > 0)
This reduces the data from 344 to 342 penguins, because filter() drops rows where the condition evaluates to NA.
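Why a comparison like `flipper_length_mm > 0` removes the NA rows may not be obvious, so here is a small sketch of the mechanism. The name `data_explicit` is introduced for illustration; the `!is.na()` variant is an alternative way to state the same intent, not something the original workflow uses.

```r
library(palmerpenguins)
library(dplyr)

# Comparing NA with a number yields NA, not TRUE or FALSE
NA > 0

# filter() keeps only rows where the condition is TRUE,
# so rows with an NA flipper length are dropped
data <- penguins %>%
  filter(flipper_length_mm > 0)

# A more explicit way to express the same intent
data_explicit <- penguins %>%
  filter(!is.na(flipper_length_mm))

nrow(penguins)       # 344
nrow(data)           # 342
nrow(data_explicit)  # 342
```

Both versions keep the same 342 rows here, since every recorded flipper length is positive.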
data %>%
  explore_all()
What is the relationship between all the variables and species?
data %>%
  explore_all(target = species)
We already see some strong patterns in the data. flipper_length_mm separates species Gentoo from the others, and bill_length_mm separates species Adelie from Chinstrap. We also see that Chinstrap and Gentoo are located on separate islands.
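The patterns read from the plots can also be checked numerically. The following is a rough sketch using {dplyr} only; `species_means` and its column names are hypothetical names chosen for this example.

```r
library(palmerpenguins)
library(dplyr)

# Same filtered data as in the walkthrough
data <- penguins %>%
  filter(flipper_length_mm > 0)

# Number of penguins per island and species
data %>% count(island, species)

# Mean flipper and bill length per species
species_means <- data %>%
  group_by(species) %>%
  summarise(
    mean_flipper_mm = mean(flipper_length_mm),
    mean_bill_mm    = mean(bill_length_mm, na.rm = TRUE)
  )
species_means
```

The count table shows on which islands each species was observed, and the means give a first impression of how far apart the species are on the two measurements.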
Now we explain species using a decision tree:
data %>% explain_tree(target = species)
The tree gives a simple rule for identifying the species using just flipper_length_mm and bill_length_mm.
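To get a feel for how well these two variables identify the species, one could fit a classification tree directly. The sketch below uses the {rpart} package with default settings as an assumption; it is not necessarily what explain_tree() does internally, and the resulting splits may differ from the plot above. The names `fit`, `pred` and `accuracy` are introduced for illustration.

```r
library(palmerpenguins)
library(dplyr)
library(rpart)

# Same filtered data as in the walkthrough
data <- penguins %>%
  filter(flipper_length_mm > 0)

# Classification tree restricted to the two variables named in the text
fit <- rpart(
  species ~ flipper_length_mm + bill_length_mm,
  data = data,
  method = "class"
)

# Predict on the training data and compare with the true species
pred <- predict(fit, data, type = "class")
table(actual = data$species, predicted = pred)

# Share of correctly classified penguins (training accuracy, so optimistic)
accuracy <- mean(pred == data$species)
accuracy
```

Because the tree is evaluated on the same data it was fitted on, the accuracy is only a rough indication, not an estimate of performance on new penguins.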
Now let’s take a closer look at these variables:
data %>% explore(flipper_length_mm, bill_length_mm, target = species)
The plot shows a good, though not perfect, separation between the 3 species!