While boxplots have become the de facto standard for plotting the distribution of data this is a vast oversimplification and may not show everything needed to evaluate the variation of data. This is particularly important for datasets which do not form a Gaussian “Normal” distribution that most researchers have become accustomed to.
While density plots are helpful in this regard, they can be less aesthetically pleasing than boxplots and harder to interpret for those familiar with boxplots. Often the only ways to compare multiple data types with density use slices of the data with faceting the plotting panes or overlaying density curves with colours and a legend. This approach is jarring for new users and leads to cluttered plots difficult to present to a wider audience.
##Violin Plots
Therefore violin plots are a powerful tool to assist researchers to visualise data, particularly in the quality checking and exploratory parts of an analysis. Violin plots have many benefits:
As shown below for the iris
dataset, violin plots show distribution information that the boxplot is unable to.
library("vioplot")
## Loading required package: sm
## Package 'sm', version 2.2-5.6: type help(sm) for summary information
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
data(iris)
boxplot(iris$Sepal.Length[iris$Species=="setosa"], iris$Sepal.Length[iris$Species=="versicolor"], iris$Sepal.Length[iris$Species=="virginica"], names=c("setosa", "versicolor", "virginica"))
vioplot(iris$Sepal.Length[iris$Species=="setosa"], iris$Sepal.Length[iris$Species=="versicolor"], iris$Sepal.Length[iris$Species=="virginica"], names=c("setosa", "versicolor", "virginica"))
##Violin Plot Area
However there are concerns that existing violin plot packages (such as ) scales the data to the most aesthetically suitable width rather than maintaining proportions comparable across data sets. Consider the differing distributions shown below:
par(mfrow=c(3, 1))
par(mar=rep(2, 4))
plot(density(iris$Sepal.Length[iris$Species=="setosa"]), main="Sepal Length: setosa", col="green")
plot(density(iris$Sepal.Length[iris$Species=="versicolor"]), main="Sepal Length: versicolor", col="blue")
plot(density(iris$Sepal.Length[iris$Species=="virginica"]), main="Sepal Length: virginica", col="palevioletred4")
par(mfrow=c(1, 1))
#Comparing datasets
Neither of these plots above show the relative distribtions on the same scale, even if we match the x-axis of a density plot the relative heights are obscured and difficult to compare.
par(mfrow=c(3, 1))
par(mar=rep(2, 4))
c(3, 9)
xaxis <- c(0, 1.25)
yaxis <-plot(density(iris$Sepal.Length[iris$Species=="setosa"]), main="Sepal Length: setosa", col="green", xlim=xaxis, ylim=yaxis)
plot(density(iris$Sepal.Length[iris$Species=="versicolor"]), main="Sepal Length: versicolor", col="blue", xlim=xaxis, ylim=yaxis)
plot(density(iris$Sepal.Length[iris$Species=="virginica"]), main="Sepal Length: virginica", col="palevioletred4", xlim=xaxis, ylim=yaxis)
par(mfrow=c(1, 1))
This can somewhat be addressed by overlaying density plots:
par(mfrow=c(1, 1))
c(3, 9)
xaxis <- c(0, 1.25)
yaxis <-plot(density(iris$Sepal.Length[iris$Species=="setosa"]), main="Sepal Length", col="green", xlim=xaxis, ylim=yaxis)
lines(density(iris$Sepal.Length[iris$Species=="versicolor"]), col="blue")
lines(density(iris$Sepal.Length[iris$Species=="virginica"]), col="palevioletred4")
legend("topright", fill=c("green", "blue", "palevioletred4"), legend=levels(iris$Species), cex=0.5)
This has the benefit of highlighting the different distributions of the data subsets. However, notice here that a figure legend become necessary, plot axis limits need to be defined to display the range of all distribution curves, and the plot quickly becomes cluttered if the number of factors to be compared becomes much larger.
##Area control in Violin plot
Therefore the areaEqual
parameter has been added to customise the violin plot to serve a similar purpose:
vioplot(iris$Sepal.Length[iris$Species=="setosa"], iris$Sepal.Length[iris$Species=="versicolor"], iris$Sepal.Length[iris$Species=="virginica"], names=c("setosa", "versicolor", "virginica"), main="Sepal Length", areaEqual = T)
If we compare this to the original vioplot functionality (defaulting to areaEqual = FALSE
) the differences between the two are clear.
par(mfrow=c(2,1))
par(mar=rep(2, 4))
vioplot(iris$Sepal.Length[iris$Species=="setosa"], iris$Sepal.Length[iris$Species=="versicolor"], iris$Sepal.Length[iris$Species=="virginica"], names=c("setosa", "versicolor", "virginica"), main="Sepal Length (Equal Width)", areaEqual = F)
vioplot(iris$Sepal.Length[iris$Species=="setosa"], iris$Sepal.Length[iris$Species=="versicolor"], iris$Sepal.Length[iris$Species=="virginica"], names=c("setosa", "versicolor", "virginica"), main="Sepal Length (Equal Area)", areaEqual = T)
par(mfrow=c(1,1))
Note that areaEqual
is considering the full area of the density distribution before removing the outlier tails. We leave it up to the users discretion which they elect to use. The areaEqual
functionality is compatible with all of the customisation used in discussed in the main vioplot vignette
vioplot(iris$Sepal.Length[iris$Species=="setosa"], iris$Sepal.Length[iris$Species=="versicolor"], iris$Sepal.Length[iris$Species=="virginica"], names=c("setosa", "versicolor", "virginica"), main="Sepal Length (Equal Area)", areaEqual = T, col=c("lightgreen", "lightblue", "palevioletred"), rectCol=c("green", "blue", "palevioletred3"), lineCol=c("darkolivegreen", "royalblue", "violetred4"), border=c("darkolivegreen4", "royalblue4", "violetred4"))
The violin width can further be scaled with wex
, which maintains the proportions across the datasets if areaEqual = TRUE
:
vioplot(iris$Sepal.Length[iris$Species=="setosa"], iris$Sepal.Length[iris$Species=="versicolor"], iris$Sepal.Length[iris$Species=="virginica"], names=c("setosa", "versicolor", "virginica"), main="Sepal Length (Equal Area)", areaEqual = T, col=c("lightgreen", "lightblue", "palevioletred"), rectCol=c("green", "blue", "palevioletred3"), lineCol=c("darkolivegreen", "royalblue", "violetred4"), border=c("darkolivegreen4", "royalblue4", "violetred4"), wex=1.25)
Notice the utility of areaEqual
for cases where different datasets have different underlying distributions:
vioplot(rnorm(200, 3, 0.5), rpois(200, 2.5), rbinom(100, 10, 0.4), rlnorm(200, 0, 0.5), rnbinom(200, 10, 0.9), rlogis(20, 0, 0.5), areaEqual = F, main="Equal Width", xlab="distribution", ylab="data value", names=c("normal", "poisson", "binomial", "log-normal", "neg-binomial", "logistic"))
vioplot(rnorm(200, 3, 0.5), rpois(200, 2.5), rbinom(100, 10, 0.4), rlnorm(200, 0, 0.5), rnbinom(200, 10, 0.9), rlogis(20, 0, 0.5), areaEqual = T, main="Equal Area", xlab="distribution", ylab="data value", names=c("normal", "poisson", "binomial", "log-normal", "neg-binomial", "logistic"))