y axis limits for box and violin plot

annakrystalli · March 15, 2018, 5:13pm

I’ve been adapting some material to teach data visualisation in ggplot2 and am looking to include some advice on avoiding common data misrepresentation pitfalls.

I’ve been looking at the Calling b*****t visualisation blog posts, in particular, the Misleading axes on graphs post and the recommendations on when and when not to force the y-axis to include 0.

For example, in bar plots, where absolute differences are of primary interest, the y-axis should almost always include 0; otherwise the plot violates the principle of proportional ink, in which the amount of ink conveys quantity. Indeed, ggplot2::geom_bar defaults to extending the y-axis to 0 and does not take well to the use of + ylim(0,...).
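A minimal sketch of the bar plot default described above, assuming a made-up `sales` data frame (hypothetical, not from any real dataset). It uses `geom_col()`, which draws bars from the same zero baseline as `geom_bar()`, and inspects the trained y scale rather than relying on the rendered image:

```r
library(ggplot2)

# Hypothetical toy data, invented for illustration only
sales <- data.frame(group = c("a", "b", "c"), value = c(95, 100, 105))

# Bars are drawn from a baseline of 0, so the y scale is trained to include 0
p_bar <- ggplot(sales, aes(group, value)) + geom_col()
bar_limits <- layer_scales(p_bar)$y$get_limits()

# coord_cartesian() zooms without dropping data, but the visible bar heights
# are then no longer proportional to the values they encode
p_zoomed <- p_bar + coord_cartesian(ylim = c(90, 110))
```

The zoomed version is exactly the kind of plot the proportional ink principle warns against: the three bars look very different even though the values differ by about 10%.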

In time series, on the other hand, the rate of change and the trend along the x-axis are of most interest, and the “ink” used to represent the trend does not encode quantitative information. The y-axis therefore need not include 0, and forcing it to could indeed misrepresent features of interest in the data. When area is used instead of a line, however, proportional ink becomes important again.
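The line versus area distinction can be sketched with the `economics` dataset that ships with ggplot2 (chosen here only as a convenient example). The assumption being illustrated is that `geom_line()` fits the y scale to the data, while `geom_area()` fills down to 0 and so pulls 0 into the scale:

```r
library(ggplot2)

# unemploy is always far from 0 in this dataset
p_line <- ggplot(economics, aes(date, unemploy)) + geom_line()
p_area <- ggplot(economics, aes(date, unemploy)) + geom_area()

# Compare the lower limit of the trained y scale for each plot
line_limits <- layer_scales(p_line)$y$get_limits()
area_limits <- layer_scales(p_area)$y$get_limits()
```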

Given these considerations, what are your thoughts on the y-axis in box and violin plots? geom_boxplot and geom_violin default to fitting the y-axis to the range of the data, rightly highlighting the differences in the distributions and summaries of the y data points across the x variable. But should we also be considering differences in such distributions in the context of the absolute scale of the data? I.e., should box and violin plots ever be plotted with a y-axis extended to include 0? If so, when and why?
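For anyone wanting to compare the two options side by side, here is a rough sketch using the `mpg` dataset bundled with ggplot2 (picked only for illustration). It assumes `expand_limits(y = 0)` is an acceptable way to force 0 onto the axis; other approaches exist:

```r
library(ggplot2)

# Default: the y scale is fitted to the data, so 0 is not shown
p_box <- ggplot(mpg, aes(class, hwy)) + geom_boxplot()
p_violin <- ggplot(mpg, aes(class, hwy)) + geom_violin()

# Same plots with the y scale extended to include 0
p_box_zero <- p_box + expand_limits(y = 0)
p_violin_zero <- p_violin + expand_limits(y = 0)

# Lower limits of the trained y scales, for comparison
default_lower <- layer_scales(p_box)$y$get_limits()[1]
zero_lower <- layer_scales(p_box_zero)$y$get_limits()[1]
```

Unlike `ylim()`, `expand_limits()` only widens the scale, so no observations are dropped before the box or density statistics are computed.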

sckott · March 16, 2018, 4:34pm

perhaps @maelle @aammd or others have thoughts on this?

annakrystalli · March 16, 2018, 7:27pm

Perhaps one for stackoverflow even

sckott · March 16, 2018, 7:28pm

could be, i’d try to help, but i spend about 1% of my time making plots, so i’m pretty useless

brycem · March 16, 2018, 7:41pm

This is a really great question, @annakrystalli! I think I have a good case where including 0 in the y axis makes sense. Take a look at this graphic from my MS work:

Here, I included 0 in the y axis because I wanted it to be easy to find the absolute differences by reach. It’s not a time series, per se, but it does have a continuous x axis like a time series so I think it’s a relevant example.

Does this seem like a good example of a good time to include 0 in the y axis?

orchid00 · March 24, 2018, 8:41pm

If none of the categories has observations close to zero in its first quartile, then the y-axis doesn’t need to include zero. I’m thinking of population by country, average height in cm, life expectancy, tickets sold, average size, etc. None of these will actually include observations close to zero, and in cases like population, where values run to millions of people, it is unnecessary to leave half the plot empty. Maybe that’s a rule?
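One possible way to turn this rule of thumb into a check, sketched in base R. The function name `q1_near_zero` and the `tol` threshold (first quartile within 10% of the largest absolute value) are assumptions made up for this example; the post does not define what “close to zero” means:

```r
# Hypothetical helper: TRUE if any group's first quartile is "close" to zero,
# where "close" is an arbitrary fraction (tol) of the largest absolute value
q1_near_zero <- function(y, group, tol = 0.1) {
  q1 <- tapply(y, group, quantile, probs = 0.25, na.rm = TRUE)
  any(abs(q1) <= tol * max(abs(y), na.rm = TRUE))
}

# Toy data, invented for illustration
far_from_zero <- q1_near_zero(c(100, 110, 120, 130), c("a", "a", "b", "b"))
close_to_zero <- q1_near_zero(c(0, 1, 100, 110), c("a", "a", "b", "b"))
```

Under this reading, a `TRUE` result would suggest extending the y-axis to 0, and `FALSE` would suggest leaving the default limits alone.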

Also, I wanted to mention that raincloud plots are much cooler than violin plots. A raincloud plot is a hybrid of a density plot, a boxplot and a dot plot, and is most useful when you have lots of data.

maelle · May 27, 2018, 5:13pm

Oops sorry seeing this very late! I don’t have particular thoughts but I remember recently seeing a cool hybrid plot, half violin half raw data… I can’t find it right now.

cboettig · May 28, 2018, 6:07pm

@maelle you weren’t thinking about https://twitter.com/thomasp85/status/993586836529467392 by chance?

maelle · May 29, 2018, 11:02am

No but that’s a very cool one, thanks for sharing!
