I’ve been adapting some material to teach data visualisation in ggplot2 and am looking to include some advice on avoiding common data misrepresentation pitfalls.
For example, in barplots, where absolute differences are of primary interest, the y-axis should almost always include 0; otherwise the plot violates the principle of proportional ink for conveying quantity. Indeed, ggplot2::geom_bar defaults to extending the y axis to 0 and does not take well to the use of + ylim(0,...).
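To make the bar chart point concrete, here is a minimal sketch. It assumes the mpg dataset that ships with ggplot2 as a stand-in for real teaching data, and the object names are only illustrative. It contrasts the default zero baseline with a zoomed view made with coord_cartesian(), which changes the visible window without dropping any data.

```r
library(ggplot2)

# Count of cars per class; mpg is only a stand-in dataset (assumption).
p_bar <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# Default behaviour: every bar starts at zero, so bar height is
# proportional to the count it encodes.
bar_data <- layer_data(p_bar)

# Zooming with coord_cartesian() keeps all the data, but a non-zero lower
# bound cuts off the base of the bars and breaks proportional ink.
p_bar_zoomed <- p_bar + coord_cartesian(ylim = c(20, 70))
```

Printing p_bar and p_bar_zoomed side by side should show how much larger the differences between classes appear once the baseline is cut off.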
In time series, on the other hand, the rate of change and the trend along the x axis are of most interest, and the “ink” used to represent the trend does not encode quantitative information. There the y-axis need not include 0, and forcing it to could indeed misrepresent features of interest in the data. When area is used instead of a line, however, proportional ink becomes important again.
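The line versus area distinction can be sketched in the same way. This example assumes the economics dataset bundled with ggplot2, chosen purely for illustration. The line plot lets the y axis follow the data, while the area plot is filled down to zero, so its ink is proportional to the value again.

```r
library(ggplot2)

# Line: only position encodes the value, so the axis can follow the data.
p_line <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()

# Area: the filled region starts at zero, so its height encodes the value.
p_area <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_area()

# Trained y range of the line plot (does not reach zero for these data).
line_range <- layer_scales(p_line)$y$range$range

# Lower edge of the filled area for each drawn point.
area_ymin <- layer_data(p_area)$ymin
```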
Given these considerations, what are your thoughts on the y-axis in box and violin plots? geom_boxplot and geom_violin default to limiting the y-axis to the range of the data, rightly highlighting the differences in the distributions and summaries of the y data points across the x variable. But should we also be considering differences in such distributions in the context of the absolute scale of the data? I.e., should box and violin plots ever be plotted with a y-axis extended to include 0? If so, when and why?
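For anyone who wants to compare the two options directly, here is a small sketch. It assumes the built-in iris data as a placeholder, and uses expand_limits(), which extends the scale to include a value without discarding any observations.

```r
library(ggplot2)

# iris ships with R and is used here only as an illustrative stand-in.
p_violin <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_violin() +
  geom_boxplot(width = 0.1)

# Same plot, with the y scale extended to include zero.
p_violin_zero <- p_violin + expand_limits(y = 0)

# Trained y ranges: data range only versus a range that reaches zero.
default_range <- layer_scales(p_violin)$y$range$range
zero_range <- layer_scales(p_violin_zero)$y$range$range
```

Whether the second version helps or just compresses the distributions into the top of the panel is exactly the judgment call in question.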
This is a really great question, @annakrystalli! I think I have a good case where including 0 in the y axis makes sense. Take a look at this graphic from my MS work:
Here, I included 0 in the y axis because I wanted it to be easy to find the absolute differences by reach. It’s not a time series, per se, but it does have a continuous x axis like a time series so I think it’s a relevant example.
Does this seem like a good case for including 0 in the y axis?
If none of the categories has observations close to zero in its first quartile, then the y axis doesn’t need to include zero. I’m thinking of population by country, average height in cm, life expectancy, tickets sold, average size, etc. None of these will actually include observations close to zero, and in cases like population, where values run to millions of people, it is unnecessary to leave half the plot empty. Maybe that’s a rule?
Also, I wanted to mention that raincloud plots are much cooler than violin plots. A raincloud plot is a hybrid of a density plot, a boxplot and a dotplot, and is most useful when you have lots of data.
Oops sorry seeing this very late! I don’t have particular thoughts but I remember recently seeing a cool hybrid plot, half violin half raw data… I can’t find it right now.