
So, I have a fairly large dataset (Dropbox: csv file) that I'm trying to plot using geom_boxplot. The following produces what appears to be a reasonable plot:

require(reshape2)
require(ggplot2)
require(scales)
require(grid)
require(gridExtra)

df <- read.csv("\\Downloads\\boxplot.csv", na.strings = "*")
df$year <- factor(df$year, levels = c(2010,2011,2012,2013,2014), labels = c(2010,2011,2012,2013,2014))

d <- ggplot(data = df, aes(x = year, y = value)) +
    geom_boxplot(aes(fill = station)) + 
    facet_grid(station~.) +
    scale_y_continuous(limits = c(0, 15)) + 
    theme(legend.position = "none")
d

However, when you dig a little deeper, problems creep in that freak me out. When I label the boxplot medians with their values, the following plot results.

df.m <- aggregate(value~year+station, data = df, FUN = function(x) median(x))
d <- d + geom_text(data = df.m, aes(x = year, y = value, label = value)) 
d

[Figure: boxplots with medians labelled]

The medians plotted by geom_boxplot aren't at the medians at all. The labels are plotted at the correct y-axis values, but the middle lines of the boxplots are definitely not at the medians. I've been stumped by this for a few days now.

What is the reason for this? How can this type of display be produced with correct medians? How can this plot be debugged or diagnosed?

vpipkt
Ryan Pugh
  • Your example code has an inconsistency in it. You are calling `geom_text` against `temp.m` but the median was computed into `turb.m`. Could this be the issue? – vpipkt Mar 27 '15 at 16:17
  • Ah! Good call on that... I tried to remove my inconsistencies from the original code, but I missed that one. That error would cause the geom_text layer to fail, but even without the geom_text added to the plot, the medians are still drawn incorrectly on the boxplots. – Ryan Pugh Mar 27 '15 at 18:45
  • Is the "*" in the `value` field to be interpreted as NA? – vpipkt Mar 27 '15 at 18:55
  • And what data type is `year` in your data frame? – vpipkt Mar 27 '15 at 19:02
  • When I read in the data using read.csv, I set na.strings = "*". – Ryan Pugh Mar 30 '15 at 12:05
  • (Edit timed out...) I've tried df$year as numeric, factor, and int to no avail. It's strange, because some of the boxplots appear correctly, with the medians labelled as expected. When I subset down to a year and station that isn't plotting correctly, I'm still getting a weird boxplot. I've scoured the data with a minimal dataset and I can't find any problem! – Ryan Pugh Mar 30 '15 at 12:18
  • Can you post your code to read the posted dataset as well? I do not find the same problem of your post. The boxplots I see are heavily right skewed and the median text are right on the hinge because the whole box is quite flat. – vpipkt Mar 30 '15 at 13:08
  • I've edited the original post to include the full code to generate the faceted plot. As you can see [here](https://www.dropbox.com/s/5hdnttszxss29c7/plot.tif?dl=0), where the labels fail to fall on the boxplot horizontal line, there's a problem. I've gone as far as to pare down the dataset to a single station (discharge), using only 2012 data and I still get the exact same boxplot. – Ryan Pugh Mar 30 '15 at 13:29

1 Answer


The cause of the problem is the use of scale_y_continuous with limits. ggplot2 performs operations in the following order:

  1. Scale Transformations
  2. Statistical Computations
  3. Coordinate Transformations

In this case, because limits are set on the y scale, ggplot2 excludes values outside those limits (with the default out-of-bounds handling they are replaced by NA) before the statistical computation of the boxplot hinges. The medians calculated by the aggregate function and used in the geom_text instruction, however, use the entire dataset. As a result, the median lines of the boxplots and the text labels can disagree.
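To make the discrepancy concrete, here is a minimal base-R sketch. The `value` vector is hypothetical toy data, not taken from the dataset in the question, and the `ifelse` step only approximates what the scale limits do to the data before the boxplot statistics are computed.

```r
# Hypothetical toy values (not from the question's dataset)
value <- c(2, 4, 6, 8, 20, 30, 40)

# What aggregate() sees: the median of the full data
median_all <- median(value)

# Roughly what the boxplot stat sees with limits = c(0, 15):
# values outside the limits are turned into NA and then dropped
censored <- ifelse(value >= 0 & value <= 15, value, NA)
median_censored <- median(censored, na.rm = TRUE)

print(c(all_data = median_all, within_limits = median_censored))
```

With these toy values the two medians differ (8 for the full data, 5 for the values within the limits), which is the same kind of gap seen between the text labels and the boxplot lines.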

The solution is to omit the scale_y_continuous instruction and instead use:

d <- ggplot(data = df, aes(x = year, y = value)) +
    geom_boxplot(aes(fill = station)) + 
    facet_grid(station~.) +
    theme(legend.position = "none") +
    # zoom the y-axis without dropping any data from the statistics
    coord_cartesian(ylim = c(0, 15))

This allows ggplot2 to calculate the boxplot hinge stats using the entire dataset, while restricting only the visible range of the y-axis.
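As for diagnosing this kind of problem, one option is to look at the statistics ggplot2 actually computed for the boxplot layer. The sketch below assumes ggplot2 is installed and uses a hypothetical toy data frame (`toy`) rather than the dataset from the question. It compares the `middle` value returned by `layer_data()` under both approaches.

```r
library(ggplot2)

# Hypothetical toy data: one year, three values above the intended limit of 15
toy <- data.frame(
  year  = factor(rep(2012, 7)),
  value = c(2, 4, 6, 8, 20, 30, 40)
)

p <- ggplot(toy, aes(x = year, y = value)) + geom_boxplot()

# Limits on the scale: out-of-range values are dropped before the stat runs
# (ggplot2 warns about the removed rows)
p_scale <- p + scale_y_continuous(limits = c(0, 15))

# Limits on the coordinate system: the stat still sees every value
p_coord <- p + coord_cartesian(ylim = c(0, 15))

# "middle" is the median line drawn by geom_boxplot
middle_scale <- layer_data(p_scale, 1)$middle
middle_coord <- layer_data(p_coord, 1)$middle

print(c(scale_limits = middle_scale,
        coord_limits = middle_coord,
        raw_median   = median(toy$value)))
```

If the `middle` value from the plot does not match `median()` on the raw data, the scale limits (or some other filtering step) are removing data before the statistics are computed.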

Ryan Pugh
  • This is a major problem with ggplot2 imho. I've been using it for a long time without picking up on this - there ought to be warnings. – ajrwhite Jan 12 '17 at 09:44
  • It has made me very cautious - and nervous - about how I interpret my data. I questioned everything about R when I first encountered this, until I finally understood what was going on. I'm sure there are a number of casual R/ggplot users who aren't aware that this could be impacting their work. – Ryan Pugh Jan 13 '17 at 12:05
  • @Ryan Pugh I would be grateful for examples of "statistical computations" within the ggplot code! – Agile Bean Oct 20 '19 at 06:35