How do I add a legend to a ggplot with multiple groups that have multiple columns of data in a dataframe

Question

I have a 60x13 data frame that contains 4 groups of data. One column is time in months (1-60); each group then has one column for the median value at time point n and two for the credible interval bounds at time point n. I want to produce a plot with a solid line for each median and dashed lines for the credible intervals over time. I have been able to do this by adding each column as its own geom_line and grouping manually, matching the colours of the medians to their corresponding credible intervals. However, I am unable to add a legend. Any help would be appreciated, thanks.

ggplot(data=data1, 
       aes(x=month)) +
  xlab("Month") +
  ylab("Hazard Ratio") +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_line(aes(y=median),
            color = "#4682B4",
            size = 1) +
  geom_line(aes(y=ucrd),
            color = "#4682B4",
            linetype=2,
            size = 0.9,
            alpha=0.5) +
  geom_line(aes(y=lcrd),
            color = "#4682B4",
            linetype=2,
            size = 0.9,
            alpha=0.5) +
  geom_line(aes(y=median.1),
            color = "#4BB446",
            size = 1) +
  geom_line(aes(y=ucrd.1),
            color = "#4BB446",
            linetype=2,
            size = 0.9,
            alpha=0.5) +
  geom_line(aes(y=lcrd.1),
            color = "#4BB446",
            linetype=2,
            size = 0.9,
            alpha=0.5) + 
  geom_line(aes(y=median.2),
            color = "#AF46B4",
            size = 1) +
  geom_line(aes(y=ucrd.2),
            color = "#AF46B4",
            linetype=2,
            size = 0.9,
            alpha=0.5) +
  geom_line(aes(y=lcrd.2),
            color = "#AF46B4",
            linetype=2,
            size = 0.9,
            alpha=0.5) +
  geom_line(aes(y=median.3),
            color = "#B47846",
            size = 1) +
  geom_line(aes(y=ucrd.3),
            color = "#B47846",
            linetype=2,
            size = 0.9,
            alpha=0.5) +
  geom_line(aes(y=lcrd.3),
            color = "#B47846",
            linetype=2,
            size = 0.9,
            alpha=0.5) +
  scale_color_manual(name = "Treatment",
                     values = c("#4682B4", "#4BB446", "#AF46B4", "#B47846"),
                     labels = c("a", "b", "c", "d"))

teunbrand · Answer 1 · 2020-07-06T16:48:55.157

This sounds a lot like a data shape problem. Since no data was provided, here is an example with dummy data. First we generate some data roughly in the shape of what you mention in the text.

library(tidyr)
library(ggplot2)

n <- 60
df <- data.frame(
  time = seq_len(n),
  group1_median = rnorm(n),
  group1_low = rnorm(n, -2),
  group1_high = rnorm(n, 2),
  group2_median = rnorm(n),
  group2_low = rnorm(n, -2),
  group2_high = rnorm(n, 2),
  group3_median = rnorm(n),
  group3_low = rnorm(n, -2),
  group3_high = rnorm(n, 2),
  group4_median = rnorm(n),
  group4_low = rnorm(n, -2),
  group4_high = rnorm(n, 2)
)

Now, we are going to reshape this from a wide format to a long format. What exactly the following function should look like depends a lot on the column names of your data. I chose the dummy data column names to be pretty easy.

df <- pivot_longer(
  df, -time,
  names_to = c("group", "metric"),
  names_sep = "_"
)
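
For column names like those in the question (`median`, `ucrd`, `lcrd`, then `median.1`, `ucrd.1`, and so on), the same call might be adapted roughly as in the sketch below. The stand-in `data1`, the `.0` suffix given to the unsuffixed columns, and the name `long1` are assumptions made for illustration; the real data may need a different pattern.

```r
library(tidyr)

# Stand-in for the question's data1 (random values; only the column names matter)
n <- 60
data1 <- data.frame(month = seq_len(n))
for (suffix in c("", ".1", ".2", ".3")) {
  data1[[paste0("median", suffix)]] <- rnorm(n)
  data1[[paste0("lcrd", suffix)]]   <- rnorm(n, -2)
  data1[[paste0("ucrd", suffix)]]   <- rnorm(n, 2)
}

# Give the first group an explicit ".0" suffix so every name splits the same way
bare <- names(data1) %in% c("median", "lcrd", "ucrd")
names(data1)[bare] <- paste0(names(data1)[bare], ".0")

# names_sep is treated as a regular expression, so the literal dot is escaped
long1 <- pivot_longer(
  data1, -month,
  names_to = c("metric", "group"),
  names_sep = "\\."
)
```

Here `group` ends up holding "0" to "3" and `metric` holds `median`, `lcrd` or `ucrd`, which is the same long shape as the dummy data above.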

Because median, low and high are now regarded as separate observations, we need to reshape the data again to make it slightly wider.

df <- pivot_wider(
  df, names_from = "metric"
)
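
As a possible shortcut, the `".value"` sentinel in `names_to` lets a single `pivot_longer()` call do both reshapes, so that `median`, `low` and `high` come out as columns directly. A minimal sketch on freshly generated dummy data (the object names `wide` and `long` are only for illustration):

```r
library(tidyr)

# Dummy data in the same wide shape as above
n <- 60
wide <- data.frame(time = seq_len(n))
for (g in paste0("group", 1:4)) {
  wide[[paste0(g, "_median")]] <- rnorm(n)
  wide[[paste0(g, "_low")]]    <- rnorm(n, -2)
  wide[[paste0(g, "_high")]]   <- rnorm(n, 2)
}

# ".value" marks the second piece of each name as the output column name,
# while the first piece goes into the "group" column
long <- pivot_longer(
  wide, -time,
  names_to = c("group", ".value"),
  names_sep = "_"
)
```

Whether this reads better than the two-step version is a matter of taste; the resulting data should have one row per time point and group.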

Then it is in pretty decent shape to put into ggplot2. Because colour is now mapped to group inside aes(), rather than set as a constant per layer, the legend will sort itself out.

ggplot(df, aes(time, colour = group)) +
  geom_line(aes(y = median)) +
  geom_ribbon(aes(ymin = low, ymax = high),
              linetype = 2, fill = NA)
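
To control the legend title, colours and labels, a manual colour scale can be added to a plot like this one. The sketch below reuses the hex codes and labels from the question; the pairing of `group1` to `group4` with the labels "a" to "d" and the object names `long` and `p` are assumptions.

```r
library(tidyr)
library(ggplot2)

# Rebuild the dummy data and reshape it in the same two steps as above
n <- 60
wide <- data.frame(time = seq_len(n))
for (g in paste0("group", 1:4)) {
  wide[[paste0(g, "_median")]] <- rnorm(n)
  wide[[paste0(g, "_low")]]    <- rnorm(n, -2)
  wide[[paste0(g, "_high")]]   <- rnorm(n, 2)
}
long <- pivot_longer(wide, -time,
                     names_to = c("group", "metric"), names_sep = "_")
long <- pivot_wider(long, names_from = "metric")

# The colour scale is what carries the legend title, colours and labels
p <- ggplot(long, aes(time, colour = group)) +
  geom_line(aes(y = median)) +
  geom_ribbon(aes(ymin = low, ymax = high),
              linetype = 2, fill = NA) +
  scale_colour_manual(
    name = "Treatment",
    values = c(group1 = "#4682B4", group2 = "#4BB446",
               group3 = "#AF46B4", group4 = "#B47846"),
    labels = c("a", "b", "c", "d")
  )
```

Printing `p` draws the plot. Naming the elements of `values` ties each colour to a specific group rather than relying on their order.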

If anyone has more appropriate reshape strategies, I'd love to hear them because I'm still learning to pivot correctly too.

Your pivoting seems pretty neat to me. I remember once seeing a comment or answer from Hadley saying that probably every reshape can be done in 2-3 steps (back then with gather and spread). It was one of those questions where someone asked about gathering (or spreading, I forget which) several variables into several other variables at once. I personally would probably favour a more "positive" selection of columns, e.g. `cols = matches("group")`, but this depends so much on the given data that it's hardly generalizable - tjebo, Jul 06 '20 at 17:11
Not exactly the comment I meant, but it's getting close. Also see his comment on akrun's answer https://stackoverflow.com/a/25932131/7941188 - tjebo, Jul 06 '20 at 17:20
Thanks Tjebo! I might do a bit of ggplot2 here and there, but I'm not a heavy tidyverse data wrangling user :') - teunbrand, Jul 06 '20 at 18:06
