1

Suppose I have 2 data frames, one for 2015 and one for 2016. I want to run a regression for each data frame and plot one of the coefficient for each regression with their respective confidence interval. For example:

set.seed(1020022316)
library(dplyr)
library(stargazer)

df16 <- data.frame(
  x1 = rnorm(1000, 0, 2),
  t = sample(c(0, 1), 1000, T),
  e = rnorm(1000, 0, 10)
) %>% mutate(y = 0.5 * x1 + 2 * t + e) %>%
  select(-e)

df15 <- data.frame(
  x1 = rnorm(1000, 0, 2),
  t = sample(c(0, 1), 1000, T),
  e = rnorm(1000, 0, 10)
) %>% mutate(y = 0.75 * x1 + 2.5 * t + e) %>%
  select(-e)

lm16 <- lm(y ~ x1 + t, data = df16)

lm15 <- lm(y ~ x1 + t, data = df15)

stargazer(lm15, lm16, type="text", style = "aer", ci = TRUE, ci.level = 0.95)

I want to plot t=1.558, x=2015, and t=2.797, x=2016 with their respective .95 CI. What is the best way of doing this?

I could do it 'by hand', but I hope there is a better way.

library(ggplot2)
df.plot <-
  data.frame(
    y = c(lm15$coefficients[['t']], lm16$coefficients[['t']]),
    x = c(2015, 2016),
    lb = c(
      confint(lm15, 't', level = 0.95)[1],
      confint(lm16, 't', level = 0.95)[1]
    ),
    ub = c(
      confint(lm15, 't', level = 0.95)[2],
      confint(lm16, 't', level = 0.95)[2]
    )
  )
df.plot %>% ggplot(aes(x, y)) + geom_point() +
  geom_errorbar(aes(ymin = lb, ymax = ub), width = 0.1) + 
  geom_hline(aes(yintercept=0), linetype="dashed")

enter image description here


Best: The figure quality (looks nice), code elegance, easy to expand (more than 2 regressions)

Ignacio
  • 6,566
  • 8
  • 51
  • 99
  • 1
    You ask for the "best" way of doing something but don't describe the criteria used to judge what "best" means. What exactly is the problem you are trying to solve? What is the desired input and what is the desired output? – MrFlick Feb 23 '16 at 16:01
  • Not everyone loves dplyr and it is not necessary for a reproducible example here ... – Roland Feb 23 '16 at 16:08
  • 1
    The `broom` package might be the better way you're looking for. – Gregor Thomas Feb 23 '16 at 16:13
  • Thanks @Gregor. I try to figure out how to do it with `broom`. Right now what I have works, but is not 'elegant' and doing it for many regressions would be a pain. – Ignacio Feb 23 '16 at 16:18
  • 2
    You might benefit from fixing your issue upstream: work on efficiently fitting multiple models and extracting the data you want – Heroka Feb 23 '16 at 16:27
  • Strongly agree with Heroka. The implications of `df15` and `df16` are pretty terrible: [you should be using lists of data frames and lists of models](http://stackoverflow.com/a/24376207/903061), not sequentially named variables with identical or nearly identical code copied and pasted. – Gregor Thomas Feb 23 '16 at 22:50
  • @Gregor `df15` and `df16` are extremely simple examples that i created for a minimum reproducible example of the problem I was trying to solve... It was not my intention to make that the center of my question... – Ignacio Feb 23 '16 at 22:54
  • Just the sequential naming... the implication that you're creating very similar data sets and fitting very similar models, but not containing them in a list. – Gregor Thomas Feb 24 '16 at 02:28
  • Just copy and paste to make the example as simple as possible... – Ignacio Feb 24 '16 at 02:29

3 Answers3

7

This is a bit too long for a comment, so I post it as a partial answer.

It is unclear from your post if your main problem is to get the data into the right shape, or if it is the plotting itself. But just to follow up on one of the comments, let me show you how to do run several models using dplyr and broom that makes plotting easy. Consider the mtcars-dataset:

 library(dplyr)
 library(broom)
 models <- mtcars %>% group_by(cyl) %>% 
           do(data.frame(tidy(lm(mpg ~ disp, data = .),conf.int=T )))

 head(models) # I have abbreviated the following output a bit

    cyl        term estimate std.error statistic   p.value conf.low conf.high
  (dbl)       (chr)    (dbl)     (dbl)     (dbl)     (dbl)    (dbl)     (dbl)
     4 (Intercept)  40.8720    3.5896     11.39 0.0000012   32.752  48.99221
     4        disp  -0.1351    0.0332     -4.07 0.0027828   -0.210  -0.06010
     6 (Intercept)  19.0820    2.9140      6.55 0.0012440   11.591  26.57264
     6        disp   0.0036    0.0156      0.23 0.8259297   -0.036   0.04360

You see that this gives you all coefficients and confidence intervals in one nice dataframe, which makes plotting with ggplot easier. For instance, if your datasets have identical content, you could add a year identifier to them (e.g. df1$year <- 2000; df2$year <- 2001 etc), and bind them together afterwards (e.g. using bind_rows, of you can use bind_rows's .id option). Then you can use the year identifer instead of cyl in the above example.

The plotting then is simple. To use the mtcars data again, let's plot the coefficients for disp only (though you could also use faceting, grouping, etc):

 ggplot(filter(models, term=="disp"), aes(x=cyl, y=estimate)) + 
          geom_point() + geom_errorbar(aes(ymin=conf.low, ymax=conf.high))

To use your data:

 df <- bind_rows(df16, df15, .id = "years")

 models <- df %>% group_by(years) %>% 
           do(data.frame(tidy(lm(y ~ x1+t, data = .),conf.int=T ))) %>%
           filter(term == "t") %>% 
           ggplot(aes(x=years, y=estimate)) + geom_point() + 
           geom_errorbar(aes(ymin=conf.low, ymax=conf.high)) 

Note that you can easily add more and more models just by binding more and more data to the main dataframe. You can also easily use faceting, grouping or position-dodgeing to adjust the look of the corresponding plot if you want to plot more than one coefficient.

coffeinjunky
  • 9,583
  • 32
  • 50
0

This is the solution I have right now:

gen_df_plot <- function(reg, coef_name){
  df <- data.frame(y = reg$coefficients[[coef_name]],
                   lb = confint(reg, coef_name, level = 0.95)[1],
                   ub = confint(reg, coef_name, level = 0.95)[2])
  return(df)
}

df.plot <- lapply(list(lm15,lm16), gen_df_plot, coef_name = 't')

df.plot <- data.table::rbindlist(df.plot)

df.plot$x <- as.factor(c(2015, 2016))

df.plot %>% ggplot(aes(x, y)) + geom_point(size=4) +
  geom_errorbar(aes(ymin = lb, ymax = ub), width = 0.1, linetype="dotted") + 
  geom_hline(aes(yintercept=0), linetype="dashed") + theme_bw()

enter image description here

I don't love it, but it works.

Ignacio
  • 6,566
  • 8
  • 51
  • 99
0

Here is what might be generalized code. I have made a change to how "x" is defined so that you don't have to worry about alphabetic reordering of the factor.

#
# Paul Gronke and Paul Manson
# Early Voting Information Center at Reed College
#
# August 27, 2019
#
#
# Code to plot a single coefficient from multiple models, provided
# as an easier alternative to "coefplot" and "dotwhisker". Some users
# may find those packages more capable
#
# Code adapted from https://stackoverflow.com/questions/35582052/plot-regression-coefficient-with-confidence-intervals


# gen_df_plot function will create a tidy data frame for your plot
#   Currently set up to display 95% confidence intervals

gen_df_plot <- function(reg, coef_name){
  df <- data.frame(y = reg$coefficients[[coef_name]],
                   lb = confint(reg, coef_name, level = 0.95)[1],
                   ub = confint(reg, coef_name, level = 0.95)[2])
  return(df)
}

# Populate the data frame with a list of your model results.

df.plot <- lapply(list(model1,      # List your models here
                       model2), 
                  gen_df_plot, 
                  coef_name = 'x1') # Coefficient name

  # Convert the list to a tidy data frame

df.plot <- data.table::rbindlist(df.plot)

# Provide the coefficient or regression labels below, in the
# order that you want them to appear. The "levels=unique(.)" parameter
# overrides R's desire to order the factor alphabetically

df.plot$x <- c("Group 1", 
               "Group 2") %>%
  factor(., levels = unique(.),
         ordered = TRUE)

# Create your plot

df.plot %>% ggplot(aes(x, y)) + 
  geom_point(size=4) +
  geom_errorbar(aes(ymin = lb, ymax = ub), width = 0.1, linetype="dotted") + 
  geom_hline(aes(yintercept=0), linetype="dashed") + 
  theme_bw() +
  ggtitle("Comparing Coefficients") +
  ylab("Coefficient Value")```