0

I am using R to merge more than two data frames. Each data frame corresponds to one year of observations and there is a column that uniquely identifies an employee. For example for two years I would observe:

 emp.data <- data.frame(
  emp_id = c (1:5), 
  emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
  start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
                         "2015-03-27")),
  salary = c(623.3,515.2,611.0,729.0,843.25), 
  hours=c(100,56,34,65,38),
  
  stringsAsFactors = FALSE
)
# Print the data frame.         
print(emp.data) 

(https://stackoverflow.com/image.jpg)

and:

 emp.data2 <- data.frame(
  emp_id = c (2:6), 
  emp_name = c("Dan","Michelle","Ryan","Gary","Zack"),
  start_date = as.Date(c( "2013-09-23", "2014-11-15", "2014-05-11","2014-02-11",
                         "2015-03-27")),
  salary = c(515.2,611.0,729.0,843.25,10), 
  hours=c(56,34,65,38,9),
  
  stringsAsFactors = FALSE
)
# Print the data frame.         
print(emp.data2) 

enter image description here

Notice that I am essentially dealing with an unbalanced panel: columns across the years are the same, however the number of observations may vary across the years.

Essentially, I would like to merge around 10 of these data frames into one, I have tried using the function merge() in the following way:

merge(emp.data,emp.data2,by = "emp_id",all.x = TRUE,all.y = TRUE)

however it makes dimensions of the data frame increase proportionally to the amount of data frames merged since is gives me the output columns: hours.x hours.y, for example. This is clearly very inefficient since only one of those duplicate columns would have non-NA values. So essentially I am looking for a way to merge these data frames while keeping the columns constant and adding a year variable. Any suggestions?

Duck
  • 37,428
  • 12
  • 34
  • 70

1 Answers1

2
## Put the data frames in a list:
list_of_data = list(emp.data, emp.data.2, emp.data.3, ...)
## alternately, do this programmatically, maybe
# list_of_data = mget(ls(pattern = emp.data.*))

## Name the list with the years
names(list_of_data) = c(1996, 2014, 1066, ...)

## Combine it 
library(dplyr)
big_data = bind_rows(list_of_data, .id = "year")

For more details on this, see my answer at the FAQ How to make a list of data frames.

Gregor Thomas
  • 104,719
  • 16
  • 140
  • 257
  • That works great, thanks. An additional question. Do you have a suggestion on how to deal with cases in which columns may change over time? Some might disappear and some may appear, for instance. – Weierstraß Ramirez Aug 19 '20 at 15:58
  • 1
    This method will still work - you'll just get missing values in the years they aren't present. – Gregor Thomas Aug 19 '20 at 16:31