0

I have several CSV files for each year. Each file contains the same variables and observations.

df14 <- data.frame(name = c("one", "two", "three"), A = c(1,2,3), B = c(4, 2, 1), C = c(0, 1, 1))
df15 <- data.frame(name = c("one", "two", "three"), A = c(3,1,1), C = c(0, 0, 1), B = c(8, 5, 5))

Suppose df14 & df15 represent years 2014 & 2015 respectively.

Note: the variables are not recorded in the same order.

What I'd like to do is see how each variable (A, B, C) are changing by year for each name.

Is there a way to combine these in one data frame? Should I simply rbind them?

Update:

One thing I could do is assign the years as a new variable and rbind but is it good practice?

df14$year <- 2014; df15$year <- 2015
df <- rbind(df14, df15)

which gives:

   name A B C year
   one 1 4 0 2014
   two 2 2 1 2014
   three 3 1 1 2014
   one 3 8 0 2015
   two 1 5 0 2015
   three 1 5 1 2015
zx8754
  • 42,109
  • 10
  • 93
  • 154
user10853
  • 266
  • 1
  • 3
  • 17

3 Answers3

1

TRY:

library(data.table)
library(magrittr)
years_2_digt <- 14:15

DT <- 
rbindlist(lapply(years_2_digt, function(y) {
  get(paste0("df", y)) %>% 
  setDT %>% 
  .[, year := y] %>%
  setkeyv("name")
}))


DT.molt <- reshape2::melt(DT, id.vars=c("name", "year"))

library(ggplot2)
ggplot(data=DT.molt, aes(x=year, color=variable, y=value)) + 
    geom_line() + geom_point() + 
    facet_grid(name ~ .) + 
    ggtitle("Change by year and name")

enter image description here

Ricardo Saporta
  • 51,025
  • 13
  • 129
  • 166
  • The setcolorder part is new to me; thanks. As for the others I can't see how the result is different. The thing is that all frames have exactly the same variables (3 in the example). Moreover the observations are exactly 3 (in this case). The observations (rows) are represented by the "name" col. They are names of districts. – user10853 May 13 '15 at 20:15
  • @user10853 The `setcolorder` function is from the `data.table` package. It's not clear from your question what you want the ultimate result to be. If you could show what your desired output would be, it would be easier to help – Ricardo Saporta May 13 '15 at 20:17
  • I restated it now: see how each variable (A, B, C) are changing by year for each "name". – user10853 May 13 '15 at 20:17
  • You indeed managed to get the result. Could you please explain what the function does? Maybe there's no way to do what I've been thinking; the best option is a relational database i guess. Any way I'll mark your answer as a solution. Any new suggestions are appreciated too. – user10853 May 13 '15 at 20:43
  • to be specific, this part is weird: `setDT %>% "["(j=`:=`(year, y)) %>% setkeyv("name")` – user10853 May 13 '15 at 20:49
  • also how can I plot it for only one `name` ? – user10853 May 13 '15 at 20:55
  • The middle command of the "weird" part has a variant I think I've seen folks use: `setDT %>% .[,year:=y] %>% setkeyv("name")`. @user10853 All three commands are from the `data.table` package (loaded in the first line). – Frank May 13 '15 at 21:00
1

You can programmatically add the year column to each data frame and then rbind them. Here's an example that relies on being able to get the year corresponding to each data frame from the file name. Here, I've stored you sample data frames in a list. In your real use case, you'd read the csv files into a list using something like df.list = sapply(vector_of_file_names, read.csv).

df.list = list(df14=df14, df15=df15)

df.list = lapply(1:length(df.list), function(i) {
  df.list[[i]] = data.frame(df.list[[i]], 
                            year = 2000 + as.numeric(gsub(".*(\\d{2})\\.csv","\\1", names(df.list)[[i]])))
})

df = do.call(rbind, df.list)
eipi10
  • 81,881
  • 20
  • 176
  • 248
  • If you want a list, is there a reason to use `sapply` over `lapply` (in your first paragraph)? – Frank May 13 '15 at 20:24
  • 2
    So that the name of the file will become the name of the corresponding list element, which I then use in the function to get the year. – eipi10 May 13 '15 at 20:25
1

Here is a working example within one lapply:

Make some dummy CSV files:

df14 <- data.frame(name = c("one", "two", "three"), A = c(1,2,3), B = c(4, 2, 1), C = c(0, 1, 1))
df15 <- data.frame(name = c("one", "two", "three"), A = c(3,1,1), C = c(0, 0, 1), B = c(8, 5, 5))
df16 <- data.frame(name = c("one", "two", "three"), C = c(1,2,3), B = c(4, 2, 1), A = c(0, 1, 1))
df17 <- data.frame(name = c("one", "two", "three"), C = c(3,1,1), A = c(0, 0, 1), B = c(8, 5, 5))
#get dataframe names
myNames <- ls()[grepl("df",ls())]
lapply(myNames, function(i){write.csv(get(i),paste0(i,".csv"),row.names = FALSE)})

Solution: read CSV files, fix columns using sort, then rbind them into one dataframe:

#Solution - read CSV, fix columns, rbind
do.call(rbind,
        lapply(list.files(".","^df\\d*.csv"),
               function(i){
                 d <- read.csv(i)
                 res <- d[,sort(colnames(d))]
                 cbind(res,FileName=i)
               }))
# output
#    A B C  name FileName
# 1  1 4 0   one df14.csv
# 2  2 2 1   two df14.csv
# 3  3 1 1 three df14.csv
# 4  3 8 0   one df15.csv
# 5  1 5 0   two df15.csv
# 6  1 5 1 three df15.csv
# 7  0 4 1   one df16.csv
# 8  1 2 2   two df16.csv
# 9  1 1 3 three df16.csv
# 10 0 8 3   one df17.csv
# 11 0 5 1   two df17.csv
# 12 1 5 1 three df17.csv
zx8754
  • 42,109
  • 10
  • 93
  • 154
  • The other answers also basically do it in one command (rbindlist+lapply or do.call rbind + lapply), except for the reading in part, but +1 because this is exactly what I often do (tossing file names in as temporary a "src" column). – Frank May 13 '15 at 20:41