3

I have a data that looks like the following dataframe, but every combo has about ten fields, starting with name1, adress1, city1, etc

   id name1  adress1 name2  adress2  name3  adress3
1  1  John street a  Burt street d  chris street 1
2  2  Jack street b   Ben street e connor street 2
3  3  Joey     <NA>   Bob street f   <NA>     <NA>  

Now I would like to rearrange this data so it is a bit more useful and it should look like so, but with the information from which entry it came from:

      id origin  names adresses
1  1      1   John street a
2  2      1   Jack street b
3  3      1   Joey     <NA>
4  1      2   Burt street d
5  2      2    Ben street e
6  3      2    Bob street f
7  1      3  chris street 1
8  2      3 connor street 2     

Using tidyr I can get a long format, but then I have a key column that contains all the variable names, name1, name2, name3, street1, etc.

I also tried using separate dataframes, one for each combination, e.g. one dataframe for the names, one for the streets, etc. But then joining everything back together results in the wrong records, because you can only join on id and in a long format this ID is replicated. I have also been looking into Reshape2, but that results in the same issue.

All the conversions of wide to long I have seen are when you have one column you want to convert to. I'm looking for the end result in 10 columns, or as in the example 2 columns.

Is there a function that I'm overlooking?

#code to generete the dataframes:
df <- data.frame(id = c(1,2,3), 
                 name1 = c("John", "Jack", "Joey"), 
                 adress1 = c("street a", "street b", NA), 
                 name2 = c("Burt", "Ben", "Bob"),
                 adress2 = c("street d", "street e", "street f"),
                 name3 = c("chris", "connor", NA),
                 adress3 = c("street 1", "street 2", NA),
                 stringsAsFactors = FALSE)


expecteddf <- data.frame(id = c(1,2,3,1,2,3,1,2), 
                         origin = c(rep(1, 3), rep(2, 3), rep(3, 2)), 
                         names = c("John", "Jack", "Joey", "Burt", "Ben", "Bob", "chris", "connor"), 
                         adresses = c("street a", "street b", NA, "street d", "street e", "street f", "street 1", "street 2"),
                         stringsAsFactors = FALSE


                   )
phiver
  • 19,366
  • 14
  • 36
  • 42

1 Answers1

4

We could use melt from the devel version of data.table which can take multiple patterns for the measure columns. Instructions to install the devel version of 'data.table' is here

We convert the 'data.frame' to 'data.table' (setDT(df)), melt, and specify the regex in the patterns of measure argument. Remove the rows that are NA for the 'names' and 'address' column.

library(data.table)#v1.9.5+
dM <- melt(setDT(df), measure=patterns(c('^name', '^adress')),
          value.name=c('names', 'address') )
dM[!(is.na(names) & is.na(address))]
# id variable  names  address
#1:  1        1   John street a
#2:  2        1   Jack street b
#3:  3        1   Joey       NA
#4:  1        2   Burt street d
#5:  2        2    Ben street e
#6:  3        2    Bob street f
#7:  1        3  chris street 1
#8:  2        3 connor street 2

Or we can use reshape from base R.

 dM2 <- reshape(df, idvar='id', varying=list(grep('name', names(df)), 
             grep('adress', names(df))), direction='long')

The NA rows can be removed as in the data.table solution by using standard 'data.frame' indexing after we create the logical index with is.na.

akrun
  • 674,427
  • 24
  • 381
  • 486
  • 1
    Both work as a charm. Looks like I have to look deeper into the data.table package to see what other gems are hiding in there. – phiver Aug 30 '15 at 15:00
  • @phiver Glad to know that it works. You should check out the devel version as it introduced some new functions. – akrun Aug 30 '15 at 15:01