3

Converting a list of data frames: not a simple rbind, second row to new columns

I have a list

employeesList = list(data.frame(first = ("Al"), second = "Jones"), 
                 data.frame(first = c("Al", "Barb"), second = c("Jones",       "Smith")),
             data.frame(first = c("Al", "Barb", "Carol"), second =   c("Jones", "Smith", "Adams")),
             data.frame(first = ("Al"), second = "Jones"))

I am looking to produce this.

employeesDF = data.frame(first = c("Al", "Al", "Al", "Al"), second = c("Jones", "Jones", "Jones", "Jones"),
                        first2 = c(NA, "Barb", "Barb", NA), second2 = c(NA, "Smith", "Smith", NA),
                        first3 = c(NA, NA, "Carol", NA), second3 = c(NA, NA, "Adams", NA))

I would like each data frame to be a row in the results data frame. Note that the first data frame when converted will have two columns, the second df when converted will have four columns, the third df when converted will produce 6 columns, the fourth df when converted will produce 2 columns and so on. I realize there will have to be a fill of NA values I have done some research and the problem is solved if the goal were simply to rbind. I do not see a way to solve my problem. Convert a list of data frames into one data frame

After reading this, Combine two data frames by rows (rbind) when they have different sets of columns I got a start with

res1 = cbind(t(employeesList[[1]][1]), t(employeesList[[1]][2]))
res2 = cbind(t(employeesList[[2]][1]), t(employeesList[[2]][2]))
res3 = cbind(t(employeesList[[3]][1]), t(employeesList[[3]][2]))
res4 = cbind(t(employeesList[[4]][1]), t(employeesList[[4]][2]))

Then

library(dplyr)
bind_rows(list(res1, res2, res3, res4))

But I may have a very large number of data frames -- res1, …, resn. The number of columns is unspecified in advance, but will likely be fewer than 10. My procedure does not name the columns and I think that is needed for bind_rows.

user2738483
  • 131
  • 1
  • 2
  • 10

4 Answers4

1

We can use lapply and convert the list into a single row data frame using and then use bind_rows to bind multiple data frames together.

library(dplyr)
bind_rows(lapply(employeesList, function(x) rbind.data.frame(c(t(x)))))

#   X.Al. X.Jones. X.Barb. X.Smith. X.Carol. X.Adams.
#1    Al    Jones    <NA>     <NA>     <NA>     <NA>
#2    Al    Jones    Barb    Smith     <NA>     <NA>
#3    Al    Jones    Barb    Smith    Carol    Adams
#4    Al    Jones    <NA>     <NA>     <NA>     <NA>

We can rename the columns later using setNames according to our preference.

Ronak Shah
  • 286,338
  • 16
  • 97
  • 143
  • 1
    I decided to time test the three methods. This method was the clear winner. Using this link, I created lists of data frames. The data frames had one, two, or three rows. There were 10,000 data frames of each of the three types. https://stackoverflow.com/questions/17499013/how-do-i-make-a-list-of-data-frames Method 1 dplyr # user system elapsed # 44.94 0.00 45.03 Method 2 data.table # user system elapsed # 225.36 4.68 229.61 Method 3 tidyverse # user system elapsed # 513.72 0.23 519.52 – user2738483 Jun 15 '18 at 21:56
1

May be this using data.table

library('data.table')

rbindlist( l = lapply( employeesList, function(x) {
                  dcast( data    = melt( setDT( x ), measure.vars = c( 'first', 'second'))[, V1 := seq_along(value), by = variable][],
                         formula = " . ~ variable + V1")[, -1]
                }), 
           fill = TRUE, 
           use.names = TRUE )

#    first_1 second_1 first_2 second_2 first_3 second_3
# 1:      Al    Jones      NA       NA      NA       NA
# 2:      Al    Jones    Barb    Smith      NA       NA
# 3:      Al    Jones    Barb    Smith   Carol    Adams
# 4:      Al    Jones      NA       NA      NA       NA
Sathish
  • 11,525
  • 3
  • 34
  • 55
1

Here is one option with gather/spread

library(tidyverse)
employeesList %>% 
   map_df(~ .x %>% 
               mutate_all(as.character) %>% # convert columns to character class
               mutate(n = row_number(), n = replace(n, n==1, "")),
             .id = 'grp') %>% 
  group_by(grp) %>%
  gather(key, val, first:second) %>% # gather to long format
  arrange(grp, n) %>% 
  unite(keyn, key, n, sep="") %>% # unite columns to create new column
  ungroup %>% 
  mutate(keyn = factor(keyn, levels = unique(keyn))) %>% # for column order
  spread(keyn, val) %>% # spread to wide format
  select(-grp)
# A tibble: 4 x 6
#  first second first2 second2 first3 second3
#  <chr> <chr>  <chr>  <chr>   <chr>  <chr>  
#1 Al    Jones  NA     NA      NA     NA     
#2 Al    Jones  Barb   Smith   NA     NA     
#3 Al    Jones  Barb   Smith   Carol  Adams  
#4 Al    Jones  NA     NA      NA     NA     
akrun
  • 674,427
  • 24
  • 381
  • 486
0

I made a mistake when I submitted my example data. It was not general enough in two different ways. The column names can vary inconsistently and the data can vary much more than indicated. I then asked my question at r-help. There the question was answered in multiple ways. Below are the solutions created by others along with my timing study.

# input data (list of data frames and data frames may have multiple rows)
employees4List = list(data.frame(first1 = "Al", second1 =
                                   "Jones"),
                      data.frame(first2 = c("Al2", "Barb"),
                                 second2 = c("Jones", "Smith")),
                      data.frame(first3 = c("Al3", "Barbara",
                                            "Carol"),
                                 second3 = c("Jones", "Smith",
                                             "Adams")),
                      data.frame(first4 = ("Al"), second4 =
                                   "Jones2"))
employees4List

# intermediate step (list of data frames with each just one row)
df1 = data.frame(First1 = "Al", Second1 = "Jones",
                 First2 = NA, Second2 = NA,
                 First3 = NA, Second3 = NA,
                 First4 = NA, Second4 = NA)
df2 = data.frame(First1 = "Al2", Second1 = "Jones",
                 First2 = "Barb", Second2 = "Smith",
                 First3 = NA, Second3 = NA,
                 First4 = NA, Second4 = NA)
df3 = data.frame(First1 = "Al3", Second1 = "Jones",
                 First2 = "Barbara", Second2 = "Smith",
                 First3 = "Carol", Second3 = "Adams",
                 First4 = NA, Second4 = NA)
df4 = data.frame(First1 = "Al", Second1 = "Jones2",
                 First2 = NA, Second2 = NA,
                 First3 = NA, Second3 = NA,
                 First4 = NA, Second4 = NA)
listFinal = list(df1, df2, df3, df4)
listFinal

# Expected final step, except that all columns should be character
# Just one data frame
dplyr::bind_rows(listFinal)
sapply(dplyr::bind_rows(listFinal), class)

# Solution 1 using base R by Sarah Goslee

dfbycol <- function(x) {
  x <- lapply(x, function(y)as.vector(t(as.matrix(y))))
  x <- lapply(x, function(y){length(y) <- max(sapply(x, length)); y})
  x <- do.call(rbind, x)
  x <- data.frame(x, stringsAsFactors=FALSE)
  colnames(x) <- paste0(c("first", "last"), rep(seq(1, ncol(x)/2), each=2))
  x
}

dfbycol(listFinal)

##########
# Solution 2 by Jeff Newmiller (Base R)

myrename2 <- function( DF, m ) {
  # if a pair of columns is not present, raise an error
  stopifnot( 2 == length( DF ) )
  n <- nrow( DF )
  # use memory layout of elements of matrix
  # t() automatically converts to matrix (nrow=2)
  # matrix(,nrow=1) re-interprets the column-major output of t()
  # as a single row matrix
  result <- as.data.frame( matrix( t( DF ), nrow = 1 )
                           , stringsAsFactors = FALSE
  )
  if ( n < m ) {
    result[ , seq( 2 * n + 1, 2 * m ) ] <- NA
  }
  setNames( result
            , sprintf( "%s%d"
                       , c( "First", "Second" )
                       , rep( seq.int( m ), each = 2 )
            )
  )
}

m <- max( unlist( lapply( employees4List, nrow ) ) )
listFinal2 <- lapply( employees4List, myrename2, m = m )
listFinal2

result2 <- do.call( rbind, listFinal2 )
result2

##########
# Solution 3 by Jeff Newmiller (uses dplyr)
myrename3 <- function( DF ) {
  # if a pair of columns is not present, raise an error
  stopifnot( 2 == length( DF ) )
  n <- nrow( DF )
  # use memory layout of elements of matrix
  # t() automatically converts to matrix (nrow=2)
  # matrix(,nrow=1) re-interprets the column-major output of t()
  # as a single row matrix
  setNames( as.data.frame( matrix( t( DF ), nrow = 1 )
                           , stringsAsFactors = FALSE
  )
  , sprintf( "%s%d"
             , c( "First", "Second" )
             , rep( seq.int( n ), each = 2 )
  )
  )
}

listFinal3 <- lapply( employees4List, myrename3 )
listFinal3
result3 <- dplyr::bind_rows( listFinal3 )
result3

# Solution 4 by Jeff Newmiller (uses dplyr and tidyr)

library(dplyr)
library(tidyr)
myrename4 <- function( DF ) {
  # if a pair of columns is not present, raise an error
  stopifnot( 2 == length( DF ) )
  names( DF ) <- c( "a", "b" )
  m <- nrow( DF )
  (  DF
    %>% mutate_all( as.character )
    %>% mutate( rw = LETTERS[ seq.int( n() ) ] )
    %>% gather( col, val, -rw )
    %>% tidyr::unite( "labels", rw, col, sep="" )
    %>% spread( labels, val )
    %>% setNames( sprintf( "%s%d"
                           , c( "First", "Second" )
                           , rep( seq.int( m ), each = 2 )
    )
    )
  )
}

listFinal4 <- lapply( employees4List, myrename3)
listFinal4
result4 <- dplyr::bind_rows(listFinal4)
result4

#####
# Timing
# Create a large dataset
firsts = c("Al", "Barb", "Carol")
seconds = c("Washington", "Adams", "Jefferson" )
numReplications = 10000
set.seed(2018)

# Create data frames
sim_list1 = replicate(n = numReplications,
                      expr = {data.frame(first = base::sample(x = firsts, size = 1, replace = TRUE),
                                         second = base::sample(x = seconds, size = 1, replace = TRUE))},
                      simplify = F)

sim_list2 = replicate(n = numReplications,
                      expr = {data.frame(first = base::sample(x = firsts, size = 2, replace = TRUE),
                                         second = base::sample(x = seconds, size = 2, replace = TRUE))},
                      simplify = F)

sim_list3 = replicate(n = numReplications,
                      expr = {data.frame(first = base::sample(x = firsts, size = 3, replace = TRUE),
                                         second = base::sample(x = seconds, size = 3, replace = TRUE))},
                      simplify = F)

# Create list
employeesList = c(sim_list1, sim_list2, sim_list3)

# Method 1

system.time(res1 <- dfbycol(employeesList))
# > system.time(dfbycol(employeesList))
# user  system elapsed 
# 757.87    0.18  758.62 
# res1
rm(res1)

#####
# Method 2

system.time(m <- max( unlist( lapply( employeesList, nrow ) ) ))
#    user  system elapsed 
#    0.22    0.00    0.22

system.time(listFinal2 <- lapply( employeesList, myrename2, m = m ) )
listFinal2
# user  system elapsed 
# 16.16    0.01   16.18 

system.time(result2 <- do.call( rbind, listFinal2 ) )
# result2
# user  system elapsed 
# 3.96    0.00    3.96
rm(listFinal2)
rm(result2)

#####
# Method 3

system.time(listFinal3 <- lapply( employeesList, myrename3))
# user  system elapsed 
# 7.33    0.00    7.33
listFinal3
system.time(result3 <- dplyr::bind_rows( listFinal3 ))
# user  system elapsed 
# 0.17    0.00    0.17
rm(listFinal3)
rm(result3)

#####
# Method 4

system.time(listFinal4 <- lapply( employeesList, myrename4) )
# user  system elapsed 
# 400.05    0.04  400.24 
listFinal4
system.time(result4 <- dplyr::bind_rows( listFinal4 ) )
#    user  system elapsed 
#   0.17    0.00    0.17 
# result4
user2738483
  • 131
  • 1
  • 2
  • 10