Gather duplicate column sets into single columns

Question

The problem of gathering multiple sets of columns was already addressed here: Gather multiple sets of columns, but in my case, the columns are not unique.

I have the following data:

input <- data.frame(
  id = 1:2, 
  question = c("a", "b"),
  points = 0,
  max_points = c(3, 5),
  question = c("c", "d"),
  points = c(0, 20),
  max_points = c(5, 20),
  check.names = F,
  stringsAsFactors = F
)
input
#>   id question points max_points question points max_points
#> 1  1        a      0          3        c      0          5
#> 2  2        b      0          5        d     20         20

The first column is an id, then I have many repeated columns (the original dataset has 133 columns):

identifier for question
points given
maximum points

I would like to end up with this structure:

expected <- data.frame(
  id = c(1, 2, 1, 2),
  question = letters[1:4],
  points = c(0, 0, 0, 20),
  max_points = c(3, 5, 5, 20),
  stringsAsFactors = F
)
expected
#>   id question points max_points
#> 1  1        a      0          3
#> 2  2        b      0          5
#> 3  1        c      0          5
#> 4  2        d     20         20

I have tried several things:

tidyr::gather(input, key, val, -id)
reshape2::melt(input, id.vars = "id")

Both do not deliver the desired output. Furthermore, with more columns than shown here, gather doesn't work any more, because there are too many duplicate columns.

As a workaround I tried this:

# add numbers to make col headers "unique"
names(input) <- c("id", paste0(1:(length(names(input)) - 1), names(input)[-1]))

# gather, remove number, spread
input %>% 
  gather(key, val, -id) %>%
  mutate(key = stringr::str_replace_all(key, "[:digit:]", "")) %>%
  spread(key, val)

which gives an error: Duplicate identifiers for rows (3, 9), (4, 10), (1, 7), (2, 8)

This problem was already discussed here: Unexpected behavior with tidyr, but I don't know why/how I should add another identifier. Most likely this is not the main problem, because I probably should approach the whole thing differently.

How could I solve my problem, preferably with tidyr or base? I don't know how to use data.table, but in case there is a simple solution, I will settle for that too.

Are all your question, max_points, and points columns actually named the same thing? — Mike H., Jun 29 '16 at 12:43
@zx8754 As I said, I have 133 cols in total, so I'd rather not do it manually. — Thomas K, Jun 29 '16 at 12:46
Yes, I understand that, it was just a hint, indices can be computed. — zx8754, Jun 29 '16 at 12:47

score 5 · Accepted Answer · answered Jun 29 '16 at 12:52

Try this:

do.call(rbind,
        lapply(seq(2, ncol(input), 3), function(i){
          input[, c(1, i:(i + 2))]
              })
        )

#   id question points max_points
# 1  1        a      0          3
# 2  2        b      0          5
# 3  1        c      0          5
# 4  2        d     20         20

score 4 · Answer 2 · answered Jun 29 '16 at 14:04

The idiomatic way to do this in data.table is pretty simple:

library(data.table)
setDT(input)

res = melt(
  input, 
  id = "id", 
  meas = patterns("question", "^points$", "max_points"), 
  value.name = c("question", "points", "max_points")
)


   id variable question points max_points
1:  1        1        a      0          3
2:  2        1        b      0          5
3:  1        2        c      0          5
4:  2        2        d     20         20

You get the extra column called "variable", but you can get rid of it with res[, variable := NULL] afterwards if desired.

score 1 · Answer 3 · answered Jun 29 '16 at 13:14

Another way to accomplish the same goal without using lapply:

We start by grabbing all the columns for question, max_points, and points then we melt each one individually and cbind them all together.

library(reshape2)

questions <- input[,c(1,c(1:length(names(input)))[names(input)=="question"])]
points <- input[,c(1,c(1:length(names(input)))[names(input)=="points"])]
max_points <- input[,c(1,c(1:length(names(input)))[names(input)=="max_points"])]

questions_m <- melt(questions,id.vars=c("id"),value.name = "questions")[,c(1,3)]
points_m <- melt(points,id.vars=c("id"),value.name = "points")[,3,drop=FALSE]
max_points_m <- melt(max_points,id.vars=c("id"),value.name = "max_points")[,3, drop=FALSE]

res <- cbind(questions_m,points_m, max_points_m)
res
  id questions points max_points
1  1         a      0          3
2  2         b      0          5
3  1         c      0          5
4  2         d     20         20

score 0 · Answer 4 · answered Jun 29 '16 at 12:53

You might need to clarify how you want the ID column to be handled but perhaps something like this ?

runme <- function(word , dat){
     grep( paste0("^" , word , "$") , names(dat)) 
}

l <- mapply( runme ,  unique(names(input)) , list(input) )
l2 <- as.data.frame(l)

output <- data.frame()
for (i in 1:nrow(l2)) output <- rbind( output , input[,  as.numeric(l2[i,])  ])

Not sure how robust it is with respect to handling different numbers of repeated columns but it works for your test data and should work if you columns are repeated equal numbers of times.

Gather duplicate column sets into single columns

4 Answers4