1

I have the following structure:

key | category_x | 2009 | category_y | 2010
test

example data as requested

set.seed(24)
df <- data.frame(
key = 1:10,
category_x = paste0("stock_", 0:9),
'2008' = rnorm(10, 0, 10),
category_y = paste0("stock_", 0:9),
'2009' = rnorm(10, 0, 10),
category_z = paste0("stock_", 0:9),
'2010' = rnorm(10, 0, 10),
check.names=FALSE
)

how do I change that into:

key | category | year

I know I can use:

library(magrittr)
library(dplyr)
library(tidyr)

data %>% gather(key, category, starts_with("category_"))

but that doesn't deal with the year. I looked at Gather multiple sets of columns

but I don't get the extract spread commands.

Community
  • 1
  • 1
KillerSnail
  • 2,637
  • 5
  • 39
  • 59

1 Answers1

2

If we are using gather, we can do this in two steps. First, we reshape from 'wide' to 'long' format for the column names that starts with 'category' and in the next step, we do the same with the numeric column names by selecting with matches. The matches can regex patterns, so a pattern of ^[0-9]+$ means we match one or more numbers ([0-9]+) from the start (^) to the end ($) of string. We can remove the columns that are not needed with select.

library(tidyr)
library(dplyr) 
gather(df, key, category, starts_with('category_')) %>%
     gather(key2, year, matches('^[0-9]+$')) %>%
     select(-starts_with('key'))

Or using the devel version of data.table, this would be much easier as the melt can take multiple patterns for measure columns. We convert the 'data.frame' to 'data.table' (setDT(df)), use melt and specify the patterns with in the measure argument. We also have options to change the column names of the 'value' column. The 'variable' column is set to NULL as it was not needed in the expected output.

library(data.table)#v1.9.5+
melt(setDT(df), measure=patterns(c('^category', '^[0-9]+$')), 
           value.name=c('category', 'year'))[, variable:=NULL][]
akrun
  • 674,427
  • 24
  • 381
  • 486