
I have a large data frame with a factor column that I need to split into three factor columns by breaking up the factor names at a delimiter. Here is my current approach, which is very slow with a large data frame (sometimes several million rows):

data <- readRDS("data.rds")
data.df <- reshape2:::melt.array(data)
head(data.df)
##  Time Location    Class Replicate Population
##1    1        1 LIDE.1.S         1 0.03859605
##2    2        1 LIDE.1.S         1 0.03852957
##3    3        1 LIDE.1.S         1 0.03846853
##4    4        1 LIDE.1.S         1 0.03841260
##5    5        1 LIDE.1.S         1 0.03836147
##6    6        1 LIDE.1.S         1 0.03831485

Rprof("str.out")
cl <- which(names(data.df)=="Class")
Classes <- do.call(rbind, strsplit(as.character(data.df$Class), "\\."))
colnames(Classes) <- c("Species", "SizeClass", "Infected")
data.df <- cbind(data.df[,1:(cl-1)],Classes,data.df[(cl+1):(ncol(data.df))])
Rprof(NULL)

head(data.df)
##  Time Location Species SizeClass Infected Replicate Population
##1    1        1    LIDE         1        S         1 0.03859605
##2    2        1    LIDE         1        S         1 0.03852957
##3    3        1    LIDE         1        S         1 0.03846853
##4    4        1    LIDE         1        S         1 0.03841260
##5    5        1    LIDE         1        S         1 0.03836147
##6    6        1    LIDE         1        S         1 0.03831485

summaryRprof("str.out")

$by.self
                 self.time self.pct total.time total.pct
"strsplit"            1.34    50.00       1.34     50.00
"<Anonymous>"         1.16    43.28       1.16     43.28
"do.call"             0.04     1.49       2.54     94.78
"unique.default"      0.04     1.49       0.04      1.49
"data.frame"          0.02     0.75       0.12      4.48
"is.factor"           0.02     0.75       0.02      0.75
"match"               0.02     0.75       0.02      0.75
"structure"           0.02     0.75       0.02      0.75
"unlist"              0.02     0.75       0.02      0.75

$by.total
                       total.time total.pct self.time self.pct
"do.call"                    2.54     94.78      0.04     1.49
"strsplit"                   1.34     50.00      1.34    50.00
"<Anonymous>"                1.16     43.28      1.16    43.28
"cbind"                      0.14      5.22      0.00     0.00
"data.frame"                 0.12      4.48      0.02     0.75
"as.data.frame.matrix"       0.08      2.99      0.00     0.00
"as.data.frame"              0.08      2.99      0.00     0.00
"as.factor"                  0.08      2.99      0.00     0.00
"factor"                     0.06      2.24      0.00     0.00
"unique.default"             0.04      1.49      0.04     1.49
"unique"                     0.04      1.49      0.00     0.00
"is.factor"                  0.02      0.75      0.02     0.75
"match"                      0.02      0.75      0.02     0.75
"structure"                  0.02      0.75      0.02     0.75
"unlist"                     0.02      0.75      0.02     0.75
"[.data.frame"               0.02      0.75      0.00     0.00
"["                          0.02      0.75      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 2.68

Is there any way to speed up this operation? I note that there are a small (<5) number of each of the categories "Species", "SizeClass", and "Infected", and I know what these are in advance.

Notes:

  • stringr::str_split_fixed performs this task, but is no faster
  • The data frame is actually initially generated by calling reshape2::melt on an array in which Class and its associated levels are a dimension. If there's a faster way to get from there to here, great.
  • data.rds at http://dl.getdropbox.com/u/3356641/data.rds
smci
Noam Ross

3 Answers


This should give you quite a speed increase:

library(data.table)
DT <- data.table(data.df)

DT[, c("Species", "SizeClass", "Infected")
      := as.list(strsplit(as.character(Class), "\\.")[[1]]), by=Class ]

The reasons for the increase:

  1. data.table pre-allocates memory for columns
  2. every column assignment in a data.frame copies the entire data frame (data.table, in contrast, does not)
  3. the by statement lets you run the strsplit task once per unique value of Class, rather than once per row.
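Point 3 can be seen on toy data (a sketch with hypothetical class labels, not the OP's data): with `by=Class`, `strsplit` is evaluated once per unique class, not once per row.

```r
library(data.table)

# Toy data: 6 rows but only 2 unique Class values (hypothetical labels)
DT <- data.table(Class = rep(c("LIDE.1.S", "UMCA.2.I"), each = 3),
                 Population = runif(6))

# strsplit runs once per group (2 times), not once per row (6 times)
DT[, c("Species", "SizeClass", "Infected")
     := as.list(strsplit(as.character(Class), ".", fixed = TRUE)[[1]]),
   by = Class]

DT[1, Species]   # "LIDE"
```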

Here is a nice quick method for the whole process.

# Save the new col names as a character vector 
newCols <- c("Species", "SizeClass", "Infected") 

# split the string, then convert the new cols to columns
DT[, c(newCols) := as.list(strsplit(as.character(Class), "\\.")[[1]]), by=Class ]
DT[, c(newCols) := lapply(.SD, factor), .SDcols=newCols]

# remove the old column. This is instantaneous. 
DT[, Class := NULL]

## Have a look: 
DT[, lapply(.SD, class)]
#       Time Location Replicate Population Species SizeClass Infected
# 1: integer  integer   integer    numeric  factor    factor   factor

DT
Ricardo Saporta
  • That is fast! Though you need to use `as.character(Class)`. Can you return the columns as factors in the same command? – Noam Ross May 20 '13 at 01:15
  • You can convert to factor, but do it as a second line after. Using as.factor in the same call that includes a 'by' argument necessarily slows down the process. – Ricardo Saporta May 20 '13 at 01:28
  • @NoamRoss, nice catch on the `as.character`. Updated code plus a few extra steps – Ricardo Saporta May 20 '13 at 01:39
  • Looking at the answer from @SchaunW, your best increase will probably come from combining data.table with SchaunW's answer. – Ricardo Saporta May 20 '13 at 01:52
  • I think (3) is probably the main reason for the speedup in this case. – hadley May 20 '13 at 16:37
  • @hadley, I would guess it's #2. Without profiling, it's hard to be sure. If I find some downtime I'll test it out. – Ricardo Saporta May 20 '13 at 17:31
  • @RicardoSaporta I think you're overestimating how slow adding a new column to a data frame is. – hadley May 20 '13 at 19:26
  • @Hadley, you're probably right about (3) being the main reason for the speedup (especially re-reading the OP in that there are only a few unique values). However, #2 certainly has a real cost when the data is substantial. `data.df` right now is less than 2KB. Try the same process with, say, a large subset of the flight data. If the data is > 1GB, creating a column will probably be 1 or 2 seconds. Then try it again with `data.table`. I would guess you would cut that down by about 75~85%. – Ricardo Saporta May 20 '13 at 21:22
  • @RicardoSaporta I'm not disagreeing that data.table is faster, but it's unlikely in general that creating columns is the bottleneck (unless you're doing it in a loop in a naive way) – hadley May 20 '13 at 21:31

You could get a decent increase in speed by just extracting the parts of the string you need using gsub instead of splitting everything up and trying to put it back together:

data <- readRDS("~/Downloads/data.rds")
data.df <- reshape2:::melt.array(data)

# using `strsplit`
system.time({
cl <- which(names(data.df)=="Class")
Classes <- do.call(rbind, strsplit(as.character(data.df$Class), "\\."))
colnames(Classes) <- c("Species", "SizeClass", "Infected")
data.df <- cbind(data.df[,1:(cl-1)],Classes,data.df[(cl+1):(ncol(data.df))])
})

user  system elapsed 
3.349   0.062   3.411 

#using `gsub`
system.time({
data.df$Class <- as.character(data.df$Class)
data.df$SizeClass <- gsub("(\\w+)\\.(\\d+)\\.(\\w+)", "\\2", data.df$Class,
  perl = TRUE)
data.df$Infected  <- gsub("(\\w+)\\.(\\d+)\\.(\\w+)", "\\3", data.df$Class, 
  perl = TRUE)
data.df$Class  <- gsub("(\\w+)\\.(\\d+)\\.(\\w+)", "\\1", data.df$Class, 
  perl = TRUE)
})

user  system elapsed 
0.812   0.037   0.848 
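A quick sanity check of the gsub approach on toy strings (a sketch with hypothetical class labels; note the `(\w+)\.(\d+)\.(\w+)` pattern assumes the middle field is numeric):

```r
x   <- c("LIDE.1.S", "UMCA.2.I")        # hypothetical class labels
pat <- "(\\w+)\\.(\\d+)\\.(\\w+)"

# each gsub call keeps one captured field and discards the rest
species  <- gsub(pat, "\\1", x, perl = TRUE)
size     <- gsub(pat, "\\2", x, perl = TRUE)
infected <- gsub(pat, "\\3", x, perl = TRUE)

stopifnot(identical(species,  c("LIDE", "UMCA")),
          identical(size,     c("1", "2")),
          identical(infected, c("S", "I")))
```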
SchaunW

Looks like you have a factor, so work on the levels and then map back to the rows. Since the delimiter is a literal dot, use fixed=TRUE in strsplit with split="." (no regex escaping needed).

Classes <- do.call(rbind, strsplit(levels(data.df$Class), ".", fixed=TRUE))
colnames(Classes) <- c("Species", "SizeClass", "Infected")
df0 <- as.data.frame(Classes[data.df$Class,], row.names=NA)
cbind(data.df, df0)
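On toy data (a sketch with hypothetical labels), the factor's integer codes index the per-level table, expanding it back out to all rows, so `strsplit` runs only `nlevels()` times:

```r
# Hypothetical factor: 8 rows, only 2 levels
f <- factor(rep(c("LIDE.1.S", "UMCA.2.I"), times = 4))

# Split each level once (2 calls' worth of work, not 8)
Classes <- do.call(rbind, strsplit(levels(f), ".", fixed = TRUE))
colnames(Classes) <- c("Species", "SizeClass", "Infected")

# Indexing the matrix by the factor uses its integer codes,
# mapping the 2 level rows back out to all 8 rows
df0 <- as.data.frame(Classes[f, ], row.names = NA)
nrow(df0)        # 8
df0$Species[1]   # first row maps back to the "LIDE" level
```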
Martin Morgan
  • Actually a great answer - simple and adds no new dependencies. But the join can be slow with a big data.frame, too, so Ricardo's `data.frame` construct is a better solution for me. – Noam Ross May 20 '13 at 04:25
  • Oops, meant `data.table` construct, but I can't seem to edit the comment after this long. – Noam Ross May 27 '13 at 17:00
  • This suggestion sped up my code by an order of magnitude! Thanks! – CephBirk Nov 11 '16 at 13:35