815

Overview

I'm relatively familiar with data.table, not so much with dplyr. I've read through some dplyr vignettes and examples that have popped up on SO, and so far my conclusions are that:

  1. data.table and dplyr are comparable in speed, except when there are many (i.e. >10-100K) groups, and in some other circumstances (see benchmarks below)
  2. dplyr has more accessible syntax
  3. dplyr abstracts (or will) potential DB interactions
  4. There are some minor functionality differences (see "Examples/Usage" below)

In my mind 2. doesn't bear much weight because I am fairly familiar with data.table, though I understand that for users new to both it will be a big factor. I would like to avoid an argument about which is more intuitive, as that is irrelevant for my specific question asked from the perspective of someone already familiar with data.table. I also would like to avoid a discussion about how "more intuitive" leads to faster analysis (certainly true, but again, not what I'm most interested in here).

Question

What I want to know is:

  1. Are there analytical tasks that are a lot easier to code with one or the other package for people familiar with the packages (i.e. some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing)?
  2. Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. the other?

One recent SO question got me thinking about this a bit more, because up until that point I didn't think dplyr would offer much beyond what I can already do in data.table. Here is the dplyr solution (data at end of Q):

dat %.%
  group_by(name, job) %.%
  filter(job != "Boss" | year == min(year)) %.%
  mutate(cumu_job2 = cumsum(job2))
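
(For reference, here is the same pipeline in current dplyr syntax; the %.% operator used at the time has since been superseded by %>%.)

dat %>%
  group_by(name, job) %>%
  filter(job != "Boss" | year == min(year)) %>%
  mutate(cumu_job2 = cumsum(job2))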

This was much better than my hack attempt at a data.table solution. That said, good data.table solutions are also pretty good (thanks Jean-Robert and Arun; note that here I favored a single statement over the strictly most optimal solution):

setDT(dat)[,
  .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)], 
  by=list(id, job)
]

The syntax for the latter may seem very esoteric, but it actually is pretty straightforward if you're used to data.table (i.e. doesn't use some of the more esoteric tricks).

Ideally what I'd like to see is some good examples where the dplyr or data.table way is substantially more concise or performs substantially better.

Examples

Usage
  • dplyr does not allow grouped operations that return an arbitrary number of rows (from eddi's question; note: this looks like it will be implemented in dplyr 0.5, and @beginneR shows a potential work-around using do in the answer to @eddi's question).
  • data.table supports rolling joins (thanks @dholstius) as well as overlap joins
  • data.table internally optimises expressions of the form DT[col == value] or DT[col %in% values] for speed through automatic indexing, which uses binary search while keeping the same base R syntax (a tiny sketch follows after this list). See here for some more details and a tiny benchmark.
  • dplyr offers standard evaluation versions of functions (e.g. regroup, summarize_each_) that can simplify the programmatic use of dplyr (note programmatic use of data.table is definitely possible, just requires some careful thought, substitution/quoting, etc, at least to my knowledge)
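
To make the automatic-indexing point above concrete, a tiny illustrative sketch (column and value names are made up):

library(data.table)
DT <- data.table(col = sample(1e5L), val = runif(1e5))

# The first subset of the form DT[col == value] builds a secondary index on
# 'col' automatically; later subsets of the same form reuse it (binary search
# rather than a full vector scan), with no change to the base-R-style syntax.
DT[col == 5L]
DT[col %in% c(2L, 7L)]
# options(datatable.verbose = TRUE) makes data.table report when the index is used.
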
Benchmarks

Data

This is for the first example I showed in the question section.

dat <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L), name = c("Jane", "Jane", "Jane", "Jane", 
"Jane", "Jane", "Jane", "Jane", "Bob", "Bob", "Bob", "Bob", "Bob", 
"Bob", "Bob", "Bob"), year = c(1980L, 1981L, 1982L, 1983L, 1984L, 
1985L, 1986L, 1987L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L, 
1991L, 1992L), job = c("Manager", "Manager", "Manager", "Manager", 
"Manager", "Manager", "Boss", "Boss", "Manager", "Manager", "Manager", 
"Boss", "Boss", "Boss", "Boss", "Boss"), job2 = c(1L, 1L, 1L, 
1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("id", 
"name", "year", "job", "job2"), class = "data.frame", row.names = c(NA, 
-16L))
BrodieG
  • The solution that's similar in reading to the `dplyr` one is: `as.data.table(dat)[, .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)], by = list(name, job)]` – eddi Jan 29 '14 at 16:12
  • For #1 both `dplyr` and `data.table` teams are working on benchmarks, so an answer will be there at some point. #2 (syntax) IMO is strictly false, but that clearly ventures into opinion territory, so I'm voting to close as well. – eddi Jan 29 '14 at 16:16
  • @eddi, but are there problems that can be expressed cleanly in one but not the other? Or is it strictly a matter of style in your opinion? – BrodieG Jan 29 '14 at 16:23
  • well, again IMO, the set of problems that are more cleanly expressed in `(d)plyr` has measure 0 – eddi Jan 29 '14 at 16:26
  • @eddi, that's my general take as well, though I don't know `dplyr` well enough to rule it out and was curious if someone would come up with a good counterexample. Also, nice simplification of the `data.table` formulation. – BrodieG Jan 29 '14 at 16:29
  • This is very similar to http://stackoverflow.com/q/16153947/892313 which asked about `plyr` and `data.table` and was also closed. [My answer there](http://stackoverflow.com/a/16154157/892313) explains this in more detail. – Brian Diggs Jan 29 '14 at 16:46
  • @eddi IMHO the set of problems more cleanly expressed in `data.table` has measure 0, but that's not inconsistent with your belief ;) – hadley Jan 29 '14 at 19:15
  • @hadley :) it's only consistent if you think that neither covers a non-zero-measure set of problems; on a more serious note, I think a vignette that covers 99% of the group/summarize/etc questions that appear on SO (which are all very close reincarnations of about 5-10 distinct problems), with all the different implementations and their benchmarks is long overdue – eddi Jan 29 '14 at 19:28
  • @eddi it's also consistent if you believe that data.table and dplyr are equally expressive for every problem. I don't think benchmarks are that interesting, but from my perspective http://cran.r-project.org/web/packages/dplyr/vignettes/benchmark-baseball.html covers most interesting (to me) problem types. – hadley Jan 29 '14 at 20:48
  • @hadley yeah that probably does cover the majority of the problems. Two quick comments - no `data.table` joins at the end? And (probably repeating the obvious stated elsewhere) what about more interesting data sizes (where I'd define "interesting" as something that takes more than say a minute in base - which is generally the main reason people start exploring benchmarks of packages)? – eddi Jan 29 '14 at 21:20
  • @BrodieG the one thing that really bugs me about both `dplyr` and `plyr` with regards to syntax, and is basically the main reason why I dislike their syntax, is that I have to learn way too many (read more than 1) extra functions (with names that *still* don't make sense to me), remember what they do, what arguments they take, etc. That has always been a huge turn-off for me from the plyr philosophy. – eddi Jan 29 '14 at 21:48
  • @eddi I just got tired of writing data.table conversions. I'd happily accept a pull-request that added the data.table equivalents. I've stuck with small data sizes since the majority of the algorithms are O(n), but I'd be happy to see counter examples. – hadley Jan 29 '14 at 22:51
  • @eddi [tongue-in-cheek] the one thing that really bugs me about data.table syntax is that I have to learn how way too many function arguments interact, and what cryptic shortcuts mean (e.g. `.SD`). [seriously] I think these are legitimate design differences that will appeal to different people – hadley Jan 29 '14 at 22:53
  • @hadley re `.SD` et al - that's fair - `.SD` took me a little while to understand, but by the time I got there, I was already able to do a lot, while (d)plyr presents you with a large barrier right upfront. – eddi Jan 29 '14 at 23:18
  • This seems to be an example where data.table (currently) provides the simpler solution: http://stackoverflow.com/questions/21295936/can-dplyr-summarise-over-several-variables-without-listing-each-one – Vincent Jan 31 '14 at 05:06
  • Here is one example comparing `dplyr` and `data.table`: http://stackoverflow.com/questions/21477525/fast-frequency-and-percentage-table-with-dplyr/ – danas.zuokas Jan 31 '14 at 13:02
  • [Here](https://gist.github.com/arunsrinivasan/db6e1ce05227f120a2c9) is an aspect of comparison I just did to give additional perspective to recent tweets on joins in dplyr/data.table. HTH. – Arun May 07 '14 at 14:51
  • @Arun, interesting. I made a note of this in the post. – BrodieG May 07 '14 at 15:51
  • Thought this question had been closed as not allowed on S.O.? If it is going to be maintained here then how about making it a community wiki - I'm all for that. There's quite a bit missing from the comparison currently; e.g. ordered joins. Are we allowed to add answers? – Matt Dowle Oct 24 '14 at 18:37
  • @MattDowle, it was closed, but someone re-opened it. Presumably answers are allowed? Certainly can be made into a community wiki (not sure how). – BrodieG Oct 24 '14 at 19:04
  • Rolling joins. data.table does them. dplyr does not (yet). – dholstius Nov 05 '14 at 03:51
  • @dholstius, now noted, thanks for pointing it out, I hadn't thought of that. It also led me to discover the implementation of overlap joins. – BrodieG Nov 05 '14 at 19:24
  • My advice for someone starting with one of them, like me: I would recommend learning `data.table`; it is much less verbose than the standard data frame and most of the operations are faster. It forces you to think with a vectorial mindset (which has better performance). I see it as a good substitute for the data frame. The syntax is not intuitive at first, but once you get used to it, it is easy to remember. I see `dplyr` as a set of functions, but `data.table` as a new class of object with much better performance and more concise syntax than the traditional data frame. – David Leal Mar 15 '17 at 04:15
  • @eddi Why are you writing all that expression in the "j" and not some in the "i" side? – skan Jul 16 '17 at 20:13
  • @skan I don't remember the context of this anymore, presumably it was to achieve similar reading. – eddi Jul 17 '17 at 18:36

4 Answers

575

We need to cover at least these aspects to provide a comprehensive answer/comparison (in no particular order of importance): Speed, Memory usage, Syntax and Features.

My intent is to cover each of these as clearly as possible from a data.table perspective.

Note: unless explicitly mentioned otherwise, by referring to dplyr, we refer to dplyr's data.frame interface whose internals are in C++ using Rcpp.


The data.table syntax is consistent in its form - DT[i, j, by]. Keeping i, j and by together is by design. By keeping related operations together, it allows operations to be easily optimised for speed and, more importantly, memory usage, and also provides some powerful features, all while maintaining the consistency in syntax.

1. Speed

Quite a few benchmarks (though mostly on grouping operations) have been added to the question already showing data.table gets faster than dplyr as the number of groups and/or rows to group by increase, including benchmarks by Matt on grouping from 10 million to 2 billion rows (100GB in RAM) on 100 - 10 million groups and varying grouping columns, which also compares pandas. See also updated benchmarks, which include Spark and pydatatable as well.

On benchmarks, it would be great to cover these remaining aspects as well:

  • Grouping operations involving a subset of rows - i.e., DT[x > val, sum(y), by = z] type operations.

  • Benchmark other operations such as update and joins.

  • Also benchmark memory footprint for each operation in addition to runtime.

2. Memory usage

  1. Operations involving filter() or slice() in dplyr can be memory inefficient (on both data.frames and data.tables). See this post.

    Note that Hadley's comment talks about speed (that dplyr is plenty fast for him), whereas the major concern here is memory.

  2. The data.table interface at the moment allows one to modify/update columns by reference (note that we don't need to re-assign the result back to a variable).

     # sub-assign by reference, updates 'y' in-place
     DT[x >= 1L, y := NA]
    

    But dplyr will never update by reference. The dplyr equivalent would be (note that the result needs to be re-assigned):

     # copies the entire 'y' column
     ans <- DF %>% mutate(y = replace(y, which(x >= 1L), NA))
    

    A concern here is referential transparency. Updating a data.table object by reference, especially within a function, may not always be desirable. But this is an incredibly useful feature: see this post and this one for interesting cases. And we want to keep it.
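
    A tiny sketch of why this can surprise (toy data, hypothetical function name): a := inside a function updates the caller's object, and no copy is made.

     DT <- data.table(x = 1:3)
     f  <- function(d) d[, y := x * 2L]  # updates by reference
     f(DT)
     DT                                  # DT now has the new column 'y'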

    Therefore we are working towards exporting shallow() function in data.table that will provide the user with both possibilities. For example, if it is desirable to not modify the input data.table within a function, one can then do:

     foo <- function(DT) {
         DT = shallow(DT)          ## shallow copy DT
         DT[, newcol := 1L]        ## does not affect the original DT 
         DT[x > 2L, newcol := 2L]  ## no need to copy (internally), as this column exists only in shallow copied DT
         DT[x > 2L, x := 3L]       ## have to copy (like base R / dplyr does always); otherwise original DT will 
                                   ## also get modified.
     }
    

    By not using shallow(), the old functionality is retained:

     bar <- function(DT) {
         DT[, newcol := 1L]        ## old behaviour, original DT gets updated by reference
         DT[x > 2L, x := 3L]       ## old behaviour, update column x in original DT.
     }
    

    By creating a shallow copy using shallow(), we understand that you don't want to modify the original object. We take care of everything internally to copy the columns you modify only when it is absolutely necessary. When implemented, this should settle the referential transparency issue altogether while providing the user with both possibilities.

    Also, once shallow() is exported dplyr's data.table interface should avoid almost all copies. So those who prefer dplyr's syntax can use it with data.tables.

    But it will still lack many features that data.table provides, including (sub)-assignment by reference.

  3. Aggregate while joining:

    Suppose you have two data.tables as follows:

     DT1 = data.table(x=c(1,1,1,1,2,2,2,2), y=c("a", "a", "b", "b"), z=1:8, key=c("x", "y"))
     #    x y z
     # 1: 1 a 1
     # 2: 1 a 2
     # 3: 1 b 3
     # 4: 1 b 4
     # 5: 2 a 5
     # 6: 2 a 6
     # 7: 2 b 7
     # 8: 2 b 8
     DT2 = data.table(x=1:2, y=c("a", "b"), mul=4:3, key=c("x", "y"))
     #    x y mul
     # 1: 1 a   4
     # 2: 2 b   3
    

    And you would like to get sum(z) * mul for each row in DT2 while joining by columns x,y. We can either:

      1. aggregate DT1 to get sum(z), then 2) perform a join and 3) multiply, or

        data.table way

        DT1[, .(z = sum(z)), keyby = .(x,y)][DT2][, z := z*mul][]

        dplyr equivalent

        DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>% right_join(DF2) %>% mutate(z = z * mul)

      2. do it all in one go (using the by = .EACHI feature):

        DT1[DT2, list(z=sum(z) * mul), by = .EACHI]

    What is the advantage?

    • We don't have to allocate memory for the intermediate result.

    • We don't have to group/hash twice (one for aggregation and other for joining).

    • And more importantly, the operation we want to perform is clear from looking at j in (2).

    Check this post for a detailed explanation of by = .EACHI. No intermediate results are materialised, and the join+aggregate is performed all in one go.

    Have a look at this, this and this posts for real usage scenarios.

    In dplyr you would have to join and aggregate or aggregate first and then join, neither of which are as efficient, in terms of memory (which in turn translates to speed).

  4. Update and joins:

    Consider the data.table code shown below:

     DT1[DT2, col := i.mul]
    

    adds/updates DT1's column col with mul from DT2 on those rows where DT2's key columns match DT1. I don't think there is an exact equivalent of this operation in dplyr, i.e., without using a *_join operation, which would have to copy the entire DT1 just to add a new column to it, which is unnecessary (a rough dplyr approximation is sketched below).

    Check this post for a real usage scenario.
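
    For contrast, a rough dplyr approximation (just a sketch, assuming DF1/DF2 are data.frame copies of DT1/DT2) has to materialise a full join and re-assign, copying all of DF1 just to add that one column:

     DF1 <- DF1 %>% left_join(select(DF2, x, y, col = mul), by = c("x", "y"))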

To summarise, it is important to realise that every bit of optimisation matters. As Grace Hopper would say, Mind your nanoseconds!

3. Syntax

Let's now look at syntax. Hadley commented here:

Data tables are extremely fast but I think their concision makes it harder to learn and code that uses it is harder to read after you have written it ...

I find this remark pointless because it is very subjective. What we can perhaps try is to contrast consistency in syntax. We will compare data.table and dplyr syntax side-by-side.

We will work with the dummy data shown below:

DT = data.table(x=1:10, y=11:20, z=rep(1:2, each=5))
DF = as.data.frame(DT)
  1. Basic aggregation/update operations.

     # case (a)
     DT[, sum(y), by = z]                       ## data.table syntax
     DF %>% group_by(z) %>% summarise(sum(y)) ## dplyr syntax
     DT[, y := cumsum(y), by = z]
     ans <- DF %>% group_by(z) %>% mutate(y = cumsum(y))
    
     # case (b)
     DT[x > 2, sum(y), by = z]
     DF %>% filter(x>2) %>% group_by(z) %>% summarise(sum(y))
     DT[x > 2, y := cumsum(y), by = z]
     ans <- DF %>% group_by(z) %>% mutate(y = replace(y, which(x > 2), cumsum(y[x > 2])))
    
     # case (c)
     DT[, if(any(x > 5L)) y[1L]-y[2L] else y[2L], by = z]
     DF %>% group_by(z) %>% summarise(if (any(x > 5L)) y[1L] - y[2L] else y[2L])
     DT[, if(any(x > 5L)) y[1L] - y[2L], by = z]
     DF %>% group_by(z) %>% filter(any(x > 5L)) %>% summarise(y[1L] - y[2L])
    
    • data.table syntax is compact and dplyr's quite verbose. Things are more or less equivalent in case (a).

    • In case (b), we had to use filter() in dplyr while summarising. But while updating, we had to move the logic inside mutate(). In data.table however, we express both operations with the same logic - operate on rows where x > 2, but in first case, get sum(y), whereas in the second case update those rows for y with its cumulative sum.

      This is what we mean when we say the DT[i, j, by] form is consistent.

    • Similarly in case (c), when we have an if-else condition, we are able to express the logic "as-is" in both data.table and dplyr. However, if we would like to return just those rows where the if condition is satisfied and skip otherwise, we cannot use summarise() directly (AFAICT). We have to filter() first and then summarise, because summarise() always expects a single value.

      While it returns the same result, using filter() here makes the actual operation less obvious.

      It might very well be possible to use filter() in the first case as well (does not seem obvious to me), but my point is that we should not have to.

  2. Aggregation / update on multiple columns

     # case (a)
     DT[, lapply(.SD, sum), by = z]                     ## data.table syntax
     DF %>% group_by(z) %>% summarise_each(funs(sum)) ## dplyr syntax
     DT[, (cols) := lapply(.SD, sum), by = z]
     ans <- DF %>% group_by(z) %>% mutate_each(funs(sum))
    
     # case (b)
     DT[, c(lapply(.SD, sum), lapply(.SD, mean)), by = z]
     DF %>% group_by(z) %>% summarise_each(funs(sum, mean))
    
     # case (c)
     DT[, c(.N, lapply(.SD, sum)), by = z]     
     DF %>% group_by(z) %>% summarise_each(funs(n(), sum))
    
    • In case (a), the codes are more or less equivalent. data.table uses the familiar base function lapply(), whereas dplyr introduces *_each() along with a bunch of functions passed to funs().

    • data.table's := requires column names to be provided, whereas dplyr generates them automatically.

    • In case (b), dplyr's syntax is relatively straightforward. Improving aggregations/updates on multiple functions is on data.table's list.

    • In case (c) though, dplyr would return n() as many times as there are columns, instead of just once. In data.table, all we need to do is to return a list in j. Each element of the list will become a column in the result. So, we can use, once again, the familiar base function c() to concatenate .N to the list.

    Note: Once again, in data.table, all we need to do is return a list in j. Each element of the list will become a column in result. You can use c(), as.list(), lapply(), list() etc... base functions to accomplish this, without having to learn any new functions.

    You will need to learn just the special variables - .N and .SD at least. The equivalents in dplyr are n() and .

  3. Joins

    dplyr provides separate functions for each type of join, whereas data.table allows joins using the same syntax DT[i, j, by] (and with reason). It also provides an equivalent merge.data.table() function as an alternative.

     setkey(DT1, x, y)
    
     # 1. normal join
     DT1[DT2]            ## data.table syntax
     left_join(DT2, DT1) ## dplyr syntax
    
     # 2. select columns while join    
     DT1[DT2, .(z, i.mul)]
     left_join(select(DT2, x, y, mul), select(DT1, x, y, z))
    
     # 3. aggregate while join
     DT1[DT2, .(sum(z) * i.mul), by = .EACHI]
     DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>% 
         inner_join(DF2) %>% mutate(z = z*mul) %>% select(-mul)
    
     # 4. update while join
     DT1[DT2, z := cumsum(z) * i.mul, by = .EACHI]
     ??
    
     # 5. rolling join
     DT1[DT2, roll = -Inf]
     ??
    
     # 6. other arguments to control output
     DT1[DT2, mult = "first"]
     ??
    
  • Some might find a separate function for each join much nicer (left, right, inner, anti, semi etc), whereas others might like data.table's DT[i, j, by], or merge() which is similar to base R.

  • However dplyr joins do just that. Nothing more. Nothing less.

  • data.tables can select columns while joining (2), and in dplyr you will need to select() first on both data.frames before joining as shown above. Otherwise you would materialise the join with unnecessary columns only to remove them later, and that is inefficient.

  • data.tables can aggregate while joining using by = .EACHI feature (3) and also update while joining (4). Why materialize the entire join result to add/update just a few columns?

  • data.table is capable of rolling joins (5) - roll forward (LOCF), roll backward (NOCB) and nearest (a small sketch follows below).

  • data.table also has mult = argument which selects first, last or all matches (6).

  • data.table has allow.cartesian = TRUE argument to protect from accidental invalid joins.

Once again, the syntax is consistent with DT[i, j, by] with additional arguments allowing for controlling the output further.
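
As a small illustrative sketch of the rolling join mentioned in (5) (toy data, invented column names): match each query time to the most recent observation at or before it (LOCF).

     obs   <- data.table(time = c(1, 4, 9), value = c("a", "b", "c"), key = "time")
     query <- data.table(time = c(3, 8, 10), key = "time")
     obs[query, roll = TRUE]   # 'value' rolls forward: "a", "b", "c"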

  4. do()...

    dplyr's summarise is specially designed for functions that return a single value. If your function returns multiple/unequal values, you will have to resort to do(). You have to know beforehand what all your functions return.

     DT[, list(x[1], y[1]), by = z]                 ## data.table syntax
     DF %>% group_by(z) %>% summarise(x[1], y[1]) ## dplyr syntax
     DT[, list(x[1:2], y[1]), by = z]
     DF %>% group_by(z) %>% do(data.frame(.$x[1:2], .$y[1]))
    
     DT[, quantile(x, 0.25), by = z]
     DF %>% group_by(z) %>% summarise(quantile(x, 0.25))
     DT[, quantile(x, c(0.25, 0.75)), by = z]
     DF %>% group_by(z) %>% do(data.frame(quantile(.$x, c(0.25, 0.75))))
    
     DT[, as.list(summary(x)), by = z]
     DF %>% group_by(z) %>% do(data.frame(as.list(summary(.$x))))
    
  • .SD's equivalent is .

  • In data.table, you can throw pretty much anything in j - the only thing to remember is for it to return a list so that each element of the list gets converted to a column.

  • In dplyr, you cannot do that. You have to resort to do() depending on how sure you are as to whether your function would always return a single value. And it is quite slow.

Once again, data.table's syntax is consistent with DT[i, j, by]. We can just keep throwing expressions in j without having to worry about these things.

Have a look at this SO question and this one. I wonder if it would be possible to express the answer as straightforward using dplyr's syntax...

To summarise, I have particularly highlighted several instances where dplyr's syntax is either inefficient, limited or fails to make operations straightforward. This is particularly because data.table gets quite a bit of backlash about its "harder to read/learn" syntax (like the one pasted/linked above). Most posts that cover dplyr talk about the most straightforward operations. And that is great. But it is important to realise its syntax and feature limitations as well, and I have yet to see a post about them.

data.table has its quirks as well (some of which I have pointed out that we are attempting to fix). We are also attempting to improve data.table's joins as I have highlighted here.

But one should also consider the number of features that dplyr lacks in comparison to data.table.

4. Features

I have pointed out most of the features here and also in this post. In addition:

  • fread - fast file reader has been available for a long time now.

  • fwrite - a parallelised fast file writer is now available. See this post for a detailed explanation on the implementation and #1664 for keeping track of further developments.

  • Automatic indexing - another handy feature to optimise base R syntax as is, internally.

  • Ad-hoc grouping: dplyr automatically sorts the results by grouping variables during summarise(), which may not always be desirable.

  • Numerous advantages in data.table joins (for speed / memory efficiency and syntax) mentioned above.

  • Non-equi joins: allows joins using other operators (<=, <, >, >=) along with all the other advantages of data.table joins (a small sketch follows after this list).

  • Overlapping range joins were implemented in data.table recently. Check this post for an overview with benchmarks.

  • The setorder() function in data.table allows really fast reordering of data.tables by reference.

  • dplyr provides interface to databases using the same syntax, which data.table does not at the moment.

  • data.table provides faster equivalents of set operations (written by Jan Gorecki) - fsetdiff, fintersect, funion and fsetequal, with an additional all argument (as in SQL); a small sketch also follows after this list.

  • data.table loads cleanly with no masking warnings and has a mechanism described here for [.data.frame compatibility when passed to any R package. dplyr changes base functions filter, lag and [ which can cause problems; e.g. here and here.
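
Two quick illustrative sketches of the features mentioned above (toy data and invented column names, just to show the shape of the syntax).

Non-equi join - match each value to the range that contains it:

     trades <- data.table(id = 1:3, price = c(10, 25, 40))
     bands  <- data.table(lo = c(0, 20), hi = c(20, 50), band = c("low", "high"))
     bands[trades, .(id = i.id, price = i.price, band), on = .(lo <= price, hi > price)]

Fast set operations with SQL-like all semantics:

     x <- data.table(a = c(1, 1, 2, 3))
     y <- data.table(a = c(1, 2))
     fsetdiff(x, y)              # rows of x not in y: a = 3
     fsetdiff(x, y, all = TRUE)  # keeps duplicates (SQL ALL): a = 1, 3
     fintersect(x, y)            # rows common to both: a = 1, 2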


Finally:

  • On databases - there is no reason why data.table cannot provide a similar interface, but this is not a priority now. It might get bumped up if users would very much like that feature... not sure.

  • On parallelism - Everything is difficult, until someone goes ahead and does it. Of course it will take effort (being thread safe).

    • Progress is being made currently (in v1.9.7 devel) towards parallelising known time consuming parts for incremental performance gains using OpenMP.
Arun
  • ... So what you're doing (somewhat implicitly) is a subset of rows and that's what `filter` is made for, in dplyr. – talat Dec 31 '14 at 14:09
  • @docendodiscimus, sorry but what's the idiomatic way of doing `DT[, if(any(x>1)) y[1], by=z]` again please? I'm confused now. Should or shouldn't I use filter here? – Arun Dec 31 '14 at 14:21
  • [`summarise()` doesn't have list columns yet](https://github.com/hadley/dplyr/issues/832), but `mutate()` does, and the same issue with `.` exists (yes, it's not idiomatic, but a bug is a bug). – Arun Dec 31 '14 at 14:24
  • You should use a filter, since you are subsetting rows. (In your example, you don't remove any of the z groups since they both have `any(x > 1) == TRUE`. So most likely you'd want to do: `DF %>% group_by(z) %>% filter(any(x > 1)) %>% summarise(y = y[1])`. – talat Dec 31 '14 at 14:25
  • Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackoverflow.com/rooms/67990/discussion-on-answer-by-arun-data-table-vs-dplyr-can-one-do-something-well-the). – Taryn Dec 31 '14 at 14:40
  • @bluefeet: I don't think you did the rest of us any great service by moving that discussion to chat. I was under the impression that Arun was one of the developers and this might have resulted in useful insights. – IRTFM Jan 05 '15 at 21:20
  • @BondedDust Problem was that there were 30+ comments on this post making it incredibly difficult to find anything that would be helpful in them. All the comments were moved to chat, then they could be incorporated into the answer as needed. Extended discussions shouldn't happen in the comments, that is what chat is for. – Taryn Jan 05 '15 at 21:24
  • When I went to chat using your link, it appeared that all of the material following the comment starting "You should use a filter".. was gone. Am I missing something about the SO-chat mechanism? – IRTFM Jan 05 '15 at 21:39
  • @BondedDust There were no additional comments here or in chat after the "You should filter..." comment. It appears that no one continued the discussion either here or in chat. – Taryn Jan 06 '15 at 12:45
  • I think that everywhere you are using assignment by reference (`:=`), the `dplyr` equivalent should also be re-assigning, i.e. `DF <- DF %>% mutate...` instead of just `DF %>% mutate...` – David Arenburg Jan 07 '15 at 15:21
  • Regarding the syntax: I believe `dplyr` can be easier for users who are used to `plyr` syntax, but `data.table` can be easier for users who are used to query-language syntax like `SQL`, and the relational algebra behind it, which is all about tabular data transformation. @Arun you should note that **set operators** are very easily doable by wrapping `data.table` functions, and of course that brings a significant speedup. – jangorecki Jan 14 '15 at 19:33
  • Add dplyr inside a function and you immediately see exponentially complex code. You should add that to your comparison as well – MySchizoBuddy Jun 10 '16 at 20:14
  • @MySchizoBuddy, maybe a gist to show what you specifically talk about? Is it the SE version you're attributing to? – Arun Jun 15 '16 at 08:55
  • @Arun yes when using dplyr is used inside a function you have to use the SE versions along with lazyeval::interp() Does data.table have similar issues when using inside a function – MySchizoBuddy Jun 16 '16 at 09:53
  • @MySchizoBuddy, good point. data.table's `.SD` along with `.SDcols` is much more straightforward than `interp()` AFAIU, at least for most scenarios. That together with the use of `get()` and `mget()` (in rare cases) is sufficient in ~99% of all scenarios. [data.table needs to implement one more thing in the `i`-argument - better scoping rules, but that's an extremely rare scenario.] I'll think of a way to edit this into the post, if I manage to understand `interp()`. Thanks. – Arun Jun 16 '16 at 11:36
  • @Arun I would very much like data.table to support database backends! Imagine `data.table` syntax on a Redshift backend. Perhaps with some local caching layer to avoid costly repeat queries when possible. Wow. – andrew Aug 08 '18 at 16:55
  • I've read this post so many times, and it helped me a lot in understanding data.table and being able to use it better. I, for most cases, prefer data.table over dplyr or pandas or PL/pgSQL. However, I couldn't stop thinking of how to express it. The syntax is *not* easy, clear or verbose. In fact, even after I've used data.table a lot, I often still struggle to comprehend my own code, written literally a week ago. This is a real-life example of a write-only language. https://en.wikipedia.org/wiki/Write-only_language So, let's hope, one day we will be able to use dplyr on data.table. – Ufos Dec 18 '18 at 11:48
  • In c) and b) of "1. Basic aggregation/update operations." I am not getting equivalent results from dplyr and dt. – its.me.adam Mar 05 '20 at 18:41
  • Actually a lot of the dplyr code no longer applies (due to updates?)... This answer could use a refresh as it is such a great resource. – its.me.adam Mar 05 '20 at 20:46
  • @its.me.adam the problem is that this code has been changing multiple times over time, so updating it is actually pointless because after some time you will need to update it again. Looking up manual entries (for the current API) is not that tough. – jangorecki Oct 20 '20 at 08:50
408

Here's my attempt at a comprehensive answer from the dplyr perspective, following the broad outline of Arun's answer (but somewhat rearranged based on differing priorities).

Syntax

There is some subjectivity to syntax, but I stand by my statement that the concision of data.table makes it harder to learn and harder to read. This is partly because dplyr is solving a much easier problem!

One really important thing that dplyr does for you is that it constrains your options. I claim that most single table problems can be solved with just five key verbs filter, select, mutate, arrange and summarise, along with a "by group" adverb. That constraint is a big help when you're learning data manipulation, because it helps order your thinking about the problem. In dplyr, each of these verbs is mapped to a single function. Each function does one job, and is easy to understand in isolation.

You create complexity by piping these simple operations together with %>%. Here's an example from one of the posts Arun linked to:

diamonds %>%
  filter(cut != "Fair") %>%
  group_by(cut) %>%
  summarize(
    AvgPrice = mean(price),
    MedianPrice = as.numeric(median(price)),
    Count = n()
  ) %>%
  arrange(desc(Count))

Even if you've never seen dplyr before (or even R!), you can still get the gist of what's happening because the functions are all English verbs. The disadvantage of English verbs is that they require more typing than [, but I think that can be largely mitigated by better autocomplete.

Here's the equivalent data.table code:

diamondsDT <- data.table(diamonds)
diamondsDT[
  cut != "Fair", 
  .(AvgPrice = mean(price),
    MedianPrice = as.numeric(median(price)),
    Count = .N
  ), 
  by = cut
][ 
  order(-Count) 
]

It's harder to follow this code unless you're already familiar with data.table. (I also couldn't figure out how to indent the repeated [ in a way that looks good to my eye.) Personally, when I look at code I wrote 6 months ago, it's like looking at code written by a stranger, so I've come to prefer straightforward, if verbose, code.

Two other minor factors that I think slightly decrease readability:

  • Since almost every data table operation uses [ you need additional context to figure out what's happening. For example, is x[y] joining two data tables or extracting columns from a data frame? This is only a small issue, because in well-written code the variable names should suggest what's happening.

  • I like that group_by() is a separate operation in dplyr. It fundamentally changes the computation so I think should be obvious when skimming the code, and it's easier to spot group_by() than the by argument to [.data.table.

I also like that the pipe isn't limited to just one package. You can start by tidying your data with tidyr, and finish up with a plot in ggvis. And you're not limited to the packages that I write - anyone can write a function that forms a seamless part of a data manipulation pipe. In fact, I rather prefer the previous data.table code rewritten with %>%:

diamonds %>% 
  data.table() %>% 
  .[cut != "Fair", 
    .(AvgPrice = mean(price),
      MedianPrice = as.numeric(median(price)),
      Count = .N
    ), 
    by = cut
  ] %>% 
  .[order(-Count)]

And the idea of piping with %>% is not limited to just data frames and is easily generalised to other contexts: interactive web graphics, web scraping, gists, run-time contracts, ...)

Memory and performance

I've lumped these together, because, to me, they're not that important. Most R users work with well under 1 million rows of data, and dplyr is fast enough for that size of data that you're not aware of processing time. We optimise dplyr for expressiveness on medium data; feel free to use data.table for raw speed on bigger data.

The flexibility of dplyr also means that you can easily tweak performance characteristics using the same syntax. If the performance of dplyr with the data frame backend is not good enough for you, you can use the data.table backend (albeit with a somewhat restricted set of functionality). If the data you're working with doesn't fit in memory, then you can use a database backend.

All that said, dplyr performance will get better in the long-term. We'll definitely implement some of the great ideas of data.table like radix ordering and using the same index for joins & filters. We're also working on parallelisation so we can take advantage of multiple cores.

Features

A few things that we're planning to work on in 2015:

  • the readr package, to make it easy to get files off disk and into memory, analogous to fread().

  • More flexible joins, including support for non-equi-joins.

  • More flexible grouping like bootstrap samples, rollups and more

I'm also investing time into improving R's database connectors, the ability to talk to web apis, and making it easier to scrape html pages.

hadley
  • Just a side note, I do agree with many of your arguments (although I prefer the `data.table` syntax myself), but you can easily use `%>%` in order to pipe `data.table` operations if you don't like the `[` style. `%>%` is not specific to `dplyr`, rather it comes from a separate package (which you happen to be the co-author of too), so I'm not sure I understand what you are trying to say in most of your **Syntax** paragraph. – David Arenburg Jan 08 '15 at 12:54
  • @DavidArenburg good point. I've re-written syntax to hopefully make more clear what my main points are, and to highlight that you can use `%>%` with data.table – hadley Jan 08 '15 at 13:28
  • Thanks Hadley, this is a useful perspective. Re indenting I typically do `DT[\n\texpression\n][\texpression\n]` (**[gist](https://gist.github.com/brodieG/85271ef3de82509ad036)**) which actually works pretty well. I'm keeping Arun's answer as the answer as he more directly answers my specific questions, which are not so much about the accessibility of syntax, but I think this is a good answer for people trying to get a general feel for the differences / commonalities between `dplyr` and `data.table`. – BrodieG Jan 08 '15 at 13:39
  • A minor point: on distinguishing joins from data.frame column subset, you can do `X[J(Y)]` or `X[.(Y)]`. – Arun Jan 08 '15 at 16:07
  • @BrodieG that's much better, but I still don't love it. – hadley Jan 08 '15 at 16:55
  • Fair enough. While I am a happy `data.table` user (and occasional `dplyr` user), every now and then I feel `data.table` tries to do too much inside `[`. – BrodieG Jan 08 '15 at 17:48
  • @BrodieG putting *"everything"* which is needed for relational algebra into the `[` operator forms **data.table queries** which can be directly analogous (and a deep extension) to **SQL queries**. At the moment I don't see any of the `[` args as redundant (except `drop`), and yes there are already many of them :) – jangorecki Jan 08 '15 at 20:01
  • @JanGorecki, I actually don't have any issues with how many arguments `data.table` has (I phrased this poorly), but for example sometimes I feel that `i` shouldn't be used for both joins and indexing, though the compact syntax is nice (as opposed to `merge`), so I basically have mixed feelings about it. – BrodieG Jan 08 '15 at 20:22
  • Why work on fastread when there is already `fread()`? Wouldn't time be spent better on improving fread() or working on other (underdeveloped) things? – EDi Jan 28 '15 at 12:17
  • @EDi because fastread and fread have different goals. fastread will also provide a C++ api so you can use the same underlying code to read in new file formats – hadley Jan 28 '15 at 13:23
  • The API of `data.table` is founded on a massive abuse of the `[]` notation. That is its greatest strength and its greatest weakness. – Paul Jan 28 '16 at 11:57
  • @DavidArenburg using the pipe operator as in the example above, you notice there are 4 dots (.). Two of them are for the data.table object itself while the other two are not. This to me is just confusing. – MySchizoBuddy Jun 20 '16 at 19:54
  • Thank you for the permission to use data.table on large data! That alone was worth the answer. – piccolbo Apr 30 '18 at 22:46
  • I prefer the `data.table` snippet you show to the `dplyr` one - and I used `data.table` only once for a small task 18 months ago. Quite readable and concise. I _do_ wish there were a reasonable parallel in `python`-land. The `datatable` package in `python` is slowly improving but does not begin to touch the R `data.table`. – StephenBoesch Mar 30 '20 at 00:17
68

In direct response to the Question Title...

dplyr definitely does things that data.table can not.

Your point #3

dplyr abstracts (or will) potential DB interactions

is a direct answer to your own question but isn't elevated to a high enough level. dplyr is truly an extendable front-end to multiple data storage mechanisms, whereas data.table is an extension to a single one.

Look at dplyr as a back-end agnostic interface, with all of the targets using the same grammar, where you can extend the targets and handlers at will. data.table is, from the dplyr perspective, one of those targets.

You will never (I hope) see a day that data.table attempts to translate your queries to create SQL statements that operate with on-disk or networked data stores.
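
To make the back-end idea concrete, here is a minimal sketch (the connection, table and column names are invented, and the database interface now lives in the dbplyr package): the same verbs generate SQL instead of running in memory.

    library(dplyr)
    con <- DBI::dbConnect(RSQLite::SQLite(), "flights.sqlite")  # hypothetical database file
    flights_db <- tbl(con, "flights")                           # hypothetical table
    flights_db %>%
      group_by(carrier) %>%
      summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
      show_query()                                              # inspect the SQL dplyr generates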

dplyr can possibly do things data.table will not or might not do as well.

Based on the design of working in-memory, data.table could have a much more difficult time extending itself into parallel processing of queries than dplyr.


In response to the in-body questions...

Usage

Are there analytical tasks that are a lot easier to code with one or the other package for people familiar with the packages (i.e. some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing).

This may seem like a punt but the real answer is no. People familiar with the tools seem to use either the one most familiar to them or the one that is actually right for the job at hand. With that being said, sometimes you want to present a particular readability, sometimes a level of performance, and when you need a high enough level of both you may just need another tool to go along with what you already have to make clearer abstractions.

Performance

Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. another.

Again, no. data.table excels at being efficient in everything it does where dplyr gets the burden of being limited in some respects to the underlying data store and registered handlers.

This means when you run into a performance issue with data.table you can be pretty sure it is in your query function and if it is actually a bottleneck with data.table then you've won yourself the joy of filing a report. This is also true when dplyr is using data.table as the back-end; you may see some overhead from dplyr but odds are it is your query.

When dplyr has performance issues with back-ends you can get around them by registering a function for hybrid evaluation or (in the case of databases) manipulating the generated query prior to execution.

Also see the accepted answer to when is plyr better than data.table?

Thell
  • Can't dplyr wrap a data.table with tbl_dt? Why not just get the best of both worlds? – aaa90210 Dec 09 '14 at 09:08
  • @aaa90210, see this **[post](http://stackoverflow.com/questions/27511604/dplyr-on-data-table-am-i-really-using-data-table)** – BrodieG Dec 17 '14 at 01:26
  • You forgot to mention the reverse statement *"data.table definitely does things that dplyr can not"*, which is also true. – jangorecki Jan 05 '15 at 16:26
  • @JanGorecki That's because I am not aware of a _capability_ that `data.table` has that `dplyr` isn't capable of either directly or via handlers. There are _features_ (as discussed in terms of speed, memory and syntax) that have been discussed as differences in response to the OP's qualitative (_poorly_) portion of the question, yet I don't recall seeing _capabilities_ that couldn't/can't be generalized and abstracted up a layer. – Thell Jan 05 '15 at 18:53
  • Arun's answer explains it well. The most important (in terms of performance) would be fread, update by reference, rolling joins, overlapping joins. I believe there is no package (not only dplyr) which can compete with those features. A nice example is the last slide from [this](https://github.com/Rdatatable/data.table/wiki/talks/EARL2014_OverlapRangeJoin_Arun.pdf) presentation. – jangorecki Jan 05 '15 at 19:46
  • Totally, data.table is why I still use R. Otherwise I'd use pandas. It's even better/faster than pandas. – marbel Dec 02 '16 at 04:01
  • I like data.table because of its simplicity and resemblance to SQL syntax structure. My job involves doing very intense ad hoc data analysis and graphics every day for statistical modeling, and I really need a tool simple enough to do complicated things. Now I can reduce my toolkit to only data.table for data and lattice for graphs in my everyday job. To give an example, I can even do operations like this: `DT[group==1, y_hat:=predict(fit1, data=.SD), ]`, which is really neat and I consider it a great advantage over SQL in the classic R environment. – xappppp Dec 30 '16 at 21:06
9

Reading Hadley and Arun's answers one gets the impression that those who prefer dplyr's syntax would have in some cases to switch over to data.table or compromise for long running times.

But as some have already mentioned, dplyr can use data.table as a backend. This is accomplished using the dtplyr package, which recently had its 1.0.0 release. Learning dtplyr incurs practically zero additional effort.

When using dtplyr one uses the function lazy_dt() to declare a lazy data.table, after which standard dplyr syntax is used to specify operations on it. This would look something like the following:

new_table <- mtcars2 %>% 
  lazy_dt() %>%
  filter(wt < 5) %>% 
  mutate(l100k = 235.21 / mpg) %>% # liters / 100 km
  group_by(cyl) %>% 
  summarise(l100k = mean(l100k))

new_table

#> Source: local data table [?? x 2]
#> Call:   `_DT1`[wt < 5][, `:=`(l100k = 235.21/mpg)][, .(l100k = mean(l100k)), 
#>     keyby = .(cyl)]
#> 
#>     cyl l100k
#>   <dbl> <dbl>
#> 1     4  9.05
#> 2     6 12.0 
#> 3     8 14.9 
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results

The new_table object is not evaluated until you call as.data.table()/as.data.frame()/as_tibble() on it, at which point the underlying data.table operations are executed.
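
For example, to force the computation and collect the result:

as.data.table(new_table)   # or as_tibble(new_table) / as.data.frame(new_table)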

I've recreated a benchmark analysis done by the data.table author Matt Dowle back in December 2018, which covers the case of operations over large numbers of groups. I've found that dtplyr indeed enables, for the most part, those who prefer the dplyr syntax to keep using it while enjoying the speed offered by data.table.

Iyar Lin
  • you probably won't have many features there for which there is no API in dplyr, like sub-assignment by reference, rolling joins, overlapping joins, non-equi joins, update on join, and probably many others. – jangorecki Jun 23 '20 at 22:18
  • I have to admit none of those features rings a bell. Could you please provide concrete examples in data.table? – Iyar Lin Jun 25 '20 at 02:54
  • [`?data.table` examples](https://rdatatable.gitlab.io/data.table/reference/data.table.html#examples) - all I mentioned, except for overlapping joins, are there – jangorecki Jun 26 '20 at 17:49
  • Update on join, rolling, overlapping joins can be constructed straightforwardly with several parts of a pipe. – Arthur Yip Oct 16 '20 at 10:38
  • See fuzzyjoin for non-equi joins (seems to have even more features and functionality than data.table's non-equi joins). – Arthur Yip Oct 16 '20 at 10:38