How to apply a function to a subset of data.table using by and exposing all columns to the function?

Question

When slicing a data.table by group(s), variables used to slice the data are not in the subset during the function execution. I demonstrate this using debugonce.

library(data.table)
x <- data.table(a = rep(letters[1:4], each = 3), b = rep(c("a", "b"), each = 6), c = rnorm(12))

myfun <- function(y) paste(y$a, y$b, y$c, collapse = "")

> debugonce(myfun)
> x[, myfun(.SD), by = .(b, a)]
debugging in: myfun(.SD)
debug: paste(y$a, y$b, y$c, collapse = "")
Browse[2]> y
            c
1: -1.2662416
2:  0.9818497
3: -0.5395385

What I'm after is the functionality of the split-sapply paradigm, where I would slice a data.frame according to factor(s) and apply the function to all columns, that is, also including the variables which have been used to slice it (demonstrated below).

> debugonce(myfun)

> sapply(split(x, f = list(x$b, x$a)), FUN = myfun)
debugging in: FUN(X[[i]], ...)
debug: paste(y$a, y$b, y$c, collapse = "")
Browse[2]> y
a b          c
1: a a -1.2662416
2: a a  0.9818497
3: a a -0.5395385

[This comment by Matt Dowle on a question regarding `.SD`](https://stackoverflow.com/questions/8508482/what-does-sd-stand-for-in-data-table-in-r#comment10533466_8509301) shows an alternative to using the debugger: `x[, print(.SD), by = .(b, a)]`. — Uwe, Jul 25 '17 at 07:52

score 21 · Accepted Answer · edited Jun 20 '20 at 09:12

The OP's function takes a list as its argument, and that list should contain all columns of the data.table, including the columns used for grouping in by.

According to help(".SD"):

.SD is a data.table containing the Subset of x's Data for each group, excluding any columns used in by (or keyby).

(emphasis mine)

.BY is a list containing a length 1 vector for each item in by. This can be useful when by is not known in advance.

So, .BY and .SD complement each other to access all columns of the data.table.
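
As a quick check of which columns end up where, the following minimal sketch records the names seen in .SD and .BY for each group. The data mirrors the shape of the question's table, but the values of c are arbitrary integers chosen here for illustration.

```r
library(data.table)

# Sample data shaped like the question's (values of c are arbitrary here)
x <- data.table(a = rep(letters[1:4], each = 3),
                b = rep(c("a", "b"), each = 6),
                c = 1:12)

# For each group, record the column names visible in .SD and in .BY
cols <- x[, .(sd_cols = paste(names(.SD), collapse = ","),
              by_cols = paste(names(.BY), collapse = ",")),
          by = .(b, a)]
print(cols)
```

In every group, sd_cols should hold only the non-grouping column and by_cols the two grouping columns.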

Instead of explicitly repeating the by columns in the function call (as named list elements, so that myfun can look them up by name)

x[, myfun(c(list(b = b, a = a), .SD)), by = .(b, a)]

we can use

x[, myfun(c(.BY, .SD)), by = .(b, a)]

   b a                                                                 V1
1: a a    a a -1.02091215130492a a -0.295107569536843a a 0.77776326093429
2: a b b a -0.369037832486311b a -0.716211663822323b a -0.264799143319049
3: b c      c b -1.39603530693486c b 1.4707902839894c b 0.721925347069227
4: b d   d b -1.15220308230505d b -0.736782242593426d b 0.420986999145651

The OP has used debugonce() to show the argument passed to myfun():

> debugonce(myfun)
> x[, myfun(c(.BY, .SD)), by = .(b, a)]
debugging in: myfun(c(.BY, .SD))
debug at #1: paste(y$a, y$b, y$c, collapse = "")
Browse[2]> y
$b
[1] "a"

$a
[1] "a"

$c
[1] -1.0209122 -0.2951076  0.7777633

Another example

With another sample data set and function it might be easier to exemplify the core of the question:

x <- data.table(a = rep(letters[3:6], each = 3), b = rep(c("x", "y"), each = 6), c = 1:12)
myfun <- function(y) paste(y$a, y$b, y$c, sep = "/", collapse = "-")

x[, myfun(.SD), by = .(b, a)]

   b a             V1
1: x c    //1-//2-//3
2: x d    //4-//5-//6
3: y e    //7-//8-//9
4: y f //10-//11-//12

So, columns b and a do appear in the output as grouping variables, but they are not passed to the function via .SD.

Now, with .BY complementing .SD

x[, myfun(c(.BY, .SD)), by = .(b, a)]

   b a                   V1
1: x c    c/x/1-c/x/2-c/x/3
2: x d    d/x/4-d/x/5-d/x/6
3: y e    e/y/7-e/y/8-e/y/9
4: y f f/y/10-f/y/11-f/y/12

all columns of the data.table are passed to the function.
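
To tie this back to the split-sapply approach from the question, here is a rough comparison of both routes on the same data. It assumes a data.table version whose split() method accepts a by argument and keeps the grouping columns in each piece; the names res_dt and res_split are introduced here for illustration only.

```r
library(data.table)

x <- data.table(a = rep(letters[3:6], each = 3),
                b = rep(c("x", "y"), each = 6),
                c = 1:12)
myfun <- function(y) paste(y$a, y$b, y$c, sep = "/", collapse = "-")

# data.table route: grouping columns come from .BY, the rest from .SD
res_dt <- x[, myfun(c(.BY, .SD)), by = .(b, a)]$V1

# split-sapply route: each piece still carries its grouping columns
res_split <- sapply(split(x, by = c("b", "a")), myfun)

# Both routes should produce the same set of strings
setequal(res_split, res_dt)
```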

Separate arguments in the function call

Roland has suggested passing .BY and .SD to the function as separate parameters. Indeed, .BY is a list object and .SD is a data.table object. A data.table is essentially also a list, which is what allowed us to use c(.BY, .SD). There might be cases where the difference matters.

To verify, we can define a function which prints str() as a side effect. The function is only called for the first group (.GRP == 1L).

myfun1 <- function(y) str(y)
x[, if (.GRP == 1L) myfun1(.SD), by = .(b, a)]

Classes ‘data.table’ and 'data.frame':    3 obs. of  1 variable:
 $ c: int  1 2 3
 - attr(*, ".internal.selfref")=<externalptr> 
 - attr(*, ".data.table.locked")= logi TRUE
Empty data.table (0 rows) of 2 cols: b,a

x[, if (.GRP == 1L) myfun1(.BY), by = .(b, a)]

List of 2
 $ b: chr "x"
 $ a: chr "c"
Empty data.table (0 rows) of 2 cols: b,a

x[, if (.GRP == 1L) myfun1(c(.BY, .SD)), by = .(b, a)]

List of 3
 $ b: chr "x"
 $ a: chr "c"
 $ c: int [1:3] 1 2 3
Empty data.table (0 rows) of 2 cols: b,a
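
Roland's suggestion of separate arguments could be sketched as follows. The function myfun2 and its parameter names by and sd are hypothetical and not part of the OP's code; the sketch only illustrates that .SD then reaches the function as a data.table rather than as a plain list.

```r
library(data.table)

x <- data.table(a = rep(letters[3:6], each = 3),
                b = rep(c("x", "y"), each = 6),
                c = 1:12)

# Hypothetical variant of myfun: grouping values and the group's data
# arrive as two separate arguments
myfun2 <- function(by, sd) {
  paste(by$a, by$b, sd$c, sep = "/", collapse = "-")
}

res <- x[, myfun2(.BY, .SD), by = .(b, a)]
print(res)
```

The result should match the output of myfun(c(.BY, .SD)) shown earlier, since the length-1 values in by are recycled by paste().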

Additional links

Besides help(".SD"), the comments and answers to the following SO questions might be useful:

Note that this answer only works because `myfun` accepts a list. I would modify the function and pass `.BY` and `.SD` as separate parameters. — Roland, Jul 26 '17 at 06:27
@Roland `myfun` was defined by the OP to process _all_ columns of the data.table, so it purposefully takes a list argument. I agree that it would be cleaner to pass `.SD` and `.BY` separately. — Uwe, Jul 26 '17 at 07:31
