151

What is the best way to assign to multiple columns using data.table? For example:

f <- function(x) {c("hi", "hello")}
x <- data.table(id = 1:10)

I would like to do something like this (of course this syntax is incorrect):

x[ , (col1, col2) := f(), by = "id"]

And to extend that, I may have many columns with names stored in a variable (say col_names) and I would like to do:

x[ , col_names := another_f(), by = "id", with = FALSE]

What is the correct way to do something like this?

Moody_Mudskipper
  • 39,313
  • 10
  • 88
  • 124
Alex
  • 17,745
  • 33
  • 112
  • 182
  • 1
    This looks like it has been answered: http://stackoverflow.com/questions/11308754/add-multiple-columns-to-r-data-table-in-one-function-call – Alex Jul 27 '12 at 20:52
  • Alex, That answer is close but it doesn't seem to work in combination with `by` as @Christoph_J is correct to say. Link to your question added to [FR#2120](https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2120&group_id=240&atid=978) "Drop needing with=FALSE for LHS of :=", so it won't get forgotten to revisit. – Matt Dowle Aug 08 '12 at 15:29
  • To be clear, `f()` is a function returning multiple values, one for each of your columns. – smci May 04 '18 at 06:10

2 Answers2

174

This now works in v1.8.3 on R-Forge. Thanks for highlighting it!

x <- data.table(a = 1:3, b = 1:6) 
f <- function(x) {list("hi", "hello")} 
x[ , c("col1", "col2") := f(), by = a][]
#    a b col1  col2
# 1: 1 1   hi hello
# 2: 2 2   hi hello
# 3: 3 3   hi hello
# 4: 1 4   hi hello
# 5: 2 5   hi hello
# 6: 3 6   hi hello

x[ , c("mean", "sum") := list(mean(b), sum(b)), by = a][]
#    a b col1  col2 mean sum
# 1: 1 1   hi hello  2.5   5
# 2: 2 2   hi hello  3.5   7
# 3: 3 3   hi hello  4.5   9
# 4: 1 4   hi hello  2.5   5
# 5: 2 5   hi hello  3.5   7
# 6: 3 6   hi hello  4.5   9 

mynames = c("Name1", "Longer%")
x[ , (mynames) := list(mean(b) * 4, sum(b) * 3), by = a]
#     a b col1  col2 mean sum Name1 Longer%
# 1: 1 1   hi hello  2.5   5    10      15
# 2: 2 2   hi hello  3.5   7    14      21
# 3: 3 3   hi hello  4.5   9    18      27
# 4: 1 4   hi hello  2.5   5    10      15
# 5: 2 5   hi hello  3.5   7    14      21
# 6: 3 6   hi hello  4.5   9    18      27


x[ , get("mynames") := list(mean(b) * 4, sum(b) * 3), by = a][]  # same
#    a b col1  col2 mean sum Name1 Longer%
# 1: 1 1   hi hello  2.5   5    10      15
# 2: 2 2   hi hello  3.5   7    14      21
# 3: 3 3   hi hello  4.5   9    18      27
# 4: 1 4   hi hello  2.5   5    10      15
# 5: 2 5   hi hello  3.5   7    14      21
# 6: 3 6   hi hello  4.5   9    18      27

x[ , eval(mynames) := list(mean(b) * 4, sum(b) * 3), by = a][]   # same
#    a b col1  col2 mean sum Name1 Longer%
# 1: 1 1   hi hello  2.5   5    10      15
# 2: 2 2   hi hello  3.5   7    14      21
# 3: 3 3   hi hello  4.5   9    18      27
# 4: 1 4   hi hello  2.5   5    10      15
# 5: 2 5   hi hello  3.5   7    14      21
# 6: 3 6   hi hello  4.5   9    18      27

Older version using the with argument (we discourage this argument when possible):

x[ , mynames := list(mean(b) * 4, sum(b) * 3), by = a, with = FALSE][] # same
#    a b col1  col2 mean sum Name1 Longer%
# 1: 1 1   hi hello  2.5   5    10      15
# 2: 2 2   hi hello  3.5   7    14      21
# 3: 3 3   hi hello  4.5   9    18      27
# 4: 1 4   hi hello  2.5   5    10      15
# 5: 2 5   hi hello  3.5   7    14      21
# 6: 3 6   hi hello  4.5   9    18      27
MichaelChirico
  • 31,197
  • 13
  • 98
  • 169
Matt Dowle
  • 56,107
  • 20
  • 160
  • 217
  • Thanks for this answer and the examples. How should I modify the following line in order to get two columns for each objectName from the dim output, rather than one column with two rows? `data.table(objectName=ls())[,c("rows","cols"):=dim(get(objectName)),by=objectName]` (I'm using `data.table` 1.8.11) – dnlbrky May 19 '14 at 02:00
  • @dnlbrky `dim` returns a vector so converting that to type `list` should rotate it; e.g. `[,c("rows","cols"):=as.list(dim(get(objectName))),by=objectNa‌​me]`. Trouble is that `as.list` has call overhead and also copies the small vector. If efficiency is a problem as number of groups rises then please let us know. – Matt Dowle May 21 '14 at 11:49
  • Thanks @Matt_Dowle. I had tried `list` but not `as.list`. Speed isn't an issue. Just wanted a quick way to find objects in the environment that had a certain number of columns or rows. This is off topic, but... what do you think about adding NCOL to `tables()`? – dnlbrky May 23 '14 at 00:34
  • @dnlbrky Great idea, done. Just committed to v1.9.3 on R-Forge. – Matt Dowle May 25 '14 at 00:20
  • 1
    Hi Matt. The first example in your second code block (i.e. `x[,mynames:=list(mean(b)*4,sum(b)*3),by=a,with=FALSE][]`) now throws a warning, so maybe remove it? On a related note, has anyone suggested that, with `options(datatable.WhenJisSymbolThenCallingScope=TRUE)`, an assignment like `x[,mynames:=list(mean(b)*4,sum(b)*3),by=a]` should in fact work? Seems like that would be consistent with the other changes, though I guess it might break too much existing user code (?). – Josh O'Brien Nov 29 '16 at 17:49
  • @JoshO'Brien Matt removed that option in response to a bug shortly after your comment, fyi https://github.com/Rdatatable/data.table/issues/1952 – Frank Feb 10 '17 at 21:02
  • @MattDowle can you make it such that this syntax would work too? `x[, .(Col1, Col2) := 'somevalue']` This is particularly useful because you can then use RStudio's autocomplete for column names as you type them. – Merik Nov 29 '17 at 02:37
  • @Merik Seems like a good idea, should be possible. Please file a feature request on GitHub [here](https://github.com/Rdatatable/data.table/issues/new) - thanks. – Matt Dowle Nov 29 '17 at 02:44
  • I don't think by=a is necessary; not sure why it would be. I tried and it worked. – Wassadamo Feb 23 '18 at 20:33
  • 1
    @PanFrancisco Without `by=a` it will work, but return a different answer. The `mean(a)` and `sum(a)` aggregates are being recycled within each group when `by=a`. Without `by=a` it just sticks the `mean` and `sum` for the entire column into each cell (i.e. different numbers). – Matt Dowle Feb 24 '18 at 00:44
  • 1
    @MattDowle what if my function already returns named list, is there anyway I can add the columns to the dt without having to name them again? e.g. f – Feng Jiang Aug 05 '18 at 19:50
  • @Jfly That would be a good new question which would likely lead to a feature request filed on GitHub. Perhaps something like `x[, {ans=f(); names(ans):=ans}, by=a]` could be implemented. That syntax conveys the intent quite nicely to my eye. What you think? – Matt Dowle Aug 06 '18 at 17:30
  • Why do you use mynames with and without parenthesis? Is there a difference? – Herman Toothrot Oct 25 '18 at 10:28
  • @HermanToothrot `mynames:=` assigns to the column called `"mynames"`. `(mynames):=` assigns to the column name held in the `mynames` variable in calling scope. – Matt Dowle Nov 02 '18 at 20:18
58

The following shorthand notation might be useful. All credit goes to Andrew Brooks, specifically this article.

dt[,`:=`(avg=mean(mpg), med=median(mpg), min=min(mpg)), by=cyl]
Gerry
  • 1,408
  • 2
  • 13
  • 20