
I can't wrap my mind around how .SD works and therefore I can't integrate it in my workflow.

set.seed(10238)
DT <- data.table(A = rep(1:3, each = 5), B = rep(1:5, 3),
                 C = sample(15), D = sample(15)) 

A data.table is a list: each column name is the name of a list element, and each column is the vector stored in that element. Therefore:

lapply(DT, mean)
$A
[1] 2
$B
[1] 3
$C
[1] 8
$D
[1] 8

When I use a column name in the j expression, it behaves as a vector holding the data in that column. Therefore:

DT[, mean(B)]
[1] 3

But it gets tricky when by is used: how should I think about B when grouping with by is performed?

DT[, mean(B), by=A]
   A V1
1: 1  3
2: 2  3
3: 3  3

DT[, print(B), by=A]
[1] 1 2 3 4 5
[1] 1 2 3 4 5
[1] 1 2 3 4 5

Here it looks like by=A splits the vector B into 3 vectors, one per group in A. How should one see B here? Is it a list of vectors, is it a data.table, and how is the resulting data.table reconstructed?

DT[, lapply(B, mean), by=A]
   A V1 V2 V3 V4 V5
1: 1  1  2  3  4  5
2: 2  1  2  3  4  5
3: 3  1  2  3  4  5

I can't make sense of that. I expected the same result as DT[, mean(B), by=A]: lapply is fed the 3 vectors individually and should apply mean to each of them, and the resulting list should be reconstructed into the data.table seen before.
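A minimal sketch of what seems to be going on here, using a cut-down copy of the table above (the names `classes`, `per_element` and `per_group` are made up for illustration). Inside j, B is the plain vector for the current group, so lapply(B, mean) iterates over the five elements of that vector rather than over three group vectors:

```r
library(data.table)

DT <- data.table(A = rep(1:3, each = 5), B = rep(1:5, 3))

# Inside j, B is an ordinary integer vector holding the current group's values.
classes <- DT[, .(cls = class(B), len = length(B)), by = A]

# lapply(B, mean) walks over the 5 elements of that vector and returns a list
# of 5 length-1 items; data.table turns each list item into a column (V1..V5).
per_element <- DT[, lapply(B, mean), by = A]

# Wrapping the vector in a list gives lapply a single item per group,
# which yields one mean per group, like DT[, mean(B), by = A].
per_group <- DT[, lapply(list(B = B), mean), by = A]
```

Under that reading, the by= grouping is done by data.table itself, one group at a time, and j only ever sees one group's vector.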

Finally, I'm looking to convert the class of a few columns, and I don't understand why I have to use:

DT[, names(DT) := lapply(.SD, as.character)] 

and not :

DT[, names(DT) := lapply(DT, as.character)] 

It should be the same: lapply here applies as.character to each column of DT and returns a list of character vectors, named after the columns and in the same order.

ChiseledAbs
    `.SD` is a data.table containing the **S**ubset of x's **D**ata for each group, excluding any columns used in by (or keyby). See `?special-symbols` – Jaap Feb 06 '17 at 14:06
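A small sketch of that definition, reusing the question's DT (the result names `shape`, `means` and `means_cd` are hypothetical). It only inspects what .SD looks like in each group:

```r
library(data.table)

set.seed(10238)
DT <- data.table(A = rep(1:3, each = 5), B = rep(1:5, 3),
                 C = sample(15), D = sample(15))

# For each group of A, .SD is a data.table holding that group's rows
# of every column except the grouping column A.
shape <- DT[, .(rows = nrow(.SD), cols = ncol(.SD),
                colnames = paste(names(.SD), collapse = ",")), by = A]

# Since .SD is itself a list of columns, lapply works column by column
# within each group.
means <- DT[, lapply(.SD, mean), by = A]

# .SDcols restricts which columns end up in .SD.
means_cd <- DT[, lapply(.SD, mean), by = A, .SDcols = c("C", "D")]
```

Printing `shape` should show three rows, one per group, each reporting 5 rows and the columns B, C and D.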

1 Answer


You'd get a much better reception if you turned the focus towards the documentation.

I just typed ?.SD. The first paragraph points to the vignettes, the definition Jaap provided, and the example section at the bottom contains this :

DT = data.table(x=rep(c("b","a","c"),each=3), v=c(1,1,1,2,2,1,1,2,2), y=c(1,3,6), a=1:9, b=9:1)
DT
X = data.table(x=c("c","b"), v=8:7, foo=c(4,2))
X

DT[.N]                                 # last row, only special symbol allowed in 'i'
DT[, .N]                               # total number of rows in DT
DT[, .N, by=x]                         # number of rows in each group
DT[, .SD, .SDcols=x:y]                 # select columns 'x' and 'y'
DT[, .SD[1]]                           # first row of all columns
DT[, .SD[1], by=x]                     # first row of 'y' and 'v' for each group in 'x'
DT[, c(.N, lapply(.SD, sum)), by=x]    # get rows *and* sum columns 'v' and 'y' by group
DT[, .I[1], by=x]                      # row number in DT corresponding to each group
DT[, .N, by=rleid(v)]                  # get count of consecutive runs of 'v'
DT[, c(.(y=max(y)), lapply(.SD, min)), 
        by=rleid(v), .SDcols=v:b]      # compute 'j' for each consecutive runs of 'v'
DT[, grp := .GRP, by=x]                # add a group counter
X[, DT[.BY, y, on="x"], by=x]          # join within each group

So, please say what is wrong with the English or the examples. Tell people and show people that you've read the documentation by referring to it.

On the last part :

Finally I'm looking to convert the class of few columns, I don't understand why I have to use :

DT[, names(DT) := lapply(.SD, as.character)]  

and not :

DT[, names(DT) := lapply(DT, as.character)]

You can in this case. They are the same because there is no grouping (by= or keyby=) and no i subset either. Why do you think you have to use the first one? You're starting off on the wrong foot by not referring to where you have read that "you have to". The first one is preferable because using .SD in that case saves another variable name repetition of DT (the general principle explained here). If memory efficiency is important then a simple for loop avoids creating the whole RHS of := first; the for loop way does it column by column (more detail here).
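A hedged sketch of the three variants side by side, on copies of a small table so each can be compared afterwards (DT1, DT2 and DT3 are illustrative names). The loop uses data.table's `set()`, which is an assumption about what the "for loop way" looks like, since the linked detail is not reproduced here:

```r
library(data.table)

DT1 <- data.table(A = rep(1:3, each = 5), B = rep(1:5, 3))
DT2 <- copy(DT1)
DT3 <- copy(DT1)

# With no i and no by, .SD is simply all of the table,
# so these two calls do the same thing.
DT1[, names(DT1) := lapply(.SD, as.character)]
DT2[, names(DT2) := lapply(DT2, as.character)]

# Column-by-column alternative: each column is converted and assigned in turn,
# so the full list of converted columns is never built up front.
for (col in names(DT3)) {
  set(DT3, j = col, value = as.character(DT3[[col]]))
}
```

All three tables should end up with character columns holding the same values; the difference is only in how much temporary memory the right-hand side needs.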

Matt Dowle