I can't wrap my mind around how .SD works, and therefore I can't integrate it into my workflow.
library(data.table)
set.seed(10238)
DT <- data.table(A = rep(1:3, each = 5), B = rep(1:5, 3),
                 C = sample(15), D = sample(15))
A data.table is a list: each column name is the name of a list element, and each column is the vector stored in that element. Therefore:
lapply(DT, mean)
$A
[1] 2
$B
[1] 3
$C
[1] 8
$D
[1] 8
When I use a column name in the j-expression, it can be seen as a vector representing the data in that column. Therefore:
DT[, mean(B)]
[1] 3
But it gets tricky when the by argument is used: how should I think about a column while grouping with by is being performed?
DT[, mean(B), by=A]
A V1
1: 1 3
2: 2 3
3: 3 3
DT[, print(B), by=A]
[1] 1 2 3 4 5
[1] 1 2 3 4 5
[1] 1 2 3 4 5
Here it looks like by=A splits the vector B into 3 vectors based on the groups present in A. How should one think of B here? Is it a list of vectors, or a data.table? And how is the resulting data.table reconstructed?
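One quick check that might help pin down what B is inside j when grouping: ask for its class and length per group. This is only a sketch; the column names cls and len are made up for illustration.

```r
library(data.table)

set.seed(10238)
DT <- data.table(A = rep(1:3, each = 5), B = rep(1:5, 3),
                 C = sample(15), D = sample(15))

# For each group of A, record what kind of object B is and how long it is.
# `cls` and `len` are illustrative names, not anything data.table defines.
res <- DT[, .(cls = class(B), len = length(B)), by = A]
print(res)
```

If B were a list of vectors or a data.table, the reported class would presumably say so.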
DT[, lapply(B, mean), by=A]
A V1 V2 V3 V4 V5
1: 1 1 2 3 4 5
2: 2 1 2 3 4 5
3: 3 1 2 3 4 5
I can't make sense of that: I expected the same result as DT[, mean(B), by=A], since lapply is being fed the 3 vectors individually and should apply mean to each of them, and the resulting list should be reassembled into the data.table seen before.
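One experiment that might narrow this down is to compare lapply on the bare column with lapply on a one-element list holding that column, and with .SD restricted to B through .SDcols. This is a sketch, not a definitive explanation; the result names per_element, per_group and per_group_sd are made up for illustration.

```r
library(data.table)

set.seed(10238)
DT <- data.table(A = rep(1:3, each = 5), B = rep(1:5, 3),
                 C = sample(15), D = sample(15))

# lapply over the bare vector B: mean() is called once per element of B
per_element <- DT[, lapply(B, mean), by = A]

# lapply over a one-element list holding B: mean() is called once per group
per_group <- DT[, lapply(list(B), mean), by = A]

# Same idea through .SD, restricted to column B with .SDcols
per_group_sd <- DT[, lapply(.SD, mean), by = A, .SDcols = "B"]

print(per_element)
print(per_group)
print(per_group_sd)
```

Comparing the shapes of the three results should show whether lapply is iterating over the values of B or over whole columns.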
Finally, I'm looking to convert the class of a few columns, and I don't understand why I have to use:
DT[, names(DT) := lapply(.SD, as.character)]
and not:
DT[, names(DT) := lapply(DT, as.character)]
It should be the same: lapply here applies as.character to each column of DT and returns a list of character vectors, named after the columns and in the same order.
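To test that claim directly, the two forms can be run on separate copies and the results compared. A minimal sketch, assuming no i subset and no by; D1 and D2 are illustrative names.

```r
library(data.table)

set.seed(10238)
DT <- data.table(A = rep(1:3, each = 5), B = rep(1:5, 3),
                 C = sample(15), D = sample(15))

# Work on copies, because := modifies a data.table by reference
D1 <- copy(DT)
D2 <- copy(DT)

D1[, names(D1) := lapply(.SD, as.character)]  # version using .SD
D2[, names(D2) := lapply(D2, as.character)]   # version using the table itself

# Compare the two results and inspect the resulting column classes
same <- isTRUE(all.equal(D1, D2))
classes <- sapply(D1, class)
print(same)
print(classes)
```

This only covers the ungrouped case with no row subset, so it says nothing about how the two forms behave once i or by is involved.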