Questions tagged [data.table]

The R data.table package is an extension of data.frame built for fast in-memory data analysis. Use the dt tag for the DataTables package with Shiny (DT).

R's data.table package provides an enhanced version of data.frame including fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast overlapping range joins, fast add/modify/delete of columns by reference by group using no copies at all, and a fast file reader: fread. It has a natural syntax: DT[where|order, select|update, by]. SQL-inspired syntax enables joins within [] by using on to specify matching columns. These queries can be chained together just by adding another one on the end: DT[...][...].
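As a rough illustration of the DT[where, select|update, by] form, the on join, and chaining described above, here is a minimal sketch. The tables and column names (DT, lookup, id, v, label) are made up for the example and are not part of the tag description.

```r
library(data.table)

# Hypothetical example data; names are illustrative assumptions.
DT <- data.table(id = c("a", "b", "a", "c", "b"), v = c(1, 2, 3, 4, 5))
lookup <- data.table(id = c("a", "b", "c"), label = c("alpha", "beta", "gamma"))

# DT[where, select, by]: keep rows with v > 1, then sum v within each id.
agg <- DT[v > 1, .(total = sum(v)), by = id]

# Join inside [] using `on` to name the matching column.
joined <- DT[lookup, on = "id"]

# Update by reference: `:=` adds a per-group mean column in place.
DT[, v_mean := mean(v), by = id]

# Chaining: append another [] to query the result of the previous one.
top <- DT[, .(total = sum(v)), by = id][order(-total)]
```

Note that `:=` modifies DT itself rather than returning a modified copy, which is why the third query is not assigned to a new name.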

The aggregation features are analogous to stats::ave, plyr::ddply, dplyr::group_by and Python's pandas, but faster.
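To make the analogy concrete, the sketch below puts a grouped data.table aggregation next to stats::ave on a small made-up table. It only shows that the results agree; it says nothing about relative speed, and the column names are assumptions for illustration.

```r
library(data.table)

# Hypothetical data for illustration only.
DT <- data.table(g = c("x", "y", "x", "y"), val = c(10, 20, 30, 40))

# One row per group, similar in spirit to a group_by followed by a summary.
per_group <- DT[, .(mean_val = mean(val), n = .N), by = g]

# One value per original row, similar in spirit to stats::ave.
DT[, group_mean := mean(val), by = g]
base_result <- ave(DT$val, DT$g)  # ave() uses FUN = mean by default
```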

11627 questions
815 votes, 4 answers

data.table vs dplyr: can one do something well the other can't or does poorly?

Overview: I'm relatively familiar with data.table, not so much with dplyr. I've read through some dplyr vignettes and examples that have popped up on SO, and so far my conclusions are that: data.table and dplyr are comparable in speed, except when…
BrodieG

216 votes, 8 answers

How do you delete a column by name in data.table?

To get rid of a column named "foo" in a data.frame, I can do: df <- df[-grep('foo', colnames(df))] However, once df is converted to a data.table object, there is no way to just remove a column. Example: df <- data.frame(id = 1:100, foo =…
Maiasaura

210 votes, 2 answers

Understanding exactly when a data.table is a reference to (vs a copy of) another data.table

I'm having a little trouble understanding the pass-by-reference properties of data.table. Some operations seem to 'break' the reference, and I'd like to understand exactly what's happening. On creating a data.table from another data.table (via <-,…
Peter Fine

181 votes, 3 answers

What does .SD stand for in data.table in R

.SD looks useful but I do not really know what I am doing with it. What does it stand for? Why is there a preceding period (full stop)? What is happening when I use it? I read: .SD is a data.table containing the subset of x's data for each group,…
Farrel

163 votes, 8 answers

Aggregate / summarize multiple variables per group (e.g. sum, mean)

From a data frame, is there an easy way to aggregate (sum, mean, max, etc.) multiple variables simultaneously? Below are some sample data: library(lubridate) days = 365*2 date = seq(as.Date("2000-01-01"), length = days, by = "day") year =…
MikeTP

162 votes, 4 answers

Why were pandas merges in python faster than data.table merges in R in 2012?

I recently came across the pandas library for python, which according to this benchmark performs very fast in-memory merges. It's even faster than the data.table package in R (my language of choice for analysis). Why is pandas so much faster than…
Zach

162 votes, 10 answers

Fastest way to replace NAs in a large data.table

I have a large data.table, with many missing values scattered throughout its ~200k rows and 200 columns. I would like to recode those NA values to zeros as efficiently as possible. I see two options: 1: Convert to a data.frame, and use something…
Zach

161 votes, 18 answers

Replacing NAs with latest non-NA value

In a data.frame (or data.table), I would like to "fill forward" NAs with the closest previous non-NA value. A simple example, using vectors (instead of a data.frame) is the following: > y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA) I would like a…
Ryogi

160 votes, 6 answers

How to delete a row by reference in data.table?

My question is related to assignment by reference versus copying in data.table. I want to know if one can delete rows by reference, similar to DT[ , someCol := NULL] I want to know about DT[someRow := NULL, ] I guess there's a good reason for why…
Florian Oswald

151 votes, 2 answers

Assign multiple columns using := in data.table, by group

What is the best way to assign to multiple columns using data.table? For example: f <- function(x) {c("hi", "hello")} x <- data.table(id = 1:10) I would like to do something like this (of course this syntax is incorrect): x[ , (col1, col2) := f(),…
Alex

147 votes, 5 answers

Select multiple columns in data.table by their numeric indices

How can we select multiple columns using a vector of their numeric indices (position) in data.table? This is how we would do it with a data.frame: df <- data.frame(a = 1, b = 2, c = 3) df[ , 2:3] # b c # 1 2 3
jamborta

140 votes, 2 answers

Why is rbindlist "better" than rbind?

I am going through the documentation of data.table and also noticed from some of the conversations here on SO that rbindlist is supposed to be better than rbind. I would like to know why rbindlist is better than rbind and in which scenarios…
CHP

134 votes, 3 answers

Why does X[Y] join of data.tables not allow a full outer join, or a left join?

This is a bit of a philosophical question about data.table join syntax. I am finding more and more uses for data.tables, but still learning... The join format X[Y] for data.tables is very concise, handy and efficient, but as far as I can tell, it…
Douglas Clark

134 votes, 3 answers

Sort rows in data.table in decreasing order on string key `order(-x,v)` gives error on data.table 1.9.4 or earlier

Let's say I have the following data.table in R: library(data.table) DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9) I want to order it by two columns (say columns x and v). I used this: DT[order(x,v)] # sorts first by x then…
nhern121

130 votes, 2 answers

How to reorder data.table columns (without copying)

I'd like to reorder columns in my data.table x, given a character vector of column names, neworder: library(data.table) x <- data.table(a = 1:3, b = 3:1, c = runif(3)) neworder <- c("c", "b", "a") Obviously I could do: x[ , neworder, with =…
Michael