Is there a better syntax for subsetting a data frame in R?

Question

I want to conditionally subset a dataframe without referencing the dataframe. For example if I have the following:

long_data_frame_name <- data.frame(x=1:10, y=1:10)

I want to say:

subset <- long_data_frame_name[x < 5,]

But instead, I have to say:

subset <- long_data_frame_name[long_data_frame_name$x < 5,]

plyr and ggplot handle this so beautifully. Is there any package that makes subsetting a data frame similarly beautiful?

Beauty is in the eye of the beholder. :-) You may find `with` more lovely. — Carl Witthoft, Nov 01 '12 at 15:07
@Carl I don't see how `with` would apply in this case. That being said, though, I think it's hard to write beautiful R code without using `with` and `within`. — Matthew Plourde, Nov 01 '12 at 15:15
@Roman How so? You'd still have to type the name of the `data.frame` twice. — Matthew Plourde, Nov 01 '12 at 15:23
I agree with @mplourde: with `with`, you would still need `subset` — Ben Bolker, Nov 01 '12 at 15:24
@BenBolker Yes, you're right. I was trying :-( to extend my snarky comment about the subjectivity of "attractive" code. I'd go with `subset` myself. — Carl Witthoft, Nov 01 '12 at 15:41
Bah, foiled by my cursory reading. `with` will not save you from typing the name twice, of course... — Roman Luštrik, Nov 01 '12 at 15:54
Per beauty, I gave context for my standard of beauty with an example and an allusion to plyr and ggplot. Subjective and clearly defined. — Ben Haley, Nov 01 '12 at 16:05
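
For readers following the exchange above, here is a minimal sketch of why `with` does not shorten the call. It lets the condition refer to `x` directly, but the data frame still has to be named a second time in order to be indexed. The result name `small` is only illustrative and does not come from the thread.

```r
long_data_frame_name <- data.frame(x = 1:10, y = 1:10)

# `with` evaluates the expression with the columns in scope,
# but the data frame name still appears twice.
small <- with(long_data_frame_name, long_data_frame_name[x < 5, ])

# Check that this matches the explicit form from the question.
identical(small, long_data_frame_name[long_data_frame_name$x < 5, ])
```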

score 10 · Accepted Answer · edited May 23 '17 at 12:09

It sounds like you are looking for the data.table package, which implements exactly the indexing syntax you describe. (data.table objects are essentially data.frames with added functionality, so you can continue to use them almost anywhere you would use a "plain old" data.frame.)

Matthew Dowle, the package's author, argues for the advantages of [.data.table()'s indexing syntax in his answer to this popular SO [r]-tag question. His answer there could just as well have been written as a direct response to your question above!

Here's an example:

library(data.table)
long_data_table_name <- data.table(x=1:10, y=1:10) 

subset <- long_data_table_name[x < 5, ]
subset
#    x y
# 1: 1 1
# 2: 2 2
# 3: 3 3
# 4: 4 4
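
The "almost" above matters in practice. As a rough sketch (assuming the data.table package is installed; the names `df`, `dt` and `df_again` are made up for illustration), a single index means different things for the two classes, and `as.data.frame()` converts back when other code expects a plain data.frame:

```r
library(data.table)

df <- data.frame(x = 1:10, y = 1:10)
dt <- as.data.table(df)

# A single index selects a column of a data.frame ...
df[2]
# ... but a row of a data.table.
dt[2]

# Convert back to a plain data.frame for code that expects one.
df_again <- as.data.frame(dt[x < 5])
class(df_again)
```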

answered Nov 01 '12 at 15:20

Josh O'Brien

@Ben Be advised that `data.table`s and `data.frame`s aren't completely interchangeable. Additionally, I'm not sure I see the point of loading an extra library just for a syntax convention that the base language effectively supports. – Matthew Plourde Nov 01 '12 at 16:09
@mplourde Good warning. I think `subset` seems like a safer option. But clean syntax is important to me. It's the reason I prefer Python. It's easier to read and therefore easier to comprehend. – Ben Haley Nov 01 '12 at 16:11
@mplourde -- Thanks for adding this caveat, with which I totally agree. I'd probably have hedged more had Ben not specifically asked for "a package that makes subsetting a data.frame similarly beautiful". – Josh O'Brien Nov 01 '12 at 16:12
@BenHaley -- And also be aware that `subset()` has its own shortcomings. In particular, do not try to use it programmatically (i.e. within any functions you write); it's basically for interactive use, and there's even a Warning to that effect in `?subset`. – Josh O'Brien Nov 01 '12 at 16:15
@Josh I believe the warning, because it's there. But I can't for the life of me come up with a scenario where it breaks... – Matthew Plourde Nov 01 '12 at 16:39
@mplourde -- [Here's a link](http://stackoverflow.com/questions/9860090/in-r-why-is-better-than-subset) with much discussion (and links to other examples) that should help you see the problem. (FWIW, it wasn't until *years* after I first read the warning that I finally "got" what the problem with `subset()` was.) – Josh O'Brien Nov 01 '12 at 16:50
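
One scenario of the kind asked about in the comments can be sketched in base R. This is only an illustration, not taken from the linked discussion: the helper names `below_subset` and `below_bracket` are hypothetical. Because `subset()` looks names up in the data frame first, a function argument that shares its name with a column is silently shadowed by that column.

```r
df <- data.frame(x = 1:10, y = 1:10)

# Hypothetical helper: keep rows where x is below the argument `y`.
below_subset <- function(data, y) subset(data, x < y)

# Same intent with explicit indexing.
below_bracket <- function(data, y) data[data$x < y, ]

nrow(below_subset(df, 5))   # 0: `y` resolved to the column df$y, not the argument
nrow(below_bracket(df, 5))  # 4: `y` is the function argument
```

No error or warning is raised in the first call, which is part of what makes the problem hard to spot.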

score 5 · Answer 2 · edited Nov 01 '12 at 19:26

Yes:

newdata <- subset(mydata, sex=="m" & age > 25)

or

newdata <- subset(mydata, sex=="m" & age > 25 , select=weight:income)

Reference: http://www.statmethods.net/management/subset.html
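
Since `mydata` is not defined in this answer, here is a small self-contained sketch. The data values and the `height` column are invented for illustration; only the column names `sex`, `age`, `weight` and `income` come from the calls above.

```r
# Hypothetical data; only some column names come from the answer above.
mydata <- data.frame(
  sex    = c("m", "f", "m", "m"),
  age    = c(30, 41, 22, 57),
  weight = c(80, 62, 75, 90),
  height = c(180, 165, 178, 172),
  income = c(40, 55, 20, 70)
)

# Rows: men older than 25. Columns: everything from weight through income.
newdata <- subset(mydata, sex == "m" & age > 25, select = weight:income)
newdata
```

In `select`, `weight:income` is interpreted by column position, so any columns that sit between the two (here `height`) are included as well.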

answered Nov 01 '12 at 16:00

Davoud Taghawi-Nejad

'subset – Ben Haley Nov 01 '12 at 16:09

score 4 · Answer 3 · answered Nov 01 '12 at 16:11

Beauty is subjective, isn't it? In the interest of sharing other solutions, there's also the sqldf package:

library(sqldf)
subset <- sqldf("select * from long_data_frame_name where x < 5")

A5C1D2H2I1M1N2O1R2T1

score 3 · Answer 4 · answered Nov 18 '14 at 02:23

Try dplyr, released after this question was posted and answered. It is great for many common data frame munging tasks.

library(dplyr)
subset <- filter(long_data_frame_name, x < 5)

or, equivalently:

subset <- long_data_frame_name %>% filter(x < 5)

NC maize breeding Jim
