
I have the following data frame in R:

id<-c(1,2,3,4,10,2,4,5,6,8,2,1,5,7,7)
date<-c(19970807,19970902,19971010,19970715,19991212,19961212,19980909,19990910,19980707,19991111,19970203,19990302,19970605,19990808,19990706)
spent<-c(1997,19,199,134,654,37,876,890,873,234,643,567,23,25,576)
df<-data.frame(id,date,spent)

I need to take a random sample of 3 customers (based on id) such that all observations for each sampled customer are extracted.

Arun
AliCivil
  • You have duplicate ids in there. How do you want to deal with those? – A5C1D2H2I1M1N2O1R2T1 Aug 20 '12 at 04:37
  • If id=3 is in my sample, just 1 record for it goes into my sample, but if id=4 is in my sample, I would expect the sample to have 2 rows with id=4 but with different "date" and "spent" values. – AliCivil Aug 20 '12 at 04:44

2 Answers


You want to use `%in%` and `unique`: sample from the distinct ids, then keep every row whose id is among the sampled ones.

df[df$id %in% sample(unique(df$id),3),]
##    id     date spent
## 4   4 19970715   134
## 7   4 19980909   876
## 8   5 19990910   890
## 10  8 19991111   234
## 13  5 19970605    23
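The one-liner above can be wrapped in a small function. This is only a sketch: the name `sample_customers`, its arguments, and the seed value are hypothetical and not part of the original answer. It indexes into the vector of unique ids rather than passing that vector straight to `sample`, because `sample(x, n)` draws from `1:x` when `x` is a single number.

```r
# Hypothetical helper (not from the original answer):
# draw n distinct ids and keep every row belonging to them.
sample_customers <- function(data, n, id_col = "id") {
  ids <- unique(data[[id_col]])
  # Sample positions, then look up the ids, so a length-one id vector
  # is not misread by sample() as the range 1:id.
  chosen <- ids[sample.int(length(ids), n)]
  data[data[[id_col]] %in% chosen, , drop = FALSE]
}

id <- c(1,2,3,4,10,2,4,5,6,8,2,1,5,7,7)
date <- c(19970807,19970902,19971010,19970715,19991212,19961212,19980909,
          19990910,19980707,19991111,19970203,19990302,19970605,19990808,19990706)
spent <- c(1997,19,199,134,654,37,876,890,873,234,643,567,23,25,576)
df <- data.frame(id, date, spent)

set.seed(42)  # arbitrary seed, only so the draw can be repeated
out <- sample_customers(df, 3)

# Sanity checks: exactly 3 customers, and none of their rows is missing
length(unique(out$id)) == 3
nrow(out) == sum(df$id %in% unique(out$id))
```

Both checks should evaluate to `TRUE` regardless of which three customers are drawn.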

Using data.table to avoid $ referencing

library(data.table)
DT <- data.table(df)

DT[id %in% sample(unique(id),3)]
##    id     date spent
## 1:  1 19970807  1997
## 2:  4 19970715   134
## 3:  4 19980909   876
## 4:  1 19990302   567
## 5:  7 19990808    25
## 6:  7 19990706   576

This ensures that you are always evaluating the expressions within the data.table.

mnel

Use something like:

df[sample(nrow(df), 3), ]
#   id     date spent
# 1  1 19970807  1997
# 5 10 19991212   654
# 8  5 19990910   890

Of course, your samples would be different.
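If a repeatable draw is needed, the random number generator can be seeded before sampling. A minimal sketch, assuming row-level sampling as in this part of the answer; the seed value 123 is arbitrary and not from the original post.

```r
id <- c(1,2,3,4,10,2,4,5,6,8,2,1,5,7,7)
date <- c(19970807,19970902,19971010,19970715,19991212,19961212,19980909,
          19990910,19980707,19991111,19970203,19990302,19970605,19990808,19990706)
spent <- c(1997,19,199,134,654,37,876,890,873,234,643,567,23,25,576)
df <- data.frame(id, date, spent)

set.seed(123)                          # arbitrary seed
first <- df[sample(nrow(df), 3), ]     # 3 random rows

set.seed(123)                          # same seed again
second <- df[sample(nrow(df), 3), ]    # same 3 rows

identical(first, second)  # TRUE: same seed, same sample
```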

Update

If you want unique customers, you can aggregate first.

df2 = aggregate(list(date = df$date, spent = df$spent), list(id = df$id), c)
df2[sample(nrow(df2), 3), ]
#   id               date    spent
# 4  4 19970715, 19980909 134, 876
# 5  5 19990910, 19970605  890, 23
# 8  8           19991111      234
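Another way to treat each customer as a single unit while keeping one row per observation is to split the data frame by id and sample from the resulting pieces. This is a sketch using base R's `split`, offered as an alternative and not part of the original answer; the names `by_customer`, `picked`, and `result` are illustrative.

```r
id <- c(1,2,3,4,10,2,4,5,6,8,2,1,5,7,7)
date <- c(19970807,19970902,19971010,19970715,19991212,19961212,19980909,
          19990910,19980707,19991111,19970203,19990302,19970605,19990808,19990706)
spent <- c(1997,19,199,134,654,37,876,890,873,234,643,567,23,25,576)
df <- data.frame(id, date, spent)

# One data frame per customer, held in a named list
by_customer <- split(df, df$id)

# Pick 3 list elements (customers) at random, by position
picked <- by_customer[sample(length(by_customer), 3)]

# Stack the chosen customers back into a single data frame
result <- do.call(rbind, picked)

length(unique(result$id)) == 3  # three distinct customers were drawn
```

Unlike the `aggregate` version, the result keeps the original long layout, so `date` and `spent` remain ordinary numeric columns.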

Or, an option without aggregate:

df[df$id %in% sample(unique(df$id), 3), ]
#    id     date spent
# 1   1 19970807  1997
# 3   3 19971010   199
# 12  1 19990302   567
# 14  7 19990808    25
# 15  7 19990706   576
A5C1D2H2I1M1N2O1R2T1
  • Thanks. But this gives me a sample of 3 records, not a sample of the records of 3 customers. – AliCivil Aug 20 '12 at 04:41
  • But count the repetition of the variable name `df`, possible mix ups between `df` and `df2`, and why that can be an issue [here](http://stackoverflow.com/a/10758086/403310). – Matt Dowle Aug 20 '12 at 16:01
  • @MatthewDowle, I wholeheartedly agree that this can be a major annoyance with R, and have read your linked answer several times before, as well as looked at the `data.table` documentation. I intend to learn `data.table` soon--it looks awesome!--but "soon" just hasn't happened yet (and since I don't actually even *use* R professionally, I'm not sure when that "yet" will come ;-) ). – A5C1D2H2I1M1N2O1R2T1 Aug 21 '12 at 04:12