Efficient way to modify a data frame in r

Question

I have a data frame like the following (but much larger)

df<-data.frame(Claim=c("117","249","117","117","249","652"),ValuationDate=c("01/05/1965","01/05/1980","01/10/1965","01/20/1965","01/30/1980","01/10/1990"))

df
  Claim ValuationDate
1   117    01/05/1965
2   249    01/05/1980
3   117    01/10/1965
4   117    01/20/1965
5   249    01/30/1980
6   652    01/10/1990

I want to add a column, "ValuationNumber" which is 1 for the earliest valuation date for each claim, and increases by 1 for each subsequent valuation date (for each claim). E.g. the ValuationNumber for row 5 would be 2, because row 5 describes claim 249 on date 01/30/1980, which is the second time claim 249 was evaluated. The complete output for df would look like

df
  Claim ValuationDate  ValuationNumber
1   117    01/05/1965   1
2   249    01/05/1980   1
3   117    01/10/1965   2
4   117    01/20/1965   3
5   249    01/30/1980   2
6   652    01/10/1990   1

What's an efficient way to add the column I need?

score 2 · Accepted Answer · edited May 23 '17 at 11:49

2

This is very similar to this data.table question:

library(data.table)
dt = data.table(df)

dt[, ValuationNumber := 1:.N, by = Claim]
dt
#   Claim ValuationDate ValuationNumber
#1:   117    01/05/1965               1
#2:   249    01/05/1980               1
#3:   117    01/10/1965               2
#4:   117    01/20/1965               3
#5:   249    01/30/1980               2
#6:   652    01/10/1990               1

edited May 23 '17 at 11:49

Community

1
1

answered Jul 08 '13 at 20:27

eddi

47,367
6
94
148

While in this case `.N` should always be `>=1`, `seq_len(.N)` is a safer idiom to use to avoid the `1:0` pitfall (which easily happen (see `fortune(26)`) – mnel Jul 08 '13 at 23:16

score 2 · Answer 2 · answered Jul 08 '13 at 21:58

2

An R Base solution

> transform(df, ValuationNumber=ave(rep(1,nrow(df)), df$Claim, FUN=seq))
  Claim ValuationDate ValuationNumber
1   117    01/05/1965               1
2   249    01/05/1980               1
3   117    01/10/1965               2
4   117    01/20/1965               3
5   249    01/30/1980               2
6   652    01/10/1990               1

answered Jul 08 '13 at 21:58

Jilber Urbina

50,760
8
101
127

Better than my fumbling efforts to supply `ave` with a numeric vector. – IRTFM Jul 09 '13 at 00:25

score 1 · Answer 3 · answered Jul 08 '13 at 21:03

1

Using plyr:

ddply(df, "Claim", function(d) { d$ValuationNumber <- seq_along(d$Claim); d})

This will also destroy the order.

answered Jul 08 '13 at 21:03

krlmlr

22,030
13
107
191

3

`ddply(df, "Claim", mutate, ValuationNumber = seq_along(Claim))` is slightly easier to read, less typing and still a `plyr` solution – mnel Jul 09 '13 at 00:23

Thomas · Answer 4 · 2013-07-08T20:44:31.523

0

You can do this with by splitting by date, adding a new column, and rebuilding the df:

out <- lapply(split(df,df$Claim), function(x) {
    ValuationNumber <- 1:dim(x)[1]
    cbind(x,ValuationNumber)
})
dfout <- do.call(rbind, out)

And to optionally sort back by date (but it would be better to convert to a date class first so that this was in actual chronological order):

dfout[order(dfout$ValuationDate),]

The results:

      Claim ValuationDate ValuationNumber
117.1   117    01/05/1965               1
249.2   249    01/05/1980               1
117.3   117    01/10/1965               2
652     652    01/10/1990               1
117.4   117    01/20/1965               3
249.5   249    01/30/1980               2

edited Jul 08 '13 at 20:44

answered Jul 08 '13 at 20:03

Thomas

40,508
11
98
131

Note that you're **not** sorting by date, but lexicographically instead (and either way you lose the original order). – eddi Jul 08 '13 at 20:34
@eddi That is true, but order is rarely important. – Thomas Jul 08 '13 at 20:43

IRTFM · Answer 5 · 2013-07-09T00:22:22.390

If this were being done on a dataset known to be ordered within claim number it would be as easy as:

df$ValuationNumber <- ave( 
 as.numeric(as.Date( as.character(df$ValuationDate), format="%m/%d/%Y")), 
 df$Claim, 
 FUN=seq_along)  # actually turned out to be some what not-easy.

Without assurance of the ordering you would first need to sort by date and at the moment you have a non-Date version of ValuationDate.

 df <- df[order( 
          as.Date( as.character(df$ValuationDate), 
                              format="%m/%d/%Y")) , ]
 df$ValuationNumber <- ave( 
     as.numeric(as.Date( as.character(df$ValuationDate), format="%m/%d/%Y")), 
     df$Claim, 
     FUN=seq_along)

Efficient way to modify a data frame in r

5 Answers5