0

I have a data frame like the following (but much larger)

df<-data.frame(Claim=c("117","249","117","117","249","652"),ValuationDate=c("01/05/1965","01/05/1980","01/10/1965","01/20/1965","01/30/1980","01/10/1990"))

df
  Claim ValuationDate
1   117    01/05/1965
2   249    01/05/1980
3   117    01/10/1965
4   117    01/20/1965
5   249    01/30/1980
6   652    01/10/1990

I want to add a column, "ValuationNumber" which is 1 for the earliest valuation date for each claim, and increases by 1 for each subsequent valuation date (for each claim). E.g. the ValuationNumber for row 5 would be 2, because row 5 describes claim 249 on date 01/30/1980, which is the second time claim 249 was evaluated. The complete output for df would look like

df
  Claim ValuationDate  ValuationNumber
1   117    01/05/1965   1
2   249    01/05/1980   1
3   117    01/10/1965   2
4   117    01/20/1965   3
5   249    01/30/1980   2
6   652    01/10/1990   1

What's an efficient way to add the column I need?

Ben
  • 15,465
  • 26
  • 90
  • 157

5 Answers5

2

This is very similar to this data.table question:

library(data.table)
dt = data.table(df)

dt[, ValuationNumber := 1:.N, by = Claim]
dt
#   Claim ValuationDate ValuationNumber
#1:   117    01/05/1965               1
#2:   249    01/05/1980               1
#3:   117    01/10/1965               2
#4:   117    01/20/1965               3
#5:   249    01/30/1980               2
#6:   652    01/10/1990               1
Community
  • 1
  • 1
eddi
  • 47,367
  • 6
  • 94
  • 148
  • While in this case `.N` should always be `>=1`, `seq_len(.N)` is a safer idiom to use to avoid the `1:0` pitfall (which easily happen (see `fortune(26)`) – mnel Jul 08 '13 at 23:16
2

An R Base solution

> transform(df, ValuationNumber=ave(rep(1,nrow(df)), df$Claim, FUN=seq))
  Claim ValuationDate ValuationNumber
1   117    01/05/1965               1
2   249    01/05/1980               1
3   117    01/10/1965               2
4   117    01/20/1965               3
5   249    01/30/1980               2
6   652    01/10/1990               1
Jilber Urbina
  • 50,760
  • 8
  • 101
  • 127
1

Using plyr:

ddply(df, "Claim", function(d) { d$ValuationNumber <- seq_along(d$Claim); d})

This will also destroy the order.

krlmlr
  • 22,030
  • 13
  • 107
  • 191
  • 3
    `ddply(df, "Claim", mutate, ValuationNumber = seq_along(Claim))` is slightly easier to read, less typing and still a `plyr` solution – mnel Jul 09 '13 at 00:23
0

You can do this with by splitting by date, adding a new column, and rebuilding the df:

out <- lapply(split(df,df$Claim), function(x) {
    ValuationNumber <- 1:dim(x)[1]
    cbind(x,ValuationNumber)
})
dfout <- do.call(rbind, out)

And to optionally sort back by date (but it would be better to convert to a date class first so that this was in actual chronological order):

dfout[order(dfout$ValuationDate),]

The results:

      Claim ValuationDate ValuationNumber
117.1   117    01/05/1965               1
249.2   249    01/05/1980               1
117.3   117    01/10/1965               2
652     652    01/10/1990               1
117.4   117    01/20/1965               3
249.5   249    01/30/1980               2
Thomas
  • 40,508
  • 11
  • 98
  • 131
0

If this were being done on a dataset known to be ordered within claim number it would be as easy as:

df$ValuationNumber <- ave( 
 as.numeric(as.Date( as.character(df$ValuationDate), format="%m/%d/%Y")), 
 df$Claim, 
 FUN=seq_along)  # actually turned out to be some what not-easy.

Without assurance of the ordering you would first need to sort by date and at the moment you have a non-Date version of ValuationDate.

 df <- df[order( 
          as.Date( as.character(df$ValuationDate), 
                              format="%m/%d/%Y")) , ]
 df$ValuationNumber <- ave( 
     as.numeric(as.Date( as.character(df$ValuationDate), format="%m/%d/%Y")), 
     df$Claim, 
     FUN=seq_along)
IRTFM
  • 240,863
  • 19
  • 328
  • 451