I want to remove duplicate rows (same id, name, and position) and keep only the one with the maximum year. My data looks like the following:

id  name    year    position
1   Jane    1990    Sales
1   Jane    1991    Sales
1   Jane    1992    Sales
1   Jane    1993    Boss
1   Jane    1994    CEO
2   Tom     1978    HR
2   Tom     1979    Sales
2   Tom     1980    PR
2   Tom     1981    Boss
3   Jim     1981    Sales
3   Jim     1982    Sales
3   Jim     1983    PR

The desired output is:

   id   name    year    position
    1   Jane    1992    Sales
    1   Jane    1993    Boss
    1   Jane    1994    CEO
    2   Tom     1978    HR
    2   Tom     1979    Sales
    2   Tom     1980    PR
    2   Tom     1981    Boss
    3   Jim     1982    Sales
    3   Jim     1983    PR

Is there a way to code this? I tried the following, but it did not work:

new<-ddply(df, df$position=="Sales", function(df) return(df[df$year==max(df$year),]))
song0089

1 Answer

Group by id, name, and position, then keep the maximum year in each group:

library(plyr)
ddply(df, .(id, name, position), summarize, year = max(year))

If you want the result sorted by id and year:

arrange(ddply(df, .(id, name, position), summarize, year = max(year)), id, year)

I recommend plyr's successor, dplyr:

library(dplyr)
df %>% group_by(id, name, position) %>% summarise(year=max(year)) %>% arrange(id, year)
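
To try the dplyr version end to end, here is a minimal reproducible sketch. It rebuilds the question's data by hand; the `result` name and the `.groups = "drop"` argument (available in dplyr 1.0.0 and later) are additions for illustration, not part of the original answer.

```r
library(dplyr)

# Rebuild the data frame from the question
df <- data.frame(
  id = c(rep(1, 5), rep(2, 4), rep(3, 3)),
  name = c(rep("Jane", 5), rep("Tom", 4), rep("Jim", 3)),
  year = c(1990:1994, 1978:1981, 1981:1983),
  position = c("Sales", "Sales", "Sales", "Boss", "CEO",
               "HR", "Sales", "PR", "Boss",
               "Sales", "Sales", "PR"),
  stringsAsFactors = FALSE
)

# One row per (id, name, position), keeping the latest year,
# then sorted by id and year
result <- df %>%
  group_by(id, name, position) %>%
  summarise(year = max(year), .groups = "drop") %>%
  arrange(id, year)

print(result)
```

Note that grouping by position collapses every row with the same id, name, and position, not only consecutive ones. If a person could return to an earlier position after holding another one, this approach would keep only the latest of those rows.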
Jaap
Randy Lai