2

In R, I have a bunch of data in a dataframe like:

state | zip   | value
______|_______|______
CA    | 94555 | 18
CA    | 94556 | 5
OH    | 12345 | 22
OH    | 12346 | 10

and so on.

I want an add a column to each row listing the mean 'value' for that state.

I can get a dataframe of the means via "(aggregate(data$value, list(State = data$state), mean))". That gives me a dataframe with 50 rows, one for each state. But I need to then go back into the original dataframe and put the state's average in rows belonging to that state.

How would I go about doing this?

Arun
  • 108,644
  • 21
  • 263
  • 366
int3h
  • 452
  • 4
  • 14

5 Answers5

6

And a data.table solution

library(data.table)
DT <-  data.table(state = c("CA","CA","OH","OH"), 
                   zip = c(94555,94556,12345,12346), 
                   value = c(18, 5, 22, 10))

DT[, mean := mean(value), by = state]

##    state   zip value mean
## 1:    CA 94555    18 11.5
## 2:    CA 94556     5 11.5
## 3:    OH 12345    22 16.0
## 4:    OH 12346    10 16.0
Matt Dowle
  • 56,107
  • 20
  • 160
  • 217
mnel
  • 105,872
  • 25
  • 248
  • 242
4

Use the merge command, e.g.

df = data.frame(state=c('CA','CA','OH','OH'),zip=c(94555,94556,12345,12346),value=c(18,5,22,10))
df2 = aggregate(df$value,list(state=df$state),mean)
merge(df,df2)
blindjesse
  • 4,088
  • 23
  • 34
  • 1
    As with the accepted answer, there's variable name repetition here: `df` twice and `df2` twice. See [here](http://stackoverflow.com/a/10758086/403310) for why variable name repetition can sometimes lead to bugs. – Matt Dowle Aug 14 '12 at 10:01
4

You can use plyr

library(plyr)
df<-data.frame(state=c("CA","CA","OH","OH"),zip=c(94555,94556,12345,12346),value=c(18,5,22,10))
out<-ddply(df,.(state),transform,mean=mean(value))
> out
  state   zip value mean
1    CA 94555    18 11.5
2    CA 94556     5 11.5
3    OH 12345    22 16.0
4    OH 12346    10 16.0

as noted by mnel mutate could also be used and should be faster

ddply(df,.(state),mutate,mean=mean(value))
shhhhimhuntingrabbits
  • 7,017
  • 2
  • 18
  • 20
3

This is a perfect time to use the forgotten ave function:

dat <- data.frame(state = c('CA','CA','OH','OH'), 
    zip = c('94555','94556','12345','12346'), 
    value = c(18,5,22,10))

dat$mean <- ave(dat$value, dat$state, FUN=mean)
Tyler Rinker
  • 99,090
  • 56
  • 292
  • 477
  • +1 This answer repeats `dat` 3 times, though. See [here](http://stackoverflow.com/a/10758086/403310) for how variable name repetition can sometimes lead to bugs. – Matt Dowle Aug 14 '12 at 09:59
2

Merge is the key here.

data <- data.frame(state = c('CA','CA','OH','OH'), zip = c('94555','94556','12345','12346'), value = c(18,5,22,10))
aggs <- aggregate(data$value, list(State = data$state), mean)
names(aggs) <- c('state','avg')
merge(data, aggs, by = 'state')
Tommy O'Dell
  • 6,553
  • 13
  • 51
  • 67
  • How is this any different from @blindJesse's [answer](http://stackoverflow.com/a/11945344/1270695)? – A5C1D2H2I1M1N2O1R2T1 Aug 14 '12 at 03:28
  • The 'by' was the difference here. That worked perfectly! Thank you! – int3h Aug 14 '12 at 03:32
  • His answer seems to have arrived between the time that I opened this question and submitted my answer. – Tommy O'Dell Aug 14 '12 at 03:34
  • Yeah, this was the first one I happened to see when I reloaded the page, and it worked. I'm sure blindJesse's answer would work as well. Many ways to skin this cat, apparently. I just couldn't figure out the right way to phrase it to get Google to give me an answer. – int3h Aug 14 '12 at 03:51
  • This answer repeats the variable name `data` 3 times and `agg` 3 times, too. See [here](http://stackoverflow.com/a/10758086/403310) for why variable name repetition can lead to bugs. – Matt Dowle Aug 14 '12 at 09:56