How do I create a column containing aggregated means with R?

Question

In R, I have a bunch of data in a dataframe like:

state | zip   | value
______|_______|______
CA    | 94555 | 18
CA    | 94556 | 5
OH    | 12345 | 22
OH    | 12346 | 10

and so on.

I want an add a column to each row listing the mean 'value' for that state.

I can get a dataframe of the means via "(aggregate(data$value, list(State = data$state), mean))". That gives me a dataframe with 50 rows, one for each state. But I need to then go back into the original dataframe and put the state's average in rows belonging to that state.

How would I go about doing this?

To clarify: you only want the state mean and not mean by state + zip, right? — A5C1D2H2I1M1N2O1R2T1, Aug 14 '12 at 03:26
Can I ask why there are no upvotes for this question despite there being (currently) five >=2 answers? Am I missing something? — Andy Barbour, Aug 14 '12 at 06:46

score 6 · Answer 1 · edited Aug 14 '12 at 09:49

6

And a data.table solution

library(data.table)
DT <-  data.table(state = c("CA","CA","OH","OH"), 
                   zip = c(94555,94556,12345,12346), 
                   value = c(18, 5, 22, 10))

DT[, mean := mean(value), by = state]

##    state   zip value mean
## 1:    CA 94555    18 11.5
## 2:    CA 94556     5 11.5
## 3:    OH 12345    22 16.0
## 4:    OH 12346    10 16.0

edited Aug 14 '12 at 09:49

Matt Dowle

56,107
20
160
217

answered Aug 14 '12 at 03:35

mnel

105,872
25
248
242

score 4 · Answer 2 · answered Aug 14 '12 at 03:20

4

Use the merge command, e.g.

df = data.frame(state=c('CA','CA','OH','OH'),zip=c(94555,94556,12345,12346),value=c(18,5,22,10))
df2 = aggregate(df$value,list(state=df$state),mean)
merge(df,df2)

answered Aug 14 '12 at 03:20

blindjesse

4,088
23
34

1

As with the accepted answer, there's variable name repetition here: `df` twice and `df2` twice. See [here](http://stackoverflow.com/a/10758086/403310) for why variable name repetition can sometimes lead to bugs. – Matt Dowle Aug 14 '12 at 10:01

shhhhimhuntingrabbits · Answer 3 · 2012-08-14T03:50:56.547

4

You can use plyr

library(plyr)
df<-data.frame(state=c("CA","CA","OH","OH"),zip=c(94555,94556,12345,12346),value=c(18,5,22,10))
out<-ddply(df,.(state),transform,mean=mean(value))
> out
  state   zip value mean
1    CA 94555    18 11.5
2    CA 94556     5 11.5
3    OH 12345    22 16.0
4    OH 12346    10 16.0

as noted by mnel mutate could also be used and should be faster

ddply(df,.(state),mutate,mean=mean(value))

edited Aug 14 '12 at 03:50

answered Aug 14 '12 at 03:25

shhhhimhuntingrabbits

7,017
2
18
20

2

You could also use `mutate` instead of `transform`. – mnel Aug 14 '12 at 03:40

Tyler Rinker · Answer 4 · 2012-08-14T04:15:13.987

3

This is a perfect time to use the forgotten ave function:

dat <- data.frame(state = c('CA','CA','OH','OH'), 
    zip = c('94555','94556','12345','12346'), 
    value = c(18,5,22,10))

dat$mean <- ave(dat$value, dat$state, FUN=mean)

edited Aug 14 '12 at 04:15

answered Aug 14 '12 at 04:00

Tyler Rinker

99,090
56
292
477

+1 This answer repeats `dat` 3 times, though. See [here](http://stackoverflow.com/a/10758086/403310) for how variable name repetition can sometimes lead to bugs. – Matt Dowle Aug 14 '12 at 09:59

score 2 · Accepted Answer · answered Aug 14 '12 at 03:25

2

Merge is the key here.

data <- data.frame(state = c('CA','CA','OH','OH'), zip = c('94555','94556','12345','12346'), value = c(18,5,22,10))
aggs <- aggregate(data$value, list(State = data$state), mean)
names(aggs) <- c('state','avg')
merge(data, aggs, by = 'state')

answered Aug 14 '12 at 03:25

Tommy O'Dell

6,553
13
51
67

How is this any different from @blindJesse's [answer](http://stackoverflow.com/a/11945344/1270695)? – A5C1D2H2I1M1N2O1R2T1 Aug 14 '12 at 03:28
The 'by' was the difference here. That worked perfectly! Thank you! – int3h Aug 14 '12 at 03:32
His answer seems to have arrived between the time that I opened this question and submitted my answer. – Tommy O'Dell Aug 14 '12 at 03:34
Yeah, this was the first one I happened to see when I reloaded the page, and it worked. I'm sure blindJesse's answer would work as well. Many ways to skin this cat, apparently. I just couldn't figure out the right way to phrase it to get Google to give me an answer. – int3h Aug 14 '12 at 03:51
This answer repeats the variable name `data` 3 times and `agg` 3 times, too. See [here](http://stackoverflow.com/a/10758086/403310) for why variable name repetition can lead to bugs. – Matt Dowle Aug 14 '12 at 09:56

How do I create a column containing aggregated means with R?

5 Answers5