1

I want to average the values by their IDs but not all ID's have the same number of values. How do I do this in R?

I have two columns ID and Value

    ID    Value
    1000    0.51
    1000    0.01
    1001    0.81
    1001    0.41
    1001    0.62
    1002    0.98
    1002    0.12
    1002    0.15
    1003    0.12
    ...     ...
derpill
  • 41
  • 1
  • 3

3 Answers3

7

You can try by():

> with(df, by(Value, ID, mean))
# ID: 1000
# [1] 0.26
# ------------------------------------------------------------ 
# ID: 1001
# [1] 0.6133333
# ------------------------------------------------------------ 
# ID: 1002
# [1] 0.4166667
# ------------------------------------------------------------ 
# ID: 1003
# [1] 0.12

or aggregate():

> aggregate( Value ~ ID, df, mean)
#     ID     Value
# 1 1000 0.2600000
# 2 1001 0.6133333
# 3 1002 0.4166667
# 4 1003 0.1200000

or using data.table (if you need fast calculation on large data sets):

> library(data.table)
> setDT(df)[, mean(Value), by = ID]
#      ID        V1
# 1: 1000 0.2600000
# 2: 1001 0.6133333
# 3: 1002 0.4166667
# 4: 1003 0.1200000

data

df <- structure(list(ID = c(1000L, 1000L, 1001L, 1001L, 1001L, 1002L, 
1002L, 1002L, 1003L), Value = c(0.51, 0.01, 0.81, 0.41, 0.62, 
0.98, 0.12, 0.15, 0.12)), .Names = c("ID", "Value"), 
class = "data.frame", row.names = c(NA, -9L))
RHertel
  • 21,502
  • 5
  • 34
  • 55
3

You could use the package dplyr and the function summarise_each:

df=data.frame(ID=c(1000,1000,1001,1001,1001,1002,1002,1002,1003), Value=c(0.51,0.01,0.81,0.41,0.62,0.98,0.12,0.15,0.12))

library(dplyr)
newdf <- df %>% group_by(ID) %>% summarise_each(funs(mean))

which gives you:

     ID     Value
  (dbl)     (dbl)
1  1000 0.2600000
2  1001 0.6133333
3  1002 0.4166667
4  1003 0.1200000

If you deal with large datasets this should be the most efficient way of doing this task.

Cleb
  • 20,118
  • 16
  • 91
  • 131
1

Using sqldf:

library(sqldf)
sqldf("SELECT ID, avg(Value) Mean 
      FROM df 
      GROUP BY ID")

Output:

    ID      Mean
1 1000 0.2600000
2 1001 0.6133333
3 1002 0.4166667
4 1003 0.1200000

With dplyr, instead of summarise_each as Cleb pointed out, we can just use summarise:

df %>% group_by(ID) %>% summarise(mean = mean(Value))
#or 
summarise(group_by(df, ID), mean = mean(Value))

Output:

    ID      mean
  (int)     (dbl)
1  1000 0.2600000
2  1001 0.6133333
3  1002 0.4166667
4  1003 0.1200000
mpalanco
  • 10,839
  • 1
  • 53
  • 63