1

if I have a data frame df

df=data.frame(x=1:20,y=c(1:10,rep(NA,10)),z=c(rep(NA,5),1:15))

I know to replace NAs with mean value for a given column is, we can use

df[is.na(df$x)]=mean(df$x,na.rm=T)

What I am trying to find is a way to use a single command so that it does this for the columns at once instead of repeating it for every column.

Suspecting, I need to use sapply and function, I tried something like this but clearly this does not work

sapply(df,function(x) df[is.na(df$x)]=mean(df$x,na.rm=T))

Any suggestions would be great. I tried to search previous post but I could not find a similar problem being addressed.

PagMax
  • 6,209
  • 7
  • 21
  • 36

1 Answers1

2

We can use na.aggregate. One option would be to separately apply the na.aggregate on each column. We can do this with lapply. If we are using data.table, convert the 'data.frame' to 'data.table' (setDT(df)), loop over the columns and apply na.aggregate. This will replace NA with the mean of the non-NA values.

library(zoo)
library(data.table)
setDT(df)[, names(df) := lapply(.SD, na.aggregate)][]
#     x    y  z
# 1:  1  1.0  8
# 2:  2  2.0  8
# 3:  3  3.0  8
# 4:  4  4.0  8
# 5:  5  5.0  8
# 6:  6  6.0  1
# 7:  7  7.0  2
# 8:  8  8.0  3
# 9:  9  9.0  4
#10: 10 10.0  5
#11: 11  5.5  6
#12: 12  5.5  7
#13: 13  5.5  8
#14: 14  5.5  9
#15: 15  5.5 10
#16: 16  5.5 11
#17: 17  5.5 12
#18: 18  5.5 13
#19: 19  5.5 14
#20: 20  5.5 15

Or we can use na.aggregate directly on the dataset.

na.aggregate(df)
akrun
  • 674,427
  • 24
  • 381
  • 486
  • Thanks and works like a charm. Cannot believe there is such a simple command. – PagMax Feb 04 '16 at 09:21
  • Also, since you marked it duplicate, can you point me to the previous query? This will help me in searching better next time. If one knows about na.aggregate, then it is easier to search, but I am not sure how do I look for it if I did not know about na.aggregate at all. – PagMax Feb 04 '16 at 09:21
  • @PagMax The link is already there in the post. http://stackoverflow.com/questions/25835643/replacing-missing-values-in-r-with-column-mean – akrun Feb 04 '16 at 09:23
  • @PagMax I understand that. It is sometimes difficult to choose the right keyword. The duplicate was identified by Pascal. If I use `r change NA value with mean` in google, the link also pops up. – akrun Feb 04 '16 at 09:25
  • 1
    Thanks akrun. I guess I am new to stackoverflow as well and learning it as I learn R. (I thought your full name was akrun pascal!!). Anyway, I went over the previous post and sure enough I should have found that one first. – PagMax Feb 04 '16 at 09:31
  • It turns out that the data.table and lapply solution works better for me. My data frame has lot of logical and factor variables and when I directly use na.aggregate, it converts all the columns (even the numerical ones) to chr. However, if I loop with lapply, it does lot better and exactly as I want. Not sure why it behaves differently, but I am glad that I have a working solution – PagMax Feb 04 '16 at 12:27
  • @PagMax For datasets with mixed classes, it is always good to loop with `lapply` to be safer. I think the `na.aggregate` would do some sort of `matrix` conversion behind the hood and as `matrix` can hold only a single class and if there is any non-numeric column, it will convert all the columns to 'character', this might be the issue. – akrun Feb 04 '16 at 12:33