I am not sure how to loop over each column to replace the NA values with the column mean. Replacing them for a single column with the following works well:

Column1[is.na(Column1)] <- round(mean(Column1, na.rm = TRUE))

The code for looping over columns is not working:

for(i in 1:ncol(data)){
    data[i][is.na(data[i])] <- round(mean(data[i], na.rm = TRUE))
}

The values are not replaced. Can someone please help me with this?

    Replacing missing values with the mean of a column is statistical malpractice. – IRTFM Sep 14 '14 at 17:10
  • @BondedDust The reason I did so was because if I ignored those NA values my data-set shrink to a very small number. Can you suggest what is the best way to handle such problems. If you could provide some link to a blog it would be great – Nikita Sep 14 '14 at 18:15
    If you want to replace with something as a quick hack, you could try replacing the NA's like `mean(x) +rnorm(length(missing(x)))*sd(x)`. That will not take account of correlations between the missings (or the correlations of the measured), but at least it won't seriously inflate the significance of the results. Better would be to get experience with the packages that handle imputation of missing values. There are quite a few subtleties underlying the problem. – IRTFM Sep 14 '14 at 20:10
    @42- I realize this comment's a couple years old. However, was the code literally meant `mean(x)+rnorm(length(missing(x)))*sd(x)`? When I run it, I get `Error in missing(x) : invalid use of 'missing'`. I expect the intention was to take the mean of the available values for x, then add rnorm(length of NAs)*sd(available values for x). Correct? I loved the malpractice line :-). I'm personally looking for a quick hack because I'm working with the '98 KDD cup dataset that has 120+ attributes with NAs. I'd like to drop most of them, and the instructions are to exclude only >= .995 NA . . . – Daniel Fletcher Aug 28 '16 at 03:47
  • By the way, this is what I inferred the intended code was: `mean(x, na.rm = T)+rnorm(sum(is.na(x)))*sd(x, na.rm = T)` – Daniel Fletcher Aug 28 '16 at 05:14
    Was meant more as pseudo-code. Would need proper indexing but perhaps `rnorm( n=sum(is.na(x)) , mean=mean(x), sd=sd(x) )` would be closer to working code. – IRTFM Aug 28 '16 at 06:55
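The "proper indexing" mentioned in the last comment could look roughly like the sketch below. It assumes a numeric vector with at least two observed values; `impute_rnorm` is a hypothetical helper name, not a function from any package, and this remains the quick hack described in the comments rather than a proper imputation method.

```r
# Sketch: fill NAs with random draws from a normal distribution whose
# mean and sd are estimated from the observed values of the same vector.
# impute_rnorm is a hypothetical name used only for this illustration.
impute_rnorm <- function(x) {
  miss <- is.na(x)                      # positions of the missing values
  x[miss] <- rnorm(n    = sum(miss),    # one draw per missing value
                   mean = mean(x, na.rm = TRUE),
                   sd   = sd(x, na.rm = TRUE))
  x                                     # return the whole vector
}

set.seed(1)                             # only to make the illustration reproducible
x <- c(2, 4, NA, 6, NA, 8)
filled <- impute_rnorm(x)
```

The observed values are left untouched; only the positions that were NA receive random draws.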

A relatively simple modification of your code should solve the issue:

for(i in 1:ncol(data)){
  data[is.na(data[,i]), i] <- mean(data[,i], na.rm = TRUE)
}

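A likely reason the loop in the question leaves the NAs in place is that `data[i]` is still a data frame, so `mean()` cannot average it. The sketch below uses a small made-up data frame (the values are an assumption for illustration) to show the difference between `data[i]` and `data[[i]]`.

```r
# Toy data; the name `data` mirrors the question, the values are made up.
data <- data.frame(a = c(1, NA, 3), b = c(NA, 4, 6))

# data[1] is a one-column data frame, so mean() returns NA (with a warning)
# and the NAs end up being "replaced" by NA.
m_bad <- suppressWarnings(mean(data[1], na.rm = TRUE))

# data[[1]] (or data[, 1]) extracts the column as a numeric vector.
m_good <- mean(data[[1]], na.rm = TRUE)

# The same loop written with [[ ]] and seq_along():
for (i in seq_along(data)) {
  data[[i]][is.na(data[[i]])] <- mean(data[[i]], na.rm = TRUE)
}
```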
If DF is your data frame of numeric columns:



Using only base R, define a function that does it for one column and then lapply it over every column:

NA2mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
replace(DF, TRUE, lapply(DF, NA2mean))

The last line could be replaced with the following if it's OK to overwrite the input:

DF[] <- lapply(DF, NA2mean)
G. Grothendieck
    Strange this doesn't have more upvotes or the best answer choice for that matter. Very succinct implementation. Thanks. – Ekaba Bisong Nov 19 '16 at 09:16
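As a quick check of the base R approach, here is a small usage example. The data frame is made up for illustration and contains only numeric columns.

```r
NA2mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# Hypothetical toy data frame, numeric columns only.
DF <- data.frame(a = c(10, NA, 30), b = c(NA, 1, 2))

# DF[] keeps the data frame structure; a plain `DF <- lapply(...)`
# would turn DF into a list instead.
DF[] <- lapply(DF, NA2mean)
DF
#>    a   b
#> 1 10 1.5
#> 2 20 1.0
#> 3 30 2.0
```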

To add to the alternatives, using @akrun's sample data, I would do the following:

d1[] <- lapply(d1, function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
})

  • @A Handcart And Mohair. This is probably due to my limited background in programming: what does including `x` in the third line do? – Daniel Fletcher Aug 28 '16 at 04:20
  • Running the code a bit, I'm inferring the point, here, is to return the whole vector `x`, rather than just the replacement values, and then overwrite the whole df `d1` (per the open brackets `[]`), rather than overwriting only the `NA`s. – Daniel Fletcher Aug 28 '16 at 05:10
  • @DanielFletcher, That's pretty much it. – A5C1D2H2I1M1N2O1R2T1 Aug 28 '16 at 12:29

There is also a quick solution using the imputeTS package:

Steffen Moritz
    Honestly I think this is the best answer. Knew there had to be some function in another package to do this common task. – wordsforthewise Sep 27 '19 at 15:40
    imputeTS gives good results in my opinion. The package also includes another option based on Kalman filters, which the imputeTS developers recommend on their [info pages](https://cran.r-project.org/web/packages/imputeTS/vignettes/imputeTS-Time-Series-Missing-Value-Imputation-in-R.pdf). You can use it with `na_kalman(yourDataFrame)`. – NCC1701 Sep 25 '20 at 21:47

dplyr's mutate_all or mutate_at could be useful here:


library(dplyr)

df <- data.frame(a = sample(c(NA, 1:3), replace = TRUE, 10),
                 b = sample(c(NA, 101:103), replace = TRUE, 10),
                 c = sample(c(NA, 201:203), replace = TRUE, 10))

df

#>     a   b   c
#> 1   2 102 203
#> 2   1 102 202
#> 3   1  NA 203
#> 4   2 102 201
#> 5  NA 101 201
#> 6  NA 101 202
#> 7   1  NA 203
#> 8   1 101  NA
#> 9   2 101 203
#> 10  1 103 201

df %>% mutate_all(~ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x))          

#>        a       b        c
#> 1  2.000 102.000 203.0000
#> 2  1.000 102.000 202.0000
#> 3  1.000 101.625 203.0000
#> 4  2.000 102.000 201.0000
#> 5  1.375 101.000 201.0000
#> 6  1.375 101.000 202.0000
#> 7  1.000 101.625 203.0000
#> 8  1.000 101.000 202.1111
#> 9  2.000 101.000 203.0000
#> 10 1.000 103.000 201.0000

df %>% mutate_at(vars(a, b),~ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x))

#>        a       b   c
#> 1  2.000 102.000 203
#> 2  1.000 102.000 202
#> 3  1.000 101.625 203
#> 4  2.000 102.000 201
#> 5  1.375 101.000 201
#> 6  1.375 101.000 202
#> 7  1.000 101.625 203
#> 8  1.000 101.000  NA
#> 9  2.000 101.000 203
#> 10 1.000 103.000 201

lapply can be used instead of a for loop.

d1[] <- lapply(d1, function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))

This doesn't really have any advantages over the for loop, though maybe it's easier if you have non-numeric columns as well, in which case

d1[sapply(d1, is.numeric)] <- lapply(d1[sapply(d1, is.numeric)], function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))

is almost as easy.

  • Interestingly, after lapply, my "gather" commands from dplyr don't work. :( I posted this on a different question. – fiacobelli Jun 26 '17 at 19:04
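To make the mixed-column case concrete, here is a minimal sketch on a made-up data frame with one character column and one numeric column. The column names `id` and `score` are assumptions for illustration.

```r
# Hypothetical data: one character column, one numeric column with an NA.
d1 <- data.frame(id = c("x", "y", "z"),
                 score = c(1, NA, 5),
                 stringsAsFactors = FALSE)

# Logical vector marking which columns are numeric.
num <- sapply(d1, is.numeric)

# Only the numeric columns are passed to lapply() and overwritten.
d1[num] <- lapply(d1[num], function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))
```

The `id` column is never touched, so there is no attempt to take the mean of character data.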

You could also try:

 cM <- colMeans(d1, na.rm=TRUE)
 indx <- which(is.na(d1), arr.ind=TRUE)
 d1[indx] <- cM[indx[,2]]


Sample data:

d1 <- as.data.frame(matrix(sample(c(NA,0:5), 5*10, replace=TRUE), ncol=10))

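The colMeans approach relies on matrix indexing, which may be easier to follow on a tiny example. The sketch below uses a made-up three-row data frame instead of the random sample data, so the intermediate values can be checked by hand.

```r
# Small made-up data frame with three NAs.
d1 <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))

cM <- colMeans(d1, na.rm = TRUE)          # named vector: a = 2, b = 6

# One (row, col) pair per NA, listed column by column.
indx <- which(is.na(d1), arr.ind = TRUE)

# indx[, 2] is the column number of each NA, so cM[indx[, 2]] picks
# the matching column mean for every missing cell.
d1[indx] <- cM[indx[, 2]]
```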
A one-liner using tidyr's replace_na is


If your df has columns that are non-numeric, this takes a little bit more work than a one-liner.

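# NOTE: the ends of both pipes were missing; the colMeans() and
# replace_na() calls below are an assumed completion.
mean_to_fill <- select_if(ungroup(df), is.numeric) %>%
  colMeans(na.rm = TRUE)

bind_cols(select(df, group1, group2, group3),
          select_if(ungroup(df), is.numeric) %>%
            tidyr::replace_na(as.list(mean_to_fill)))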
Matias Andina
Marcus Ritt

Simply use the zoo package; it replaces all NA values with the mean of the column values:

# Let's say I have a data frame, df, as follows:
df <- data.frame(a=c(2,3,4,NA,5,NA),b=c(1,2,3,4,NA,NA))

# create a custom function
fillNAwithMean <- function(x){
    na_index <- which(is.na(x))
    mean_x <- mean(x, na.rm=TRUE)
    x[na_index] <- mean_x
    return(x)
}

(df <- apply(df,2,fillNAwithMean))
   a   b
2.0 1.0
3.0 2.0
4.0 3.0
3.5 4.0
5.0 2.5
3.5 2.5
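One thing to be aware of with this approach: apply() works on a matrix and returns a matrix, so `df` is no longer a data frame after the assignment. A minimal sketch of how to check this and convert back, reusing the same data (the name `df_filled` is an assumption for illustration):

```r
df <- data.frame(a = c(2, 3, 4, NA, 5, NA), b = c(1, 2, 3, 4, NA, NA))

fillNAwithMean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

m <- apply(df, 2, fillNAwithMean)   # result is a numeric matrix
is.matrix(m)

# Convert back if a data frame is needed downstream.
df_filled <- as.data.frame(m)
```

With non-numeric columns present, apply() would coerce everything to character first, so this route is best kept for all-numeric data.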

Similar to the answer pointed out by @Thomas, this can also be done using R's ifelse() function:

for(i in 1:ncol(data)){
  data[,i] <- ifelse(is.na(data[,i]),
                     ave(data[,i], FUN=function(y) mean(y, na.rm = TRUE)),
                     data[,i])
}

where the arguments to ifelse(TEST, YES, NO) are:

TEST - the logical condition to be checked, element by element

YES - the values used where the condition is TRUE

NO - the values used where the condition is FALSE

and ave(x, ..., FUN = mean) is an R function that computes averages over subsets of x. With no grouping variables, as here, it averages over the whole column.
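The two pieces can be tried separately on a short vector to see what each contributes. The vector below is made up for illustration.

```r
x <- c(2, NA, 4)

# Without grouping variables, ave() returns a vector of the same length
# as x in which every element is FUN applied to the whole of x.
col_mean <- ave(x, FUN = function(y) mean(y, na.rm = TRUE))

# ifelse() then picks, position by position, the mean where x is NA
# and the original value everywhere else.
filled <- ifelse(is.na(x), col_mean, x)
```

Here `col_mean` is 3 in every position and `filled` is 2, 3, 4.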

Aseem Yadav

With the data.table package you could use the set() function to loop over the columns and replace the NAs (or whatever you like) with an aggregate or value of your choice (here: the mean):


library(data.table)

# data
dt = copy(iris[ ,-5])
setDT(dt) # convert to a data.table by reference so := and set() work
dt[1:4, Sepal.Length := NA] # introduce NAs

# replace NAs with mean (or whatever function you like)
for (j in seq_along(names(dt))) {
  set(dt,
      i = which(is.na(dt[[j]])),
      j = j,
      value = mean(dt[[j]], na.rm = TRUE))
}
