83

Here's the link to my data.

My goal is to assign NA to all blank cells, whether the column is categorical or numerical. I am using na.strings="", but it does not assign NA to every blank cell.

## reading the data
dat <- read.csv("data2.csv")
head(dat)
  mon hr        acc   alc sex spd axles door  reg                                 cond1 drug1
1   8 21 No Control  TRUE   F   0     2    2      Physical Impairment (Eyes, Ear, Limb)     A
2   7 20 No Control FALSE   M 900     2    2                                Inattentive     D
3   3  9 No Control FALSE   F 100     2    2 2004                                Normal     D
4   1 15 No Control FALSE   M   0     2    2      Physical Impairment (Eyes, Ear, Limb)     D
5   4 21 No Control FALSE      25    NA   NA                                                D
6   4 20 No Control    NA   F  30     2    4                Drinking Alcohol - Impaired     D
       inj1 PED_STATE st rac1
1     Fatal      <NA>  F <NA>
2  Moderate      <NA>  F <NA>
3  Moderate      <NA>  M <NA>
4 Complaint      <NA>  M <NA>
5 Complaint      <NA>  F <NA>
6  Moderate      <NA>  M <NA>


## using na.strings
dat2 <- read.csv("data2.csv", header=T, na.strings="")
head(dat2)
  mon hr        acc   alc sex spd axles door  reg                                 cond1 drug1
1   8 21 No Control  TRUE   F   0     2    2 <NA> Physical Impairment (Eyes, Ear, Limb)     A
2   7 20 No Control FALSE   M 900     2    2 <NA>                           Inattentive     D
3   3  9 No Control FALSE   F 100     2    2 2004                                Normal     D
4   1 15 No Control FALSE   M   0     2    2 <NA> Physical Impairment (Eyes, Ear, Limb)     D
5   4 21 No Control FALSE      25    NA   NA <NA>                                  <NA>     D
6   4 20 No Control    NA   F  30     2    4 <NA>           Drinking Alcohol - Impaired     D
       inj1 PED_STATE st rac1
1     Fatal        NA  F   NA
2  Moderate        NA  F   NA
3  Moderate        NA  M   NA
4 Complaint        NA  M   NA
5 Complaint        NA  F   NA
6  Moderate        NA  M   NA
S Das
  • Please [use text, not images/links, for text--including tables & ERDs.](https://meta.stackoverflow.com/q/285551/3404097) Paraphrase or quote from other text. Use images only for what cannot be expressed as text or to augment text. Images cannot be searched for or cut & pasted. Include a legend/key & explanation with an image. Make your post self-contained. Insert images/links using edit functions. – philipxy Jul 01 '19 at 01:05

14 Answers

100

I'm assuming you are talking about row 5 column "sex." It could be the case that in the data2.csv file, the cell contains a space and hence is not considered empty by R.

Also, I noticed that in row 5 columns "axles" and "door", the original values read from data2.csv are string "NA". You probably want to treat those as na.strings as well. To do this,

dat2 <- read.csv("data2.csv", header=T, na.strings=c("","NA"))

EDIT:

I downloaded your data2.csv. Yes, there is a space in row 5 column "sex". So you want

na.strings=c(""," ","NA")
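The effect of this setting can be checked without the original file. The sketch below feeds a small inline CSV to read.csv() through its text argument; the sample rows are made up for illustration and only mimic the blank, single-space, and "NA" cells described above.

```r
# Made-up sample: row 2 has a single space in "sex" and a literal NA in "axles";
# row 3 has a truly empty "sex" cell.
csv_text <- "mon,sex,spd,axles
8,F,0,2
4, ,25,NA
7,,900,2
3,M,100,2"

dat2 <- read.csv(text = csv_text, header = TRUE,
                 na.strings = c("", " ", "NA"),
                 stringsAsFactors = FALSE)

# Both the space-only cell and the empty cell should be reported as missing
is.na(dat2$sex)
```

Listing " " explicitly only covers cells holding exactly one space; cells with several spaces would need their own entry or a clean-up step after reading.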
Badoe
36

You can use gsub to replace multiple variants of an empty cell, such as "" or a single space, with NA:

data= data.frame(cats=c('', ' ', 'meow'), dogs=c("woof", " ", NA))
apply(data, 2, function(x) gsub("^$|^ $", NA, x))
sclarky
  • 3
    Can also use `gsub("^$", NA, trimws(x))` to handle more than one space within a cell. Although, beware both of these approaches convert all columns to string/character variables (if not already). – JWilliman Jan 30 '19 at 21:14
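One way around the coercion mentioned in the comment is to touch only the character columns. The following is a minimal base R sketch with a made-up data frame; the column names are illustrative and not taken from the question's data.

```r
df <- data.frame(cats = c("", "   ", "meow"),
                 n = c(1, 2, 3),
                 stringsAsFactors = FALSE)

# Find the character columns; numeric columns are left alone
is_chr <- vapply(df, is.character, logical(1))

# trimws() reduces whitespace-only cells to "", which are then set to NA
df[is_chr] <- lapply(df[is_chr], function(x) {
  x[trimws(x) %in% ""] <- NA
  x
})

str(df)
```

Using %in% rather than == keeps cells that are already NA from producing NA in the index.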
28

A more eye-friendly solution using dplyr would be

require(dplyr)

## fake blank cells
iris[1,1]=""

## define a helper function
empty_as_na <- function(x){
    if("factor" %in% class(x)) x <- as.character(x) ## since ifelse wont work with factors
    ifelse(as.character(x)!="", x, NA)
}

## transform all columns
iris %>% mutate_each(funs(empty_as_na)) 

To apply the correction to just a subset of columns, you can specify the columns of interest using dplyr's column matching syntax. Example: mutate_each(funs(empty_as_na), matches("Width"), Species)

In case your table contains dates, you should consider using a more typesafe version of ifelse.
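The concern with dates is that ifelse() drops attributes such as the Date class. A small illustration with made-up values, comparing it with subset assignment, which keeps the class:

```r
d <- as.Date(c("2019-07-01", "2016-03-10", NA))
cutoff <- as.Date("2018-01-01")

# ifelse() returns the underlying day counts and loses the Date class
via_ifelse <- ifelse(d > cutoff, d, NA)
class(via_ifelse)

# Subset assignment keeps the Date class
via_index <- d
via_index[which(d <= cutoff)] <- NA
class(via_index)
```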

Holger Brandl
  • 11
    How is adding a new library, creating a new function is more *eye-friendly*? And I think you will need `ifelse(x %in% c(""," ","NA"), NA, x)`. – zx8754 Mar 10 '16 at 08:47
  • 4
    Using a function along with `mutate_each` gives more flexibility and a reusable pattern. `dplyr` is omnipresent in nowadays R-workflows and was just added to make the answer self-contained. I think `x!=""` is correct here, since neither " " nor "NA" are blank. Furthermore @sclarky's answer fails for data-frames containing numbers, and @Badoe's does not really solve the problem for existing data.frames, so no other answer seems to answer the question yet in a generic fashion. I'm happy to learn about a better solutions. – Holger Brandl Mar 10 '16 at 09:30
  • 1
    *dplyr is omnipresent in nowadays R-workflows*- no its not. And what does "*and @Badoe's does not really solve the problem for existing data.frames*" even means? Can you expand a bit on that statement? – David Arenburg Mar 10 '16 at 10:39
  • 10
    Badoe details out how to configure `read.csv` to transform blank cells to NA when reading a table from a file. However, since the title of the question is "Change the Blank Cells to “NA”" a complete answer should imho cover the situation where a data.frame is already in the environment and the user wants to get rid of blank cells. – Holger Brandl Mar 10 '16 at 22:18
  • This answer is also very useful if you read Excel files, where the na.strings option doesn't exist – Marinka Sep 12 '18 at 15:44
  • 1
    This might not be what the OP was looking for, but it helped me count missing values including empty strings and NAs. `df %>% mutate_all(funs(empty_as_na)) %>% summarize_all(funs(sum(is.na(.))))` While dplyr may or may not be widespread in adoption, it does enjoy popularity in a large subset of R users, including me, so thanks for this solution. – Dannid Dec 03 '18 at 17:51
25

This should do the trick

dat <- dat %>% mutate_all(na_if,"")
Jonathan
  • 1
    I tried that on an sf object, and it threw a parsing error: unknown WKB type 12. Seems mutate tried to replace something in the geometries. – aae Jun 05 '19 at 13:08
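A variant restricted to character columns may be more robust, since it never asks na_if() to compare a numeric or geometry column with a string. This is a sketch assuming dplyr 1.0.0 or later (for across() and where()); the data frame is made up for illustration.

```r
library(dplyr)

dat <- data.frame(
  sex = c("F", "", "M"),
  spd = c(0, 25, 900),
  stringsAsFactors = FALSE
)

# Only character columns are passed to na_if(); other columns are untouched
dat <- dat %>%
  mutate(across(where(is.character), ~ na_if(.x, "")))

dat
```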
15

I recently ran into similar issues, and this is what worked for me.

If the variable is a character vector, then a simple df$Var[df$Var == ""] <- NA should suffice (a numeric column cannot hold "" in the first place; read.csv already reads its blank cells as NA). But if the variable is a factor, you need to convert it to character first, replace the "" cells with the value you want, and convert it back to a factor. Case in point: your sex variable, which I assume is a factor. To replace the empty cell, I would do the following:

df$Var <- as.character(df$Var)
df$Var[df$Var==""] <- NA
df$Var <- as.factor(df$Var)
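For factors, an alternative that avoids the round trip through character is to set the empty level itself to NA. A minimal sketch with a made-up factor:

```r
sex <- factor(c("F", "M", "", "F"))

# Assigning NA to the "" level drops that level and marks its cells as missing
levels(sex)[levels(sex) == ""] <- NA

levels(sex)
is.na(sex)
```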
morgan121
ZKH
6

As of dplyr 1.0.0, we can use across().

For all columns:

dat <- dat %>%
   mutate(across(everything(), ~ifelse(.=="", NA, as.character(.))))

For individual columns:

dat <- dat %>%
   mutate(across(c("Age","Gender"), ~ifelse(.=="", NA, as.character(.))))

As of dplyr 0.8.0, the way this should be written changed: funs() is deprecated in .funs. Where we previously wrote funs(name = f(.)), we now use a list of formulas, list(name = ~f(.)).

Note that there is also a much simpler way to list the column names! (Both the column name and the column index work.)

dat <- dat %>%
mutate_at(.vars = c("Age","Gender"),
    .funs = list(~ifelse(.=="", NA, as.character(.))))

Original Answer:

You can also use mutate_at in dplyr

dat <- dat %>%
mutate_at(vars(colnames(.)),
        .funs = funs(ifelse(.=="", NA, as.character(.))))

Select individual columns to change:

dat <- dat %>%
mutate_at(vars(colnames(.)[names(.) %in% c("Age","Gender")]),
        .funs = funs(ifelse(.=="", NA, as.character(.))))
camnesia
  • `~ifelse(.=="", NA, as.character(.))))` Could you please tell me what does `.==` means? – Daman deep May 06 '21 at 05:06
  • `==` is used for equality testing and `.` references variables it is testing - for example for `across(c("Age","Gender")`, `.` would reference `"Age"` and then `"Gender"` – camnesia May 06 '21 at 08:10
3

My function takes into account factors, character vectors and potential attributes (useful if you use the haven or foreign package to read external files). It also allows matching different self-defined na.strings. To transform all columns, simply use lapply: df[] = lapply(df, blank2na, na.strings=c('','NA','na','N/A','n/a','NaN','nan'))

See the comments for more details:

#' Replaces blank-ish elements of a factor or character vector to NA
#' @description Replaces blank-ish elements of a factor or character vector to NA
#' @param x a vector of factor or character or any type
#' @param na.strings case sensitive strings that will be converted to NA. The function will do a trimws(x,'both') before conversion. If NULL, do only trimws, no conversion to NA.
#' @return Returns a vector trimws (always for factor, character) and NA converted (if matching na.strings). Attributes will also be kept ('label','labels', 'value.labels').
#' @seealso \code{\link{ez.nan2na}}
#' @export
blank2na = function(x,na.strings=c('','.','NA','na','N/A','n/a','NaN','nan')) {
    if (is.factor(x)) {
        lab = attr(x, 'label', exact = T)
        labs1 <- attr(x, 'labels', exact = T)
        labs2 <- attr(x, 'value.labels', exact = T)

        # trimws will convert factor to character
        x = trimws(x,'both')
        if (! is.null(lab)) lab = trimws(lab,'both')
        if (! is.null(labs1)) labs1 = trimws(labs1,'both')
        if (! is.null(labs2)) labs2 = trimws(labs2,'both')

        if (!is.null(na.strings)) {
            # convert to NA
            x[x %in% na.strings] = NA
            # also remember to remove na.strings from value labels 
            labs1 = labs1[! labs1 %in% na.strings]
            labs2 = labs2[! labs2 %in% na.strings]
        }

        # the levels will be reset here
        x = factor(x)

        if (! is.null(lab)) attr(x, 'label') <- lab
        if (! is.null(labs1)) attr(x, 'labels') <- labs1
        if (! is.null(labs2)) attr(x, 'value.labels') <- labs2
    } else if (is.character(x)) {
        lab = attr(x, 'label', exact = T)
        labs1 <- attr(x, 'labels', exact = T)
        labs2 <- attr(x, 'value.labels', exact = T)

        # x is already character here; trimws only strips surrounding whitespace
        x = trimws(x,'both')
        if (! is.null(lab)) lab = trimws(lab,'both')
        if (! is.null(labs1)) labs1 = trimws(labs1,'both')
        if (! is.null(labs2)) labs2 = trimws(labs2,'both')

        if (!is.null(na.strings)) {
            # convert to NA
            x[x %in% na.strings] = NA
            # also remember to remove na.strings from value labels 
            labs1 = labs1[! labs1 %in% na.strings]
            labs2 = labs2[! labs2 %in% na.strings]
        }

        if (! is.null(lab)) attr(x, 'label') <- lab
        if (! is.null(labs1)) attr(x, 'labels') <- labs1
        if (! is.null(labs2)) attr(x, 'value.labels') <- labs2
    } else {
        x = x
    }
    return(x)
}
Jerry T
2

While many of the options above work well, I found the coercion of non-target variables to chr problematic. Using ifelse and grepl within lapply avoids this off-target effect (in limited testing). Using sclarky's regular expression in grepl:

set.seed(42)
x1 <- sample(c("a","b"," ", "a a", NA), 10, TRUE)
x2 <- sample(c(rnorm(length(x1),0, 1), NA), length(x1), TRUE)

df <- data.frame(x1, x2, stringsAsFactors = FALSE)

The problem of coercion to character class:

df2 <- lapply(df, function(x) gsub("^$|^ $", NA, x))
lapply(df2, class)

$x1
[1] "character"

$x2
[1] "character"

Resolution with use of ifelse:

df3 <- lapply(df, function(x) ifelse(grepl("^$|^ $", x)==TRUE, NA, x))
lapply(df3, class)

$x1
[1] "character"

$x2
[1] "numeric"
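Note that lapply() on its own returns a plain list. Assigning into df[] keeps the data.frame structure, and the pattern "^\\s*$" (an assumption that goes beyond the single-space case above) also catches cells with several spaces. A sketch with made-up data:

```r
df <- data.frame(x1 = c("a", " ", "", "a a", NA),
                 x2 = c(1.5, NA, 3, 4, 5),
                 stringsAsFactors = FALSE)

# df[] <- ... replaces the columns in place, so df stays a data.frame
df[] <- lapply(df, function(x) ifelse(grepl("^\\s*$", x), NA, x))

sapply(df, class)
```

The interior space in "a a" is untouched because the pattern only matches cells that are entirely whitespace.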

Todd D
2

I suspect everyone has an answer already, though in case someone comes looking, dplyr na_if() would be (from my perspective) the more efficient of those mentioned:

# Import CSV, convert all 'blank' cells to NA
dat <- read.csv("data2.csv") %>% na_if("")

Here is an additional approach using readr's read_csv function, which I just picked up (probably widely known, but I'll archive it here for future users). It is very straightforward and more versatile than the above, as you can capture all types of blank and NA-related values in your CSV file:

dat <- read_csv("data2.csv", na = c("", "NA", "N/A"))

Note the underscore in readr's read_csv versus the "." in base R's read.csv.

Hopefully this helps someone who wanders upon the post!

glenn_in_boston
1

I think data.table is your best bet (for code simplicity and speed). The following would replace all blanks in a dataset called "data" with NA:

data[data==""] <- NA
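The same expression also works on a plain base R data.frame, with no extra package loaded. A minimal check on made-up data, showing that the numeric column keeps its type:

```r
data <- data.frame(sex = c("F", "", "M"),
                   spd = c(0, 25, 900),
                   stringsAsFactors = FALSE)

# data == "" builds a logical matrix; TRUE cells are overwritten with NA
data[data == ""] <- NA

str(data)
```

This only matches cells that are exactly ""; cells holding a space would need a second pass or a trimmed comparison.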
M. Warren
0

Couldn't you just use

dat <- read.csv("data2.csv",na.strings=" ",header=TRUE)

This should convert all blanks to NA as the data are read in. Be sure to put a space between the quotation marks.

pyll
0

For those wondering about a solution using the data.table way, here is one I wrote a function for, available on my Github:

library(devtools)
source_url("https://github.com/YoannPa/Miscellaneous/blob/master/datatable_pattern_substitution.R?raw=TRUE")
dt.sub(DT = dat2, pattern = "^$|^ $",replacement = NA)
dat2

The function goes through each column to identify which columns contain pattern matches. gsub() is then applied only to the columns containing matches for the pattern "^$|^ $", substituting the matches with NA.

I will keep improving this function to make it faster.

Yoann Pageaud
0

This works for me.

dataset <- read.csv(file = "data.csv",header=TRUE,fill = T,na.strings = "")
Dale K
-3

sub() is part of base R, so no extra package is needed. Replace file and colname with your own data frame and column names:

file$colname <- sub("^-$", NA, file$colname)

This converts every cell in that column that contains only "-" to NA.

If blank cells are marked differently (for example "" or 0 instead of "-"), change the pattern to match that marker.

E.g. if a blank cell is "" instead of "-", anchor the pattern, because an unanchored "" matches every string and would turn the whole column into NA:

file$colname <- sub("^$", NA, file$colname)