I have a large dataframe and would like to drop any rows where 20% of the cells are empty. I can't seem to find code to do this efficiently.
If you can find the number missing in a row, you can easily find the percentage, so I guess something like https://stackoverflow.com/questions/37801338/r-count-nas-per-row-in-dataframe would be super helpful – thelatemail Jul 18 '19 at 04:46
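A base-R sketch of that idea (the `df` here is a small illustrative data frame, not the asker's data):

```r
# small illustrative data frame with some missing cells
df <- data.frame(a = c(1, NA, 3),
                 b = c(2, NA, 6),
                 c = c(7, 8, NA))

# rowMeans over is.na() gives the fraction of empty cells in each row
na_frac <- rowMeans(is.na(df))

# keep only rows with fewer than 20% missing cells
df_clean <- df[na_frac < 0.2, ]
```

`rowMeans(is.na(df))` divides the per-row NA count by the number of columns in one step, so there is no need to compute the percentage separately.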
Welcome to Stack Overflow. Please add sample data that makes your problem / question fully reproducible. Are your columns numeric or character? How many columns / rows does your data have? Give us more information, and we can help you better. – PavoDive Jul 18 '19 at 04:51
2 Answers
Lacking a lot of information, I think you can solve it with the help of `data.table`. See `?data.table` for more information.
library(data.table)
# make a reproducible example
set.seed(1)
dt <- data.table(matrix(sample(c(NA, 1), 500, TRUE, prob = c(0.78, 0.22)),
ncol = 50))
Apply a function that counts the non-NA values per row, compare the count to the minimum allowed (80% of the 50 columns, since at most 20% may be missing) and return only those rows that pass:
dt[, test := apply(.SD, 1, function(x) sum(!is.na(x))) > .8 * 50][test == TRUE]
PavoDive
In my sample data all values are either 1 or NA, so `mean` would work. However, in a more general case the values will be different and it won't hold. – PavoDive Jul 18 '19 at 10:54
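For the general case, taking the mean of `is.na()` rather than of the values themselves works regardless of the cell contents; a sketch reusing the `dt` built above:

```r
library(data.table)

# same reproducible example as in the answer
set.seed(1)
dt <- data.table(matrix(sample(c(NA, 1), 500, TRUE, prob = c(0.78, 0.22)),
                        ncol = 50))

# fraction of NA cells per row -- independent of what the non-NA values are
na_frac <- rowMeans(is.na(dt))

# keep rows with fewer than 20% missing cells
kept <- dt[na_frac < 0.2]
```

`rowMeans(is.na())` also avoids the per-row `apply()` call, which can matter on a large dataframe.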
With `map` and `dplyr` you could try this:
df:
library(dplyr)
library(purrr)
library(tidyr)

df <- tibble(id = sequence(6),
             var1 = c(NA, NA, NA, "x", "x", "X"),
             var2 = c(NA, NA, NA, "x", "x", "X"),
             var3 = c(NA, "Y", NA, "x", "x", "X"),
             var4 = c(NA, NA, NA, "x", "x", "X"))
Then nest, calculate the perc_na in each row, and filter out the rows at or above the 20% threshold:
df1 <- df %>%
  nest(data = -id) %>%
  mutate(
    perc_na = map_dbl(data, ~ sum(is.na(.x)) / ncol(.x))
  ) %>%
  filter(perc_na < 0.2) %>%
  select(-perc_na) %>%
  unnest(data)
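A flatter sketch without nesting, using the same `df` and the question's 20% threshold (`rowMeans`/`select`/`filter` are standard dplyr and base R, not part of the answer above):

```r
library(dplyr)

df <- tibble(id = sequence(6),
             var1 = c(NA, NA, NA, "x", "x", "X"),
             var2 = c(NA, NA, NA, "x", "x", "X"),
             var3 = c(NA, "Y", NA, "x", "x", "X"),
             var4 = c(NA, NA, NA, "x", "x", "X"))

# fraction of NA cells per row, over the var columns only
na_frac <- rowMeans(is.na(select(df, -id)))

# keep rows with fewer than 20% missing cells
df1 <- filter(df, na_frac < 0.2)
```

This skips the nest/unnest round trip, at the cost of computing the NA fraction outside the pipe.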
tvdo