I have a large dataframe and would like to drop any rows where 20% of the cells are empty. I can't seem to find code to do this efficiently.
If you can find the number missing in a row, you can easily find the percentage, so I guess something like https://stackoverflow.com/questions/37801338/r-count-nas-per-row-in-dataframe would be super helpful – thelatemail Jul 18 '19 at 04:46
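A base-R sketch of that idea (the `df` here is a small illustrative data frame, not the asker's data):

```r
# small illustrative data frame with some missing cells
df <- data.frame(a = c(1, NA, 3),
                 b = c(2, NA, 6),
                 c = c(7, 8, NA))

# rowMeans over is.na() gives the fraction of empty cells in each row
na_frac <- rowMeans(is.na(df))

# keep only rows with fewer than 20% missing cells
df_clean <- df[na_frac < 0.2, ]
```

`rowMeans(is.na(df))` divides the per-row NA count by the number of columns in one step, so there is no need to compute the percentage separately.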
Welcome to Stack Overflow. Please add sample data that makes your problem / question fully reproducible. Are your columns numeric or character? How many columns / rows does your data have? Give us more information, and we can help you better. – PavoDive Jul 18 '19 at 04:51
2 Answers
Lacking a lot of information, I think you can solve it with the help of `data.table`. See `?data.table` for more information.
library(data.table)
# make a reproducible example
set.seed(1)
dt <- data.table(matrix(sample(c(NA, 1), 500, TRUE, prob = c(0.78, 0.22)),
ncol = 50))
Apply a function that counts the non-NA values per row, compare the count to the minimum allowed (80% of the 50 columns, since at most 20% may be missing) and return only those rows that pass:
dt[, test := apply(.SD, 1, function(x) sum(!is.na(x))) > .8 * 50][test == TRUE]
PavoDive
In my sample data all values are either 1 or NA, so `mean` would work. However, in a more general case the values will be different and it won't hold. – PavoDive Jul 18 '19 at 10:54
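For the general case, taking the mean of `is.na()` rather than of the values themselves works regardless of the cell contents; a sketch reusing the `dt` built above:

```r
library(data.table)

# same reproducible example as in the answer
set.seed(1)
dt <- data.table(matrix(sample(c(NA, 1), 500, TRUE, prob = c(0.78, 0.22)),
                        ncol = 50))

# fraction of NA cells per row -- independent of what the non-NA values are
na_frac <- rowMeans(is.na(dt))

# keep rows with fewer than 20% missing cells
kept <- dt[na_frac < 0.2]
```

`rowMeans(is.na())` also avoids the per-row `apply()` call, which can matter on a large dataframe.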
With `map` and `dplyr` you could try this:
df:
library(dplyr)
library(purrr)
library(tidyr)

df <- tibble(id = sequence(6),
             var1 = c(NA, NA, NA, "x", "x", "X"),
             var2 = c(NA, NA, NA, "x", "x", "X"),
             var3 = c(NA, "Y", NA, "x", "x", "X"),
             var4 = c(NA, NA, NA, "x", "x", "X"))
Then nest, calculate the perc_na in each row, and filter out the rows at or above the 20% threshold:
df1 <- df %>%
  nest(data = -id) %>%
  mutate(
    perc_na = map_dbl(data, ~ sum(is.na(.x)) / ncol(.x))
  ) %>%
  filter(perc_na < 0.2) %>%
  select(-perc_na) %>%
  unnest(data)
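A flatter sketch without nesting, using the same `df` and the question's 20% threshold (`rowMeans`/`select`/`filter` are standard dplyr and base R, not part of the answer above):

```r
library(dplyr)

df <- tibble(id = sequence(6),
             var1 = c(NA, NA, NA, "x", "x", "X"),
             var2 = c(NA, NA, NA, "x", "x", "X"),
             var3 = c(NA, "Y", NA, "x", "x", "X"),
             var4 = c(NA, NA, NA, "x", "x", "X"))

# fraction of NA cells per row, over the var columns only
na_frac <- rowMeans(is.na(select(df, -id)))

# keep rows with fewer than 20% missing cells
df1 <- filter(df, na_frac < 0.2)
```

This skips the nest/unnest round trip, at the cost of computing the NA fraction outside the pipe.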
tvdo