Data Wrangle in R - How do I list out columns that have over a certain number of a specific value from a tibble

Question

I am working with a data set that uses the value -4 instead of NA to mark missing values. I found the following answer, which gets me close, but I am not sure how to generalize it to count -4 values rather than NA values:

colnames(data)[colSums(is.na(data)) > 1000]

I tried using function(x) which(x < 0) in place of is.na(data), but that did not work.

How can I achieve this aim?

Thanks in advance.

Is `dplyr::na_if()` what you need? https://dplyr.tidyverse.org/reference/na_if.html — Phil, May 11 '21 at 16:08
Why don't you first replace these user-defined NAs with proper NAs, as in `data %>% mutate(across(everything(), ~na_if(., -4)))`, and only then select your data? — GuedesBF, May 11 '21 at 23:22

nniloc · Answer 1 · 2021-05-11T17:16:09.867

One option using dplyr would be to count all of the -4 values, then select only the columns with a count of over 1000.

library(dplyr)

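data %>% 
  # count the -4 values in every column (one-row summary)
  summarize_all(~sum(. == -4)) %>% 
  # keep only the columns whose count exceeds 1000
  select_if(~. > 1000) %>% 
  colnames()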
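
The `_all` and `_if` verbs are superseded in newer dplyr releases. A roughly equivalent sketch using `select(where(...))` might look like the following. The toy tibble and the threshold of 2 are made up for illustration, and `na.rm = TRUE` is an added assumption so that genuine NA values do not turn a column's count into NA.

```r
library(dplyr)

# Hypothetical toy data: -4 stands in for missing values
toy <- tibble(
  a = c(-4, -4, -4, 1),
  b = c(-4, 2, 3, 4),
  c = c(-4, -4, NA, -4)
)

# Keep only the columns where -4 appears more than twice,
# then list their names
over_threshold <- toy %>%
  select(where(~ sum(.x == -4, na.rm = TRUE) > 2)) %>%
  colnames()

over_threshold
```

On real data, the threshold of 2 would be replaced by 1000 and `toy` by `data`.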

To be even more explicit, you could convert the -4 values to NA first.

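data %>% 
  # apply na_if() column by column; recent dplyr versions (1.1.0+)
  # no longer accept a whole data frame in na_if()
  mutate(across(everything(), ~na_if(., -4))) %>%
  summarize_all(~sum(is.na(.))) %>% 
  select_if(~. > 1000) %>% 
  colnames()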

Or a base solution, slightly modifying your original code to count -4 values rather than NA values.

colnames(data)[colSums(data == -4) > 1000]
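
To generalize the base approach to any sentinel value and threshold, it could be wrapped in a small helper. This is only a sketch: the function name `cols_with_many`, its arguments, and the toy data frame are hypothetical, and `na.rm = TRUE` is an assumption added so that real NA values are ignored when counting.

```r
# Hypothetical helper: names of the columns in which `value`
# occurs more than `n` times
cols_with_many <- function(data, value, n) {
  # data == value gives a logical matrix; colSums counts TRUEs per column
  counts <- colSums(data == value, na.rm = TRUE)
  names(counts)[counts > n]
}

# Made-up example data: -4 marks missing values
toy <- data.frame(
  a = c(-4, -4, -4, 1),
  b = c(-4, 2, 3, 4),
  c = c(-4, -4, NA, -4)
)

cols_with_many(toy, value = -4, n = 2)
```

For the data in the question, the call would be `cols_with_many(data, -4, 1000)`.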
