
I am working with a data set that uses the value -4, rather than NA, to mark missing values. I found the following answer, which gets me close, but I am not sure how to generalize it:

colnames(data)[colSums(is.na(data)) > 1000]

I tried using function(x) which(x < 0) in place of is.na(data), but that did not go very well.

How can I achieve this aim?

Thanks in advance.

TheCodeNovice
  • Is `dplyr::na_if()` what you need? https://dplyr.tidyverse.org/reference/na_if.html – Phil May 11 '21 at 16:08
  • Why don't you first replace these user-defined NAs with proper NAs, as in `data %>% mutate(across(everything(), ~na_if(., -4)))`, and only then select your data? – GuedesBF May 11 '21 at 23:22

1 Answer


One option using dplyr would be to count all of the -4 values, then select only the columns with a count of over 1000.

library(dplyr)

data %>% 
  summarize_all(~sum(. == -4)) %>% 
  select_if(~. > 1000) %>% 
  colnames()
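
As a rough sketch of the same counting idea on a small, made-up data frame (the column names and the threshold of 2 are placeholders, not from the question), the pipeline can also be written with `across()` and `where()`, which recent dplyr versions recommend over the `_all`/`_if` variants. Adding `na.rm = TRUE` is an assumption on my part that real NA values may also be present; without it, a column containing NA would produce an NA count.

```r
library(dplyr)

# Hypothetical toy data: -4 marks a missing value; column d also has a real NA
data <- data.frame(
  a = c(-4, -4, -4, 1),
  b = c(2, -4, 3, 5),
  c = c(-4, -4, -4, -4),
  d = c(1, NA, 3, -4)
)

threshold <- 2  # stands in for the 1000 used in the question

flagged <- data %>%
  # count the -4 codes in each column; na.rm = TRUE skips real NAs
  summarise(across(everything(), ~ sum(.x == -4, na.rm = TRUE))) %>%
  # keep only the columns whose count exceeds the threshold
  select(where(~ .x > threshold)) %>%
  colnames()

print(flagged)
```

With this toy data, columns a and c are flagged, since they contain 3 and 4 occurrences of -4 respectively.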

To be even more explicit, you could convert the -4 values to NA first.

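data %>% 
  # na_if() works on vectors, so apply it to each column
  # (assumes all columns are numeric)
  mutate(across(everything(), ~na_if(., -4))) %>%
  summarize_all(~sum(is.na(.))) %>% 
  select_if(~. > 1000) %>% 
  colnames()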

Or a base solution, slightly modifying your original code to count -4 values rather than NA values.

colnames(data)[colSums(data == -4) > 1000]
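
To spell out why this works where `which()` did not: `colSums()` needs a logical matrix with the same shape as the data, which is what both `is.na(data)` and `data == -4` return, whereas `which()` returns a vector of positions. The sketch below uses a made-up data frame and a threshold of 2 as placeholders. The `na.rm = TRUE` argument and the recode step are my additions, assuming the data may also contain real NA values.

```r
# Hypothetical toy data: -4 marks a missing value; column d also has a real NA
data <- data.frame(
  a = c(-4, -4, -4, 1),
  b = c(2, -4, 3, 5),
  c = c(-4, -4, -4, -4),
  d = c(1, NA, 3, -4)
)
threshold <- 2  # stands in for the 1000 used in the question

# Option 1: count the -4 codes directly from the logical matrix
by_code <- colnames(data)[colSums(data == -4, na.rm = TRUE) > threshold]

# Option 2: treat any negative value as a missing-value code
by_negative <- colnames(data)[colSums(data < 0, na.rm = TRUE) > threshold]

# Option 3: recode -4 to NA on a copy, then reuse the original is.na() line
cleaned <- data
cleaned[cleaned == -4 & !is.na(cleaned)] <- NA
by_na <- colnames(cleaned)[colSums(is.na(cleaned)) > threshold]

print(by_code)
print(by_negative)
print(by_na)
```

Note that option 3 counts real NA values together with the recoded -4 values, so it can flag more columns than option 1 on data that contains both.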
nniloc