Why should I use | vs any() when I'm comparing columns in dplyr::mutate()?

And why do they return different answers?

For example:

library(tidyverse)
df <- tibble(x = rep(c(TRUE, FALSE, TRUE), 4), y = rep(c(TRUE, FALSE, TRUE, FALSE), 3), allF = FALSE, allT = TRUE)

df %>%
    mutate(
        withpipe = x | y         # returns the expected result for each row
      , usingany = any(c(x, y))  # returns TRUE for every row
    )

What's going on here and why should I use one way of comparing values over another?

crazybilly

3 Answers

The difference between the two is how the answer is calculated:

  • for |, elements are compared pairwise and boolean logic is applied to each pair. In the example above, each x and y pair is compared and a logical value is returned for each pair, giving 12 results, one for each row of the data frame.
  • any(), on the other hand, looks at the entire vector and returns a single value. In the example above, the mutate line that calculates the new usingany column is effectively doing any(c(df$x, df$y)), which returns TRUE because there is at least one TRUE value in df$x or df$y. That single value is then recycled to every row of the data frame.
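The two behaviours described above can be reproduced outside of mutate() with plain vectors. This is a minimal sketch that reuses the x and y values from the question; the names elementwise, collapsed and recycled are only illustrative.

```r
# Same x and y as in the question, as plain logical vectors
x <- rep(c(TRUE, FALSE, TRUE), 4)
y <- rep(c(TRUE, FALSE, TRUE, FALSE), 3)

elementwise <- x | y         # one result per position
collapsed   <- any(c(x, y))  # one result for all 24 values combined

length(elementwise)  # 12
length(collapsed)    # 1

# mutate() fills every row with a length-1 result, much like rep() does here
recycled <- rep(collapsed, length(x))
```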

You can see this in action using the other columns in your data frame:

df %>% 
    mutate(
        usingany = any(c(x,y)) # returns all TRUE
      , allfany  = any(allF)   # returns all FALSE because every value in df$allF is FALSE
    )

To answer when you should use which: use | when you want to compare elements row-wise. Use any() when you want a single answer about an entire column (or, with group_by(), about each group).
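As an illustration of a case where any() is the right tool, here is a rough sketch of a per-group summary. The grp and flag columns are hypothetical and are not part of the data frame in the question.

```r
library(dplyr)

# Hypothetical data: three groups of four rows each
flags <- tibble(
  grp  = rep(c("a", "b", "c"), each = 4),
  flag = c(FALSE, FALSE, FALSE, FALSE,   # group a: no TRUE at all
           TRUE,  FALSE, FALSE, FALSE,   # group b: one TRUE
           FALSE, FALSE, TRUE,  TRUE)    # group c: two TRUEs
)

# any() collapses each group to a single value, which is what summarise() expects
result <- flags %>%
  group_by(grp) %>%
  summarise(any_flag = any(flag))
```

Here any() answers "does this group contain at least one TRUE?", a question that | cannot answer because it only ever compares values at the same position.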

TLDR, when using dplyr::mutate(), you're usually going to want to use |.

crazybilly

You can also use rowwise(), which makes any() operate on one row at a time.

df <- tibble(x = rep(c(TRUE, FALSE, TRUE), 4), y = rep(c(TRUE, FALSE, TRUE, FALSE), 3), allF = FALSE, allT = TRUE)

df %>%
    rowwise() %>%
    mutate(x_or_y = any(x, y))

Output:

# A tibble: 12 x 5  
    x     y     allF  allT  x_or_y  
    <lgl> <lgl> <lgl> <lgl> <lgl>   
  1 TRUE  TRUE  FALSE TRUE  TRUE   
  2 FALSE FALSE FALSE TRUE  FALSE  
  3 TRUE  TRUE  FALSE TRUE  TRUE   
  4 TRUE  FALSE FALSE TRUE  TRUE   
  5 FALSE TRUE  FALSE TRUE  TRUE   
  6 TRUE  FALSE FALSE TRUE  TRUE   
  7 TRUE  TRUE  FALSE TRUE  TRUE   
  8 FALSE FALSE FALSE TRUE  FALSE  
  9 TRUE  TRUE  FALSE TRUE  TRUE  
 10 TRUE  FALSE FALSE TRUE  TRUE  
 11 FALSE TRUE  FALSE TRUE  TRUE  
 12 TRUE  FALSE FALSE TRUE  TRUE  
MokeEire

You can use either the OR operator | or any().

The same applies to & and all().
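The parallel between & and all() can be checked with the x and y vectors from the question. This is only a small sketch; both_elementwise and both_collapsed are illustrative names.

```r
# Same x and y as in the question, as plain logical vectors
x <- rep(c(TRUE, FALSE, TRUE), 4)
y <- rep(c(TRUE, FALSE, TRUE, FALSE), 3)

both_elementwise <- x & y         # one result per position, like |
both_collapsed   <- all(c(x, y))  # a single value, like any()

length(both_elementwise)  # 12
both_collapsed            # FALSE, because at least one value is FALSE
```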

As suggested, keep in mind that | is vectorized (it returns one value per element), while any() collapses its whole input to a single value.

To use any() the same way, you must process the data row by row, so that you call the equivalent of any(current_row). This can be done with purrr::pmap or dplyr::rowwise.

See the code below for a comparison of all methods:

df %>%
    mutate(row_OR = x | y,
           row_pmap_any = pmap_lgl(select(., c(x, y)), any)) %>%
    rowwise() %>%
    mutate(row_rowwise_any = any(c_across(c(x, y))))

# A tibble: 12 x 7
# Rowwise: 
   x     y     allF  allT  row_OR row_pmap_any row_rowwise_any
   <lgl> <lgl> <lgl> <lgl> <lgl>  <lgl>        <lgl>          
 1 TRUE  TRUE  FALSE TRUE  TRUE   TRUE         TRUE           
 2 FALSE FALSE FALSE TRUE  FALSE  FALSE        FALSE          
 3 TRUE  TRUE  FALSE TRUE  TRUE   TRUE         TRUE           
 4 TRUE  FALSE FALSE TRUE  TRUE   TRUE         TRUE           
 5 FALSE TRUE  FALSE TRUE  TRUE   TRUE         TRUE           
 6 TRUE  FALSE FALSE TRUE  TRUE   TRUE         TRUE           
 7 TRUE  TRUE  FALSE TRUE  TRUE   TRUE         TRUE           
 8 FALSE FALSE FALSE TRUE  FALSE  FALSE        FALSE          
 9 TRUE  TRUE  FALSE TRUE  TRUE   TRUE         TRUE           
10 TRUE  FALSE FALSE TRUE  TRUE   TRUE         TRUE           
11 FALSE TRUE  FALSE TRUE  TRUE   TRUE         TRUE           
12 TRUE  FALSE FALSE TRUE  TRUE   TRUE         TRUE 

All methods work, and I did not find much difference in performance.

GuedesBF