
The list of levels I want to keep ("Alberta", "British Columbia", "Ontario", "Quebec") is much shorter than the list of levels I want to collapse (everything else). I haven't been able to tell fct_collapse to collapse "all but the following" by negating the levels; the code below shows what I'm aiming for, but it doesn't work. Any suggestions?

    df$`Province group` %<>% fct_collapse(df$Province, `Smaller provinces` = !c("Alberta", "British Columbia", "Ontario", "Quebec"))

ibm

3 Answers


I'm a bit confused by some of the syntax you're using here, but this solution should work for you! It uses dplyr's pipe, and underscores instead of spaces in variable names (i.e. variable_name rather than `variable name`).

    library(dplyr)
    library(forcats)

    #What I imagine your df$Province variable looks like
    df <- tibble(Province = rep(c("Ontario", "Alberta", "Quebec", "British Columbia", "PEI", "Manitoba", "Nova Scotia"), 10))

    #Define your big provinces in this vector
    big_provinces <- c("Ontario", "Alberta", "Quebec", "British Columbia")

    #Modify the dataset (i.e. do the fct_collapse)
    df %>%
      mutate(Province_group =  fct_collapse(
                 Province, #For the variable "Province"
                 "Smaller provinces" = unique(Province[!(Province %in% big_provinces)]) #"Smaller provinces" is any province not in the vector big_provinces.
                 ) #end of fct_collapse
             ) #mutate

If "Province" is a factor variable, you'll need to convert it to a character variable first.
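As a minimal sketch of that conversion, assuming Province is stored as a factor: wrapping the subset in as.character() passes plain strings to fct_collapse. The data set is the same made-up sample as above, and nothing here is confirmed against the asker's real data.

```r
library(dplyr)
library(forcats)

# Made-up sample data, this time with Province stored as a factor
df <- tibble(Province = factor(rep(c("Ontario", "Alberta", "Quebec", "British Columbia",
                                     "PEI", "Manitoba", "Nova Scotia"), 10)))

big_provinces <- c("Ontario", "Alberta", "Quebec", "British Columbia")

df <- df %>%
  mutate(Province_group = fct_collapse(
    Province,
    # as.character() turns the factor values into plain strings before unique()
    "Smaller provinces" = unique(as.character(Province[!(Province %in% big_provinces)]))
  ))

count(df, Province_group)
```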

P.S. Hello from Quebec

R me matey

fct_lump turned out to be the best solution for this problem, but only because the four provinces I wanted to keep are also the four with the largest n. If anyone finds a shorter solution than Rui Barradas's, I'd still be interested for future factor work.

    df %>%
      mutate(`Compared to smaller provinces` = fct_lump(Province, n = 4)) %>%
      count(`Compared to smaller provinces`)

This produces 5 groups, where "Other" (fct_lump's default label) holds all the remaining provinces with smaller n.
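Because fct_lump decides by frequency, it only matches the question when the provinces to keep happen to be the most frequent ones. For the general "all but the following" case, forcats also provides fct_other, which keeps a named set of levels and relabels everything else. A minimal sketch, using made-up sample data rather than the real data set:

```r
library(forcats)

# Made-up sample data, mirroring the example used in the other answers
province <- factor(rep(c("Ontario", "Alberta", "Quebec", "British Columbia",
                         "PEI", "Manitoba", "Nova Scotia"), 10))

big_provinces <- c("Alberta", "British Columbia", "Ontario", "Quebec")

# Keep the four named levels; every other level is relabelled "Smaller provinces"
province_group <- fct_other(province, keep = big_provinces,
                            other_level = "Smaller provinces")

table(province_group)
```

Inside a pipeline, the same call can go in mutate(), with the column name in place of the standalone vector.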

ibm

Here is a solution that uses levels() to get the factor's levels. The levels to collapse are then selected by negating %in% against the levels to keep.

First, recreate the data set from @R me matey's answer, this time as a factor.

    library(magrittr)
    library(dplyr)
    library(forcats)

    df <- tibble(Province = rep(c("Ontario", "Alberta", "Quebec", "British Columbia", "PEI", "Manitoba", "Nova Scotia"), 10))
    df$Province <- factor(df$Province)

Now the question.

    big_provinces <- c("Alberta", "British Columbia", "Ontario", "Quebec")

    df %<>%
      mutate(Province = fct_collapse(Province, `Smaller provinces` = levels(Province)[!levels(Province) %in% big_provinces]))

    df
    ## A tibble: 70 x 1
    #   Province         
    #   <fct>            
    # 1 Ontario          
    # 2 Alberta          
    # 3 Quebec           
    # 4 British Columbia 
    # 5 Smaller provinces
    # 6 Smaller provinces
    # 7 Smaller provinces
    # 8 Ontario          
    # 9 Alberta          
    #10 Quebec           
    ## ... with 60 more rows
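
The negated %in% subset can also be written with base R's setdiff(), which some may find easier to read. This is only a sketch on the same made-up data; the name small_provinces is introduced here for illustration.

```r
library(forcats)

# Same made-up sample data, stored as a factor
province <- factor(rep(c("Ontario", "Alberta", "Quebec", "British Columbia",
                         "PEI", "Manitoba", "Nova Scotia"), 10))

big_provinces <- c("Alberta", "British Columbia", "Ontario", "Quebec")

# setdiff() returns the levels that are NOT in big_provinces
small_provinces <- setdiff(levels(province), big_provinces)

# Collapse only those levels into one
province_group <- fct_collapse(province, `Smaller provinces` = small_provinces)

levels(province_group)
```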
Rui Barradas