61

What is the most effective (i.e., efficient and appropriate) way to clean up a factor containing multiple levels that need to be collapsed? That is, how do you combine two or more factor levels into one?

Here's an example where the two levels "Yes" and "Y" should be collapsed to "Yes", and "No" and "N" collapsed to "No":

## Given: 
x <- c("Y", "Y", "Yes", "N", "No", "H")   # The 'H' should be treated as NA

## expectedOutput
[1] Yes  Yes  Yes  No   No   <NA>
Levels: Yes No  # <~~ NOTICE ONLY **TWO** LEVELS

One option is, of course, to clean the strings beforehand using sub and friends.
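A minimal sketch of that string-cleaning route, assuming the only variants are the exact strings "Y" and "N" (the name `cleaned` is illustrative):

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")

# Normalise the strings first, then build the factor.
cleaned <- sub("^Y$", "Yes", x)        # "Y" -> "Yes"; "Yes" is left alone
cleaned <- sub("^N$", "No", cleaned)   # "N" -> "No";  "No" is left alone

# Values not listed in `levels` (here "H") become NA.
x.f <- factor(cleaned, levels = c("Yes", "No"))
x.f
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: Yes No
```

This works, but every new spelling needs its own pattern, which is why a mapping-based approach may scale better.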

Another method is to allow duplicate labels and then drop the duplicates:

## Duplicate levels ==> "Warning: deprecated"
x.f <- factor(x, levels=c("Y", "Yes", "No", "N"), labels=c("Yes", "Yes", "No", "No"))

## the above line can be wrapped in either of the next two lines
factor(x.f)      
droplevels(x.f) 

However, is there a more effective way?


While I know that the levels and labels arguments should be vectors, I experimented with lists, named lists, and named vectors to see what happens. Needless to say, none of the following got me any closer to my goal.

  factor(x, levels=list(c("Yes", "Y"), c("No", "N")), labels=c("Yes", "No"))
  factor(x, levels=c("Yes", "No"), labels=list(c("Yes", "Y"), c("No", "N")))

  factor(x, levels=c("Y", "Yes", "No", "N"), labels=c(Y="Yes", Yes="Yes", No="No", N="No"))
  factor(x, levels=c("Y", "Yes", "No", "N"), labels=c(Yes="Y", Yes="Yes", No="No", No="N"))
  factor(x, levels=c("Yes", "No"), labels=c(Y="Yes", Yes="Yes", No="No", N="No"))
Will Ness
Ricardo Saporta
  • 1
    Haven't tested this yet, but the R 3.5.0 (2018-04-23) release notes say "factor(x, levels, labels) now allows duplicated labels (not duplicated levels!). Hence you can map different values of x to the same level directly." – Aaron left Stack Overflow Apr 26 '18 at 04:51

10 Answers

83

UPDATE 2: See Uwe's answer which shows the new "tidyverse" way of doing this, which is quickly becoming the standard.

UPDATE 1: Duplicated labels (but not levels!) are now indeed allowed (per my comment above); see Tim's answer.

ORIGINAL ANSWER, BUT STILL USEFUL AND OF INTEREST: There is a little-known option to pass a named list to the levels function, for exactly this purpose. The names of the list should be the desired names of the levels, and the elements should be the current level names that are to be renamed. Some (including the OP, see Ricardo's comment on Tim's answer) prefer this for ease of reading.

x <- c("Y", "Y", "Yes", "N", "No", "H", NA)
x <- factor(x)
levels(x) <- list("Yes"=c("Y", "Yes"), "No"=c("N", "No"))
x
## [1] Yes  Yes  Yes  No   No   <NA>  <NA>
## Levels: Yes No

This is described in the levels documentation (see also the examples there):

value: For the 'factor' method, a vector of character strings with length at least the number of levels of 'x', or a named list specifying how to rename the levels.

This can also be done in one line, as Marek does here: https://stackoverflow.com/a/10432263/210673; the levels<- sorcery is explained here https://stackoverflow.com/a/10491881/210673.

> `levels<-`(factor(x), list(Yes=c("Y", "Yes"), No=c("N", "No")))
[1] Yes  Yes  Yes  No   No   <NA>
Levels: Yes No
Aaron left Stack Overflow
  • +1 more robust and I would imagine a lot safer than my attempt. – Simon O'Hanlon Oct 16 '13 at 17:48
  • Thanks Aaron, I like this approach in that it at least avoids the warnings associated with `droplevels(factor(x, ...))`, but I remain curious about any more direct methods (e.g., if it were possible to use `levels=` right in the `factor(.)` call). – Ricardo Saporta Oct 16 '13 at 18:20
  • 2
    Agree that it's odd this can't be done within `factor`; I don't know of a more direct way, except for using something like Ananda's solution or perhaps something with match. – Aaron left Stack Overflow Oct 16 '13 at 19:06
  • 1
    This also works for `ordered` and the collapsed levels are ordered as they are supplied, for example `a = ordered(c(1, 2, 3)); levels(a) = list("3" = 3, "1,2" = c(1, 2))` yields the ordering `Levels: 3 < 1,2`. – asnr Nov 17 '15 at 01:27
27

As the question is titled Cleaning up factor levels (collapsing multiple levels/labels), the forcats package should be mentioned here as well, for the sake of completeness. forcats appeared on CRAN in August 2016.

There are several convenience functions available for cleaning up factor levels:

x <- c("Y", "Y", "Yes", "N", "No", "H") 

library(forcats)

Collapse factor levels into manually defined groups

fct_collapse(x, Yes = c("Y", "Yes"), No = c("N", "No"), NULL = "H")
#[1] Yes  Yes  Yes  No   No   <NA>
#Levels: No Yes

Change factor levels by hand

fct_recode(x, Yes = "Y", Yes = "Yes", No = "N", No = "No", NULL = "H")
#[1] Yes  Yes  Yes  No   No   <NA>
#Levels: No Yes

Automatically relabel factor levels, collapse as necessary

fun <- function(z) {
  z[z == "Y"] <- "Yes"
  z[z == "N"] <- "No"
  z[!(z %in% c("Yes", "No"))] <- NA
  z
}
fct_relabel(factor(x), fun)
#[1] Yes  Yes  Yes  No   No   <NA>
#Levels: No Yes

Note that fct_relabel() works with factor levels, so it expects a factor as its first argument. The other two functions, fct_collapse() and fct_recode(), also accept a character vector, which is an undocumented feature.

Reorder factor levels by first appearance

The expected output given by the OP is

[1] Yes  Yes  Yes  No   No   <NA>
Levels: Yes No

Here the levels are ordered as they appear in x which is different from the default (?factor: The levels of a factor are by default sorted).

To match the expected output, apply fct_inorder() before collapsing the levels:

fct_collapse(fct_inorder(x), Yes = c("Y", "Yes"), No = c("N", "No"), NULL = "H")
fct_recode(fct_inorder(x), Yes = "Y", Yes = "Yes", No = "N", No = "No", NULL = "H")

Both now return the expected output, with the levels in the same order.

Uwe
8

Perhaps a named vector as a key might be of use:

> factor(unname(c(Y = "Yes", Yes = "Yes", N = "No", No = "No", H = NA)[x]))
[1] Yes  Yes  Yes  No   No   <NA>
Levels: No Yes

This looks very similar to your last attempt... but this one works :-)

A5C1D2H2I1M1N2O1R2T1
  • Thanks Ananda. This is a great idea, and for my applications I can probably do away with `unname` ... this just might take the cake – Ricardo Saporta Oct 16 '13 at 21:34
  • Revisiting years later... this will drop levels that do not show up, which might not be desirable, e.g., with `x="N"` only the "No" level will show up in the result. – Frank Apr 12 '17 at 20:09
  • @Frank, isn't this easily resolved by adding explicit `levels` to the `factor` step? – A5C1D2H2I1M1N2O1R2T1 Apr 13 '17 at 03:22
  • @Frank, this question also led to `Factor` from "SOfun" :-) – A5C1D2H2I1M1N2O1R2T1 Apr 13 '17 at 03:28
  • 1
    Ah cool stuff :) Yeah, adding explicit levels works, though you'd have to type the list a second time, save the list somewhere or do some pipery or functioning like `c(Y = "Yes", Yes = "Yes", N = "No", No = "No", H = NA) %>% { factor(unname(.[x]), levels = unique(.)) }` eh. – Frank Apr 13 '17 at 04:44
  • 1
    @Frank Even more cool stuff, with the added benefit that it orders the levels as in the expected output: `Yes`, `No`. – Uwe Apr 13 '17 at 06:57
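The explicit-levels variant discussed in these comments can also be written without a pipe. A minimal sketch in base R, where `key`, `lv` and `res` are illustrative names:

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")

# `key` is an illustrative name: old value -> new label (NA drops the value)
key <- c(Y = "Yes", Yes = "Yes", N = "No", No = "No", H = NA)

# Explicit levels keep both "Yes" and "No" even if one is absent from x;
# the levels follow the order of first appearance in `key`.
lv <- unique(key[!is.na(key)])
res <- factor(unname(key[x]), levels = lv)
res
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: Yes No
```

With `x <- "N"` the result would still carry both levels, which addresses the dropped-level concern raised above.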
5

Another way is to make a table containing the mapping:

# stacking the list from Aaron's answer
fmap = stack(list(Yes = c("Y", "Yes"), No = c("N", "No")))

fmap$ind[ match(x, fmap$values) ]
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: No Yes

# or...

library(data.table)
setDT(fmap)[x, on=.(values), ind ]
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: No Yes

I prefer this way, since it leaves behind an easily inspected object summarizing the map; and the data.table code looks just like any other join in that syntax.
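One thing such a mapping table makes easy is checking which input values it does not cover. A small base R sketch (the name `unmapped` is illustrative):

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")
fmap <- stack(list(Yes = c("Y", "Yes"), No = c("N", "No")))

# Values of x with no row in the mapping; these become NA after the lookup
unmapped <- setdiff(unique(x), fmap$values)
unmapped
# [1] "H"
```

Running a check like this before the lookup can help catch typos in the map before they silently turn into NA.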


Of course, if you don't want an object like fmap summarizing the change, it can be a "one-liner":

library(data.table)
setDT(stack(list(Yes = c("Y", "Yes"), No = c("N", "No"))))[x, on=.(values), ind ]
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: No Yes
Frank
4

Since R 3.5.0 (2018-04-23) you can do this in one clear and simple line:

x = c("Y", "Y", "Yes", "N", "No", "H") # The 'H' should be treated as NA

tmp = factor(x, levels= c("Y", "Yes", "N", "No"), labels= c("Yes", "Yes", "No", "No"))
tmp
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: Yes No

One line, maps multiple values to the same level, and sets NA for missing levels (h/t @Aaron).
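To avoid typing the two parallel vectors by hand, they can be derived from a named list like the one in Aaron's answer. This is only a sketch; `map` and `res` are illustrative names, and it relies on duplicated labels being allowed (R >= 3.5.0):

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")

# `map` is an illustrative name: new label = old values
map <- list(Yes = c("Y", "Yes"), No = c("N", "No"))

# Duplicated labels require R >= 3.5.0
res <- factor(x,
              levels = unlist(map, use.names = FALSE),    # "Y" "Yes" "N" "No"
              labels = rep(names(map), lengths(map)))     # "Yes" "Yes" "No" "No"
res
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: Yes No
```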

tim
3

I add this answer to demonstrate the accepted answer working on a specific factor in a dataframe, since this was not initially obvious to me (though it probably should have been).

levels(df$var1)
# "0" "1" "Z"
summary(df$var1)
#    0    1    Z 
# 7012 2507    8 
levels(df$var1) <- list("0"=c("Z", "0"), "1"=c("1"))
levels(df$var1)
# "0" "1"
summary(df$var1)
#    0    1 
# 7020 2507
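For anyone who wants to try this without the original data, here is a self-contained sketch with a toy data frame. The values are made up for illustration and are not the data summarised above:

```r
# Toy data frame standing in for `df`; the values are invented
df <- data.frame(var1 = factor(c("0", "1", "Z", "0", "1")))

# Fold "Z" into "0" and keep "1" as it is
levels(df$var1) <- list("0" = c("Z", "0"), "1" = "1")

levels(df$var1)
# [1] "0" "1"
table(df$var1)
```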
Karl Baker
2

I don't know your real use case, but would strtrim be of any use here?

factor( strtrim( x , 1 ) , levels = c("Y" , "N" ) , labels = c("Yes" , "No" ) )
#[1] Yes  Yes  Yes  No   No   <NA>
#Levels: Yes No
Simon O'Hanlon
2

Similar to @Aaron's approach, but slightly simpler would be:

x <- c("Y", "Y", "Yes", "N", "No", "H")
x <- factor(x)
# levels(x)  
# [1] "H"   "N"   "No"  "Y"   "Yes"
# NB: the offending levels are 1, 2, & 4
levels(x)[c(1,2,4)] <- c(NA, "No", "Yes")
x
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: No Yes
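If hard-coding the positions feels fragile (they depend on how the levels happen to sort), the same idea can be sketched by matching on the level names instead:

```r
x <- factor(c("Y", "Y", "Yes", "N", "No", "H"))

# Match on the level names instead of hard-coding positions 1, 2 and 4.
# Assigning an existing level name merges the two levels.
levels(x)[levels(x) == "Y"] <- "Yes"
levels(x)[levels(x) == "N"] <- "No"
levels(x)[levels(x) == "H"] <- NA     # NA removes the level
x
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: No Yes
```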
gung - Reinstate Monica
2

First let's note that in this specific case we can use partial matching:

x <- c("Y", "Y", "Yes", "N", "No", "H")
y <- c("Yes","No")
x <- factor(y[pmatch(x,y,duplicates.ok = TRUE)])
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: No Yes

In a more general case I'd go with dplyr::recode:

library(dplyr)
x <- c("Y", "Y", "Yes", "N", "No", "H")
y <- c(Y="Yes",N="No")
x <- recode(x,!!!y)
x <- factor(x,y)
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: Yes No

Slightly altered if the starting point is a factor:

x <- factor(c("Y", "Y", "Yes", "N", "No", "H"))
y <- c(Y="Yes",N="No")
x <- recode_factor(x,!!!y)
x <- factor(x,y)
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: Yes No
Moody_Mudskipper
1

You can use the function below to combine or collapse multiple factor levels:

combofactor <- function(pattern_vector,
                        replacement_vector,
                        data) {
  # Replace each level that matches a pattern with its replacement;
  # assigning duplicated level names merges those levels.
  levels <- levels(data)
  for (i in seq_along(pattern_vector)) {
    levels[levels == pattern_vector[i]] <- replacement_vector[i]
  }
  levels(data) <- levels
  data
}

Example:

Initialize x

x <- factor(c(rep("Y",20),rep("N",20),rep("y",20),
rep("yes",20),rep("Yes",20),rep("No",20)))

Check the structure

str(x)
# Factor w/ 6 levels "N","No","y","Y",..: 4 4 4 4 4 4 4 4 4 4 ...

Use the function:

x_new <- combofactor(c("Y","N","y","yes"),c("Yes","No","Yes","Yes"),x)

Recheck the structure:

str(x_new)
# Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
Nikhil