112

There seems to be a difference between levels and labels of a factor in R. Up to now, I always thought that levels were the 'real' name of factor levels, and labels were the names used for output (such as tables and plots). Obviously, this is not the case, as the following example shows:

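# stringsAsFactors=TRUE is needed in R >= 4.0.0, where the default changed to FALSE
df <- data.frame(v=c(1,2,3),f=c('a','b','c'), stringsAsFactors=TRUE)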
str(df)
'data.frame':   3 obs. of  2 variables:
 $ v: num  1 2 3
 $ f: Factor w/ 3 levels "a","b","c": 1 2 3

df$f <- factor(df$f, levels=c('a','b','c'),
  labels=c('Treatment A: XYZ','Treatment B: YZX','Treatment C: ZYX'))
levels(df$f)
[1] "Treatment A: XYZ" "Treatment B: YZX" "Treatment C: ZYX"

I thought that the levels ('a','b','c') could somehow still be accessed when scripting, but this doesn't work:

> df$f=='a'
[1] FALSE FALSE FALSE

But this does:

> df$f=='Treatment A: XYZ' 
[1]  TRUE FALSE FALSE

So, my question consists of two parts:

  • What's the difference between levels and labels?

  • Is it possible to have different names for factor levels for scripting and output?

Background: For longer scripts, scripting with short factor levels seems to be much easier. However, for reports and plots, these short factor levels may not be adequate and should be replaced with more precise names.

divibisan
donodarazao

2 Answers

135

In short: levels are the input and labels are the output of the factor() function. A factor has only a levels attribute, which is set by the labels argument of factor() (or, if labels is omitted, by the levels themselves). This is different from the concept of labels in statistical packages like SPSS and can be confusing at first.

What you do in this line of code

df$f <- factor(df$f, levels=c('a','b','c'),
  labels=c('Treatment A: XYZ','Treatment B: YZX','Treatment C: ZYX'))

is telling R that there is a vector df$f

  • which you want to transform into a factor,
  • in which the different levels are coded as a, b, and c
  • and for which you want the levels to be labeled as Treatment A etc.

The factor function will look for the values a, b and c, convert them to internal integer codes, and store the label values in the levels attribute of the factor. This attribute is used to convert the internal integer codes back to the correct labels. But as you can see, there is no labels attribute.

> df <- data.frame(v=c(1,2,3),f=c('a','b','c'), stringsAsFactors=TRUE)
> attributes(df$f)
$levels
[1] "a" "b" "c"

$class
[1] "factor"

> df$f <- factor(df$f, levels=c('a','b','c'),
+   labels=c('Treatment A: XYZ','Treatment B: YZX','Treatment C: ZYX'))    
> attributes(df$f)
$levels
[1] "Treatment A: XYZ" "Treatment B: YZX" "Treatment C: ZYX"

$class
[1] "factor"
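
To make the mapping between integer codes and the levels attribute concrete, here is a minimal sketch. It builds the factor directly from a character vector (an assumption made so the example does not depend on the stringsAsFactors default), and the names `codes` and `lev` are only illustrative.

```r
# Build the factor directly from a character vector
f <- factor(c("a", "b", "c"),
            levels = c("a", "b", "c"),
            labels = c("Treatment A: XYZ", "Treatment B: YZX", "Treatment C: ZYX"))

codes <- as.integer(f)   # internal integer codes
lev   <- levels(f)       # the only names the factor keeps

# Indexing the levels by the codes reproduces the values that get printed
identical(lev[codes], as.character(f))

# Comparisons are made against the levels attribute, so the original
# input value "a" no longer matches anything
any(f == "a")
f == "Treatment A: XYZ"
```

This is why df$f=='a' returns all FALSE in the question: after relabeling, "a" is no longer stored anywhere in the factor.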
Joris Meys
  • Thanks for the fast answer! I guess I understand the purpose of levels and labels now. Any suggestions for making output more human-readable without manually editing table names and plot legends? – donodarazao May 03 '11 at 13:02
  • I would often transform the levels right before plotting/creating labels, e.g. keep the levels as "a","b","c" while manipulating, then use levels(f) – Ben Bolker May 03 '11 at 14:13
  • I thought about both, but both methods have disadvantages. The first might get tedious when plotting a huge number of graphs, and the second might get tedious when a lot of data aggregation is involved in scripting. But apparently there's no way to avoid that easily, so I'll go with your suggestions. :) – donodarazao May 04 '11 at 07:22
  • @42- I'm not sure what you mean with "numeric values". If you mean the internal values in the factor then that's exactly what I said above. Hence the mentioning of *internal* numerical values. If you specify the `levels` argument, you give the values in the input that have to be matched to the `labels` argument. R keeps the labels (as the attribute `levels`, and there's the confusion) and stores integer codes internally. These integer codes have nothing to do with the original values, whatever type they were. I think you misunderstood me. – Joris Meys Jan 03 '16 at 12:56
  • Apologies. What you write was my understanding as well, and now that I am re-reading your question, I cannot see where I thought you said differently. I'll delete my comment because it adds less than nothing. – IRTFM Jan 03 '16 at 17:53
  • Maybe it's good to mention explicitly that if you want to access the factor values, this is always done by the levels (optionally set by the 'labels' argument)? This would clarify the observation of the OP that df$f=='a' does not work when the levels are modified? – Lennert Dec 08 '16 at 12:50
  • You say 'A factor has only a level attribute, which is set by the labels argument in the factor() function'. But (please correct me if I'm wrong) that's just one way; after the factor has been created the level attribute can also be reset afterwards with `levels – sindri_baldur May 02 '18 at 10:38
  • @snoram you can do it afterwards too. The `levels` argument in `factor()` indicates the _input_ levels (i.e. the unique values to look for in the original vector), whereas `labels` gives the _output_ levels (i.e. the labels attached to the internal numeric interpretation). – Joris Meys May 02 '18 at 14:06
  • Thanks. I guess this comment section has gone out of control... But since *input* levels defaults to all values found in vector in an increasing order one would specify them *only* to change the order or convert some (excluded value) to NA OR define values that might appear later. Or am I missing some other "benefit". – sindri_baldur May 02 '18 at 14:29
  • @snoram ignore values you don't want for example. – Joris Meys May 03 '18 at 08:12
  • Just a reminder that you can see the internal codes used, with `as.numeric(df$f)` – John Jan 25 '19 at 01:41
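
The two ideas from the comments above (script against short levels and relabel only for output; use the `levels` argument to choose which input values to keep) might be sketched like this. The lookup vector `pretty` and the variables `f_out` and `g` are hypothetical names introduced for illustration, not something from the thread.

```r
f <- factor(c("a", "b", "c", "a"))

# Script against the short levels ...
n_a <- sum(f == "a")

# ... and relabel a copy only when producing output.
# `pretty` is a hypothetical lookup vector, named by the short levels.
pretty <- c(a = "Treatment A: XYZ",
            b = "Treatment B: YZX",
            c = "Treatment C: ZYX")
f_out <- f
levels(f_out) <- unname(pretty[levels(f_out)])
table(f_out)

# The `levels` argument also sets the order, and input values that are
# not listed are turned into NA
g <- factor(c("a", "b", "c"), levels = c("c", "a"))
levels(g)
is.na(g)
```

Keeping the relabeling in one lookup vector means the long names are written once, which may ease the tedium mentioned above when many plots are produced.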
19

I wrote a package "lfactors" that allows you to refer to either levels or labels.

# packages
install.packages("lfactors")
require(lfactors)

flips <- lfactor(c(0,1,1,0,0,1), levels=0:1, labels=c("Tails", "Heads"))
# Tails can now be referred to as "Tails" or 0
# These two lines return the same result
flips == "Tails"
#[1]  TRUE FALSE FALSE  TRUE  TRUE FALSE
flips == 0 
#[1]  TRUE FALSE FALSE  TRUE  TRUE FALSE

Note that an lfactor requires that the levels be numeric so that they cannot be confused with the labels.
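
For comparison, here is a rough sketch of how the same data behaves with a plain base R factor. It does not use lfactors at all, and `flips_raw` is a hypothetical name for the untouched numeric input.

```r
flips_raw <- c(0, 1, 1, 0, 0, 1)
flips <- factor(flips_raw, levels = 0:1, labels = c("Tails", "Heads"))

# A base factor only knows its labels ...
flips == "Tails"

# ... so comparing against the original code matches nothing
any(flips == 0)

# In base R the numeric codes have to be kept alongside the factor
flips_raw == 0
```

With a plain factor, the comparison against 0 silently returns FALSE everywhere, which is the gap the lfactor class is meant to close.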

pdb
  • This is a nice package and thanks for posting about it (and writing it). It seems the sort of functionality that should be native to R factors -- nice to see a package that provides this sort of name-value pair mapping with built-in equivalency checks. – Soren Aug 25 '18 at 13:33
  • d'oh! I was excited about using lfactors until I noticed that it "requires that the levels be numeric." Figures that require publication-style labels (Greek letters, italics, superscripts and all) are a good use case for a system of factors that could still include text levels (the latter could help minimize errors by making data tables more readable). – curious lab rat Nov 13 '20 at 05:52
  • curious lab rat, levels are numeric and labels are text. Can you come up with a code example where that is an issue? – pdb Nov 14 '20 at 14:37
  • This should totally be included in base or ggplot. – Herman Toothrot Feb 04 '21 at 16:37