16

Hadley Wickham's haven package, applied to a Stata file, returns a tibble with many columns of type "labeled". You can see these with str(), e.g.:

$ MSACMSZ    :Class 'labelled'  atomic [1:8491861] NA NA NA NA NA NA NA NA NA NA ...
  .. ..- attr(*, "label")= chr "metropolitan area size (cmsa/msa)"
  .. ..- attr(*, "labels")= Named int [1:7] 0 1 2 3 4 5 6
  .. .. ..- attr(*, "names")= chr [1:7] "not identified or nonmetropolitan" "100,000 - 249,999" "250,000 - 499,999" "500,000 - 999,999" ...

It would be nice if I could simply extract all these labeled vectors to factors, but I have compared the length of the labels attribute to the number of unique values in each vector, and it is sometimes longer and sometimes shorter. So I think I need to look at all of them and decide how to handle each one individually.

So I would like to extract the values of the labels attribute to a list. However, this function:

labels93 <- lapply(cps_00093.df, function(x){attr(X, which="labels", exact=TRUE)})

returns NULL for all variables.

Is this a tibble vs data frame problem? How do I extract these attributes from the tibble columns into a list?

Note that the labels vector is named, and I need both the labels and the names.

As per @Hack-R's request here is a tiny snippet of my data as converted by dput (which I had never used before). I applied this code:

filter(cps_00093.df, YEAR==2015) %>%
  sample_n(10)  %>%
  select(HHTENURE, HHINTYPE) -> tiny
dput(tiny, file = "tiny")

to produce the file tiny. Hey! That was easy! I thought it would be hard to break off a piece this small.

Opening tiny with Notepad++, this is what I found:

structure(list(HHTENURE = structure(c(2L, 1L, 1L, 2L, 1L, 1L, 
1L, 2L, 1L, 1L), labels = structure(c(0L, 1L, 2L, 3L, 6L, 7L), .Names = c("niu", 
"owned or being bought", "rented for cash", "occupied without payment of cash rent", 
"refused", "don't know")), class = "labelled"), HHINTYPE = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), labels = structure(1:3, .Names = c("interview", 
"type a non-interview", "type b/c non-interview")), class = "labelled")), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"), .Names = c("HHTENURE", 
"HHINTYPE"))

I suspect this could be made more readable with a little spacing, but I did not want to muck with it for fear of accidentally destroying relevant information.

zx8754
  • 42,109
  • 10
  • 93
  • 154
andrewH
  • 1,979
  • 1
  • 13
  • 26
  • Could you `dput()` the minimum amount of data necessary for a reproducible example that encapsulates the problem? – Hack-R Sep 24 '16 at 01:28
  • 2
    See `?haven::labelled`; they have their own `as_factor` method. – alistaire Sep 24 '16 at 02:46
  • Thanks for the tip, @alistaire! Unfortunately the relative lengths of the number of unique values in the data part of each column and the labels vector are all over the map. For most columns is neither just a one-to-one match nor just a couple of missing codes. So I am not ready to trust a one-size-fits-all conversion until I have looked more closely at the data and figured out what is going on. – andrewH Sep 24 '16 at 03:37
  • 1
    `tiny %>% mutate_all(haven::as_factor)` looks pretty good to me... – alistaire Sep 24 '16 at 05:52
  • @alistaire Well, yes. But that is because I used the simplest two variables I could quickly locate, out of the ~460 actual variable and 8.5 million observations in my real data set. Still, you may be right. I hope you are; I should soon know. and thanks for the code snippet: If I conclude it will work, I'll use it. – andrewH Sep 25 '16 at 00:23
  • I think the only time the difference between Stata labelled vectors and R factors matters is for ordered factors where the numbers in which they're stored actually correspond to something. A very save alternative would be to just add the factor columns as additional variables so you can compare, but that's probably only practical for a subset at a time. Also, you should use `mutate_if` instead of `mutate_all` if any of your columns are of other types. – alistaire Sep 25 '16 at 00:39

3 Answers3

10

The original question asks how 'to extract the values of the labels attribute to a list.' A solution to the main question follows (assuming some_df is imported via haven and has label attributes). Update: I've now added a way to extract a label vector with the package sjlabelled.

library(purrr)
n <- ncol(some_df)
labels_list <- map(1:n, function(x) attr(some_df[[x]], "label") )

# if a vector of character strings is preferable
labels_vector <- map_chr(1:n, function(x) attr(some_df[[x]], "label") )

# to make a simple codebook
library(kable)
variable_name <- names(some_df)
data.frame(variable_name, description = labels_vector) %>%
  kable(format = 'markdown')

# UPDATE: another approach with package sjlabelled
library(sjlabelled)
sjlabelled::get_label(some_df)
Omar Wasow
  • 1,111
  • 16
  • 14
  • IMHO, this should be the accepted answer as it is much more elegant, concise and utilizes purrr. Sweet. – Adi Sarid May 24 '18 at 11:25
2

I'm going to take a go at answering this one, though my code isn't very pretty.

First I make a function to extract a named attribute from a single column.

ColAttr <- function(x, attrC, ifIsNull) {
# Returns column attribute named in attrC, if present, else isNullC.
  atr <- attr(x, attrC, exact = TRUE)
  atr <- if (is.null(atr)) {ifIsNull} else {atr}
  atr
}

Then a function to lapply it to all the columns:

AtribLst <- function(df, attrC, isNullC){
# Returns list of values of the col attribute attrC, if present, else isNullC
  lapply(df, ColAttr, attrC=attrC, ifIsNull=isNullC)
}

Finally I run it for each attribute.

stub93 <- AtribLst(cps_00093.df, attrC="label", isNullC=NA)

labels93 <- AtribLst(cps_00093.df, attrC="labels", isNullC=NA)
labels93 <- labels93[!is.na(labels93)]

All the columns have a "label" attribute, but only some are of type "labeled" and so have a "labels" attribute. The labels attribute is named, where the labels match values of the data and the names tell you what those values signify.

andrewH
  • 1,979
  • 1
  • 13
  • 26
0

Jumping off @omar-waslow answer above, but adding the use of attr_getter.

If the data (some_df) is imported using read_dta in the haven package, then each column in the tibble has an attr called "label". So we split up the dataframe, going column by column. This creates a two column dataframe which can be joined back (after pivot_longer, for example).

library(tidyverse)
label_lookup_map <- tibble(
   col_name = some_df %>% names(),
   labels = some_df %>% map_chr(attr_getter("label"))
)
jameshowison
  • 113
  • 7