Convenient way to access variables label after importing Stata data with haven

Question

In R, some packages (e.g. haven) attach a label attribute to variables, which gives the substantive name of the variable. For example, gdppc may have the label GDP per capita.

This is extremely useful, especially when importing data from Stata. However, I still struggle to know how to use this in my workflow.

How can I quickly browse the variables together with their labels? Right now I have to call attributes(df$var), which is hardly convenient for getting an overview (à la names(df)).
How can I use these labels in plots? Again, I can use attr(df$var, "label") to access the label string, but that seems cumbersome.

Is there any official way to use these labels in a workflow? I can certainly write a custom function that wraps around the attr, but it may break in the future when packages implement the label attribute differently. Thus, ideally I'd want an official way supported by haven (or other major packages).

Various package authors have implemented it differently. The _de_facto_ R standard would be the `read.dta` function in pkg:foreign. `haven` is a relatively recent package and at the moment it doesn't seem to have documented plans for labels. — IRTFM, Jan 15 '16 at 18:58
@42- `read_dta` in `haven` does have labels. In contrast, `foreign::read.dta` actually doesn't. Also, the `foreign` package does not work with Stata 13, let alone 14. — Heisenberg, Jan 15 '16 at 18:59
Your question has no example. The help page for `read.dta` says the value will be: `"A data frame with attributes. These will include "datalabel", "time.stamp", "formats", "types", "val.labels", "var.labels" and "version" and may include "label.table" and "expansion.table"`. — IRTFM, Jan 15 '16 at 19:06
You asked about the "official"/idiomatic way of dealing with labels, which is probably best found in `foreign`. Glancing through the `foreign` doc, they suggest the readstata13 package for later versions of Stata. Presumably it also conforms to whatever idiom/norm is found in foreign. — Frank, Jan 15 '16 at 19:08
@42- I stand corrected, `foreign` does have a `var.labels` attribute that is attached to the data frame. This is different from `haven`, but this shows your point that there are different implementations. — Heisenberg, Jan 15 '16 at 19:08
@Heisenberg You might be able to help with this one: https://stackoverflow.com/questions/56787126/convenient-way-to-write-variables-label-to-csv-after-importing-stata-data-wit — Jeremy K., Jun 27 '19 at 08:39

score 16 · Answer 1 · edited Apr 13 '17 at 14:59

A solution using the purrr package from the tidyverse (it assumes every column carries a label):

library(purrr)  # map_chr(); also re-exports %>%
df %>% map_chr(~attributes(.)$label)
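
The one-liner above stops with an error as soon as one column has no label, because map_chr() needs exactly one string per column. A minimal base R sketch that returns NA for unlabelled columns instead; `get_labels()` is a hypothetical helper name (not part of haven or purrr), and the toy data frame stands in for a real haven import:

```r
# Toy data standing in for a haven import; the label is set by hand (assumption).
df <- data.frame(gdppc = c(1200, 3400), country = c("A", "B"))
attr(df$gdppc, "label") <- "GDP per capita"

# Hypothetical helper: one label per column, NA where a column has none.
get_labels <- function(data) {
  vapply(data, function(x) {
    lab <- attr(x, "label", exact = TRUE)
    if (is.null(lab)) NA_character_ else as.character(lab)
  }, character(1))
}

# Named character vector, with one element per column of df.
labels <- get_labels(df)
print(labels)
```

Because the result is a named vector, a single label can be looked up with labels[["gdppc"]].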

answered Apr 13 '17 at 14:46 by Irina · edited Apr 13 '17 at 14:59 by Vadim Kotov

score 8 · Answer 2 · answered Jul 17 '17 at 18:50

Using sapply in a simple function to return a variable list, as in Stata's Variables window:

library(dplyr)  # for tibble()

# Build a name/label lookup table from the "label" attribute of each column
makeVlist <- function(dta) {
  labels <- sapply(dta, function(x) attr(x, "label"))
  tibble(name = names(labels),
         label = labels)
}
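
For the plotting part of the question, the same attribute lookup can supply axis titles. A minimal base R sketch, assuming the label sits in each column's "label" attribute (as in the attr(df$var, "label") call from the question); `label_or_name()` is a hypothetical helper and the toy values are made up:

```r
# Toy data standing in for a haven import (made-up values).
df <- data.frame(gdppc = c(1200, 3400, 5600), lifeexp = c(61, 70, 78))
attr(df$gdppc, "label") <- "GDP per capita"

# Hypothetical helper: return the variable label, or the column name if none.
label_or_name <- function(data, var) {
  lab <- attr(data[[var]], "label", exact = TRUE)
  if (is.null(lab)) var else lab
}

x_title <- label_or_name(df, "gdppc")    # labelled column
y_title <- label_or_name(df, "lifeexp")  # unlabelled column, falls back to its name

# Use the labels as axis titles.
plot(df$gdppc, df$lifeexp, xlab = x_title, ylab = y_title)
```

Falling back to the column name keeps the plot usable when only some variables were labelled in Stata.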

score 4 · Answer 3 · answered Jan 15 '16 at 20:15

This is one of the innovations addressed in rio (full disclosure: I wrote this package). Basically, it provides various ways of importing variable labels, including haven's way of doing things and foreign's. Here's a trivial example:

Start by making a reproducible example:

> library("rio")
> export(iris, "iris.dta")

Import using foreign::read.dta() (via rio::import()):

> str(import("iris.dta", haven = FALSE))
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "datalabel")= chr ""
 - attr(*, "time.stamp")= chr "15 Jan 2016 20:05"
 - attr(*, "formats")= chr  "" "" "" "" ...
 - attr(*, "types")= int  255 255 255 255 253
 - attr(*, "val.labels")= chr  "" "" "" "" ...
 - attr(*, "var.labels")= chr  "" "" "" "" ...
 - attr(*, "version")= int -7
 - attr(*, "label.table")=List of 1
  ..$ Species: Named int  1 2 3
  .. ..- attr(*, "names")= chr  "setosa" "versicolor" "virginica"

Read in using haven::read_dta(), keeping its native variable attributes, which are stored at the variable level rather than the data.frame level:

> str(import("iris.dta", haven = TRUE, column.labels = TRUE))
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     :Class 'labelled'  atomic [1:150] 1 1 1 1 1 1 1 1 1 1 ...
  .. ..- attr(*, "labels")= Named int [1:3] 1 2 3
  .. .. ..- attr(*, "names")= chr [1:3] "setosa" "versicolor" "virginica"

Read in using haven::read_dta() using an alternative that we (the rio developers) have found more convenient:

> str(import("iris.dta", haven = TRUE))
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "var.labels")=List of 5
  ..$ Sepal.Length: NULL
  ..$ Sepal.Width : NULL
  ..$ Petal.Length: NULL
  ..$ Petal.Width : NULL
  ..$ Species     : NULL
 - attr(*, "label.table")=List of 5
  ..$ Sepal.Length: NULL
  ..$ Sepal.Width : NULL
  ..$ Petal.Length: NULL
  ..$ Petal.Width : NULL
  ..$ Species     : Named int  1 2 3
  .. ..- attr(*, "names")= chr  "setosa" "versicolor" "virginica"

By moving the attributes to the level of the data.frame, they're much easier to access using attr(data, "var.labels"), etc., rather than digging through each variable's attributes.

Note: the values of attributes will be NULL because I'm just writing a native R dataset to a local file in order to make this reproducible.
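
To make the data.frame-level lookup concrete, here is a small sketch that mimics the "var.labels" structure shown in the str() output above. The data frame and its labels are built by hand for illustration (an assumption, not output from rio), so it runs without any import step:

```r
# Hand-built stand-in for an imported data set (made-up values and labels).
dat <- data.frame(gdppc = c(1200, 3400), pop = c(5, 9))
attr(dat, "var.labels") <- list(
  gdppc = "GDP per capita",
  pop   = "Population (millions)"
)

# All labels live in one place, so one attr() call retrieves them all.
var_labels <- attr(dat, "var.labels")

# Look up a single variable's label by name.
gdppc_label <- var_labels[["gdppc"]]
print(gdppc_label)
```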

Do you use `rio` in your own workflow? How committed is the team to staying up to date with all the IO packages that you guys wrap around? I'd love to have a centralized package like this (hence I tried `haven`), but worry about future reproducibility. — Heisenberg, Jan 15 '16 at 21:34
@Heisenberg We are very committed to keeping things up to date and I am very sensitive to backwards/forwards compatibility. — Thomas, Jan 15 '16 at 22:28

score 3 · Answer 4 · answered Apr 30 '18 at 10:57

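A simple solution with the labelled package, combined with tidyverse functions:

library(labelled)  # var_label()
library(dplyr)     # as_tibble(), %>%
library(tidyr)     # gather()
descriptions <- var_label(data_raw) %>% 
  as_tibble() %>% 
  gather(key = variable, value = description)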

answered Apr 30 '18 at 10:57 by Henrik

score 1 · Answer 5 · answered Jun 05 '18 at 07:48

The purpose of the labelled package is to provide convenient functions to manipulate variable and value labels as imported with haven.

In addition, the functions lookfor and describe from the questionr package are also useful to display variable and value labels.
