1

I have a large survey that I am working with. The main survey data is stored as a CSV file. The variable and value label files are included in both SAS-format as well as SPSS-format, but stored as TXT files.

I have seen that there are several ways to easily read in data to R when it is saved in SAS/SPSS/STATA native formats (.sas/.sav/.dta). So far, all of the solutions that I've come across using haven, labelled, foreign solutions assume that the data is stored using the respective "other" program's native format. I can't seem to find any advice on how to add labels to CSV data using TXT files stored in some standard SAS/SPSS format.

So let's say the three files I have are named:

  • data.csv
  • sas_var_labels.txt
  • sas_val_labels.txt

data.csv looks something like:

AB001; AB002; AC001
-9; -9; -7
-1; -9; -8
-3; -9; 100
-9; -1; 200
-4; -1; 100

sas_var_labels.txt looks like this:

AB001         =  "A-Section A, category B, question 1"                                                                                                        
AB002         =  "A-Section A, category B, question 2"                                                                                                         
AC001         =  "A-Section A, category C, question 1"  

And finally, sas_val_labels.txt looks like this:

; value AB001                       -9      =    "-9.not applicable"                                                                                                                                                                                   
                                    -8      =    "-8.no response"                                                                                                                                                                                      
                                    -7      =    "-7.unknown"                                                                                                                                                                                         
                                    -1      =    "-1.other duration"                                                                                                                                                                                     
                                    1       =    "1.1 year"                                                                                                                                                                                      
                                    2       =    "2.1 to 3 years"                                                                                                                                                                                           
                                    3       =    "3.4 to 6 years"
                                    4       =    "4.More than 6 years"                                                                                                                                                                                       
; value AB002                       -9      =    "-9.not applicable"                                                                                                                                                                                   
                                    -8      =    "-8.no response"                                                                                                                                                                                      
                                    -7      =    "-7.unknown"                                                                                                                                                                                         
                                    -1      =    "-1.other type"                                                                                                                                                                                     
; value AC001                       -9      =    "-9.not applicable"                                                                                                                                                                                   
                                    -8      =    "-8.no response"                                                                                                                                                                                      
                                    -7      =    "-7.unknowns" 
                                    -5      =    "-5.non-codable"
                                    -1      =    "-1.other category"                                                                                                                                                                                     
                                    100     =    "100.First division"                                                                                                                                                                                  
                                    200     =    "200.Second division"            

My approach so far:

For the main data, I just call:

dat <- read.csv("data.csv", sep=";", stringsAsFactors=FALSE)

For the variable labels, as the number of labels matches the columns in the data exactly, I'm just reading in the TXT file the same way, and then assigning the second column to the data using var_label() from the labelled package:

#first I read in the variable labels into R as a dataframe
var_labs <- read.csv("sas_var_label.txt", sep="=", stringsAsFactors=FALSE, header=FALSE, strip.white=TRUE)  

#next, I assign the second column to the data as it matches exactly (for the moment)  
labelled::var_label(dat) <- var_labs$V2

For the value labels, on the other hand, my approach becomes more complicated as the labels file is not neatly separated by delimiters for all of the columns and row combinations, as you can see in the example provided above.

My first question is: is there an easy way to read in this file so that the structure of the value label TXT file is preserved?

My more general question is: is there a a better approach to handling these label TXT files? I'm sure I'm missing something, so any suggestions would be welcome.

David Nichols
  • 559
  • 3
  • 5
anguyen1210
  • 353
  • 4
  • 15

1 Answers1

1

Coming back to this, I never did find an easy solution to this problem, but I did manage to hack a solution together, which I share here just in case anyone else comes across this issue. From what I can tell, the TXT file exports of the label files from SAS or SPSS will share similar problems. Namely, in order to use these label files with the haven and labelled packages in R (specifically the latter), the TXT files need to be converted into named objects. For the variable label file, this means a named character vector, and for the value label files, this means a named list. I only address the SAS exports here, but the same basic approach works for an SPSS export as well.

importing the variable labels into R

So going back to the toy examples provided in my original question, we consider a TXT export of a variable label file sas_var_labels.txt that looks like:

AB001         =  "A-Section A, category B, question 1"                                                                                                        
AB002         =  "A-Section A, category B, question 2"                                                                                                         
AC001         =  "A-Section A, category C, question 1" 

In this case, the respective columns are evenly spaced, so a simple read_delim call using the space as a delimter works well:

varlist <- readr::read_delim("sas_var_labels.txt", 
                                 delim = " = ", 
                                 col_names = FALSE, 
                                 col_types = cols(X1 = col_character(),
                                                  X2 = col_skip(),
                                                  X3 = col_character(),
                                                  X4 = col_skip()
                                 ),
                                 locale = locale(encoding = "Latin1"), 
                                 trim_ws=TRUE)

varlist <- tibble::deframe(varlist)

In the code above, you can see that we can control directly for some of the blank space in the TXT file that will be parsed into empty columns when we use the space as a delimiter (i.e. the col_skip() arguments). In my example, I add a locale encoding argument to account for data with special characters, and I directly trim off any whitespace with trim_ws. Note that, SPSS exports of the TXT labels have an offset row with column names at the top that botches up this read-in, but you can easily add a skip = 1 to this call if needed.

As a final step, passing the imported varlist file into tibble::deframe() converts it into a named character vector that you can use to label your data with the labelled package. The final output should look something like this:
named character vector

importing the value labels into R

The value label exports from SAS and SPSS are a bit trickier to deal with. In order for the labelled packaged to process them, they need to be converted to named lists. Importantly, the numeric values associated with each value label need to be preserved as numeric class.

Going back to our SAS export example from above, say we have an exported label TXT file sas_val_labels.txt that looks like this:

; value AB001                       -9      =    "-9.not applicable"                                                                                                                                                                                   
                                    -8      =    "-8.no response"                                                                                                                                                                                      
                                    -7      =    "-7.unknown"                                                                                                                                                                                         
                                    -1      =    "-1.other duration"                                                                                                                                                                                     
                                    1       =    "1.1 year"                                                                                                                                                                                      
                                    2       =    "2.1 to 3 years"                                                                                                                                                                                           
                                    3       =    "3.4 to 6 years"
                                    4       =    "4.More than 6 years"                                                                                                                                                                                       
; value AB002                       -9      =    "-9.not applicable"                                                                                                                                                                                   
                                    -8      =    "-8.no response"                                                                                                                                                                                      
                                    -7      =    "-7.unknown"                                                                                                                                                                                         
                                    -1      =    "-1.other type"                                                                                                                                                                                     
; value AC001                       -9      =    "-9.not applicable"                                                                                                                                                                                   
                                    -8      =    "-8.no response"                                                                                                                                                                                      
                                    -7      =    "-7.unknowns" 
                                    -5      =    "-5.non-codable"
                                    -1      =    "-1.other category"                                                                                                                                                                                     
                                    100     =    "100.First division"                                                                                                                                                                                  
                                    200     =    "200.Second division"

First we can break this up into orderly columns by using the fact that the columns seem to obey a fixed-with. Using read_fwf(), and stripping out all of the extraneous text and characters until we arrive at neatly organized dataframe. We can then iterate over that dataframe to pull out the variable names, and then reorganize everything into a big list of values under each name. The following worked for me:

value_df <- readr::read_fwf(valuelab, 
                                    fwf_positions(c(1, 9, 29, 43, 49), 
                                                  c(7, 25, 43, 47, NA), 
                                                  c("junk", "var", "val", "delim", "val_lab")),
                                    col_types = cols(junk = col_skip(),
                                                     var = col_character(),
                                                     val = col_integer(),
                                                     delim = col_skip(),
                                                     val_lab = col_character()),
                                    locale = locale(encoding = "Latin1"),
                                    trim_ws = TRUE)
            
value_df <- value_df %>% tidyr::fill(var, .direction="down")
value_df$val_lab <- value_df$val_lab %>% str_replace_all("\"", "")

#Note: it may or may not be necessary to strip extra chars from variable names,
#depending on how the export was done to the TXT file
#value_df$var <- value_df$var %>% substr(., 1, nchar(.)-1) 

valuelist <- value_df %>% group_split(var) 
names(valuelist) <- value_df %>% group_keys(var) %>% pull
        
valuelist <- sapply(valuelist, function(x) {
            x %>% select(val_lab, val) %>% tibble::deframe

The key points to note here are that the read_fwf takes the argument fwf_positions where you can specify the exact position you want to separate the column (read the read_fwf help file for more details). Scrolling through my TXT file and using some trial and error allowed me to get to the write numbers for this argument.

Because the resulting columns were still somewhat messy, I also named them at the same time while parsing (c("junk", "var", "val", "delim", "val_lab"))) so that I could easily keep track of how I then handle each of the newly parsed columns (col_types = cols(junk = col_skip(), var = col_character(), ...). Naming them directly in this step also helps later on when we need to break everything up into the named list.

The next couple of lines just strip out all of the extraneous characters and ensure that the variable names are filled down correctly ( tidyr::fill(var, .direction="down")).

Finally, I use dplyr::group_split() to break up the dataframe into a list by variable names, assign the list object those names (value_df %>% group_keys(var) %>% pull), and then iterate over each variable in that list to convert the value labels into the correct, named value label-integer pair (sapply(valuelist, function(x) {x %>% select(val_lab, val) %>% tibble::deframe).

The final desired output based on the example given should look like this: named list

Finally, with the two label files imported into R, you can easily use them to label the data with the following:

labelled::var_labels(data) <- varlist

labelled::val_labels(data) <- valuelist

Once the data is labelled, all of the other functionality included in the labelled package also becomes available to you.

I also have a write up on how I used these functions for a specific use case on my blog

anguyen1210
  • 353
  • 4
  • 15