
The haven (1.1.1) package seems to fail whenever the file path contains any kind of special (non-ASCII) character, even if it appears only in the file name.

Assuming this is a real issue, I am looking for some kind of neat hack/solution to get around it.

A (not ideal) example would be to have R copy the file to a friendlier path, give it a "better" filename, and then load the copy with haven. Such as:

setwd("c:/temp")
fn <- "randóóm.sav"
file.copy(paste0("./äglæpath/", fn), fn)
file.rename(fn, gsub("[^-\\./a-zA-Z0-9[:space:]]", "", fn))
# now apply read_sav() to the copy
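
The copy-then-read idea above can be wrapped in a small helper. This is only a sketch: `read_via_ascii_copy` is a hypothetical name, and it assumes that R's temporary directory itself has an ASCII-only path (which may not hold if, for example, the Windows user name contains special characters). The reader is passed in as an argument, so the helper does not depend on haven.

```r
# Hypothetical helper: copy the file to a temporary path whose file name is
# ASCII-only, read the copy with the supplied reader, then remove the copy.
# Assumption: tempdir() itself contains no special characters.
read_via_ascii_copy <- function(path, reader, ...) {
  stopifnot(file.exists(path))
  ext <- tools::file_ext(path)
  tmp <- tempfile(
    pattern = "ascii_copy_",
    fileext = if (nzchar(ext)) paste0(".", ext) else ""
  )
  on.exit(unlink(tmp), add = TRUE)
  if (!file.copy(path, tmp, overwrite = TRUE)) {
    stop("Could not copy ", path, " to ", tmp)
  }
  reader(tmp, ...)
}

# Intended usage (requires haven, so it is left commented out here):
# dat <- read_via_ascii_copy("./äglæpath/randóóm.sav", haven::read_sav)
```

Unlike the snippet above, this leaves the working directory untouched and cleans up the copy afterwards.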

I'm using:

R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
zx8754
sindri_baldur
  • I'm unable to reproduce your issue - I saved file at https://stats.idre.ucla.edu/wp-content/uploads/2016/02/p004.sav as `äglæpath.sav` - `read_sav` reads it without error – CPak May 23 '18 at 00:47
  • @CPak interesting. I tried the same and get `Failed to parse c:/temp/äglæpath.sav: Unable to open file` – sindri_baldur May 23 '18 at 08:19
  • what does `Sys.getlocale()` say? – CJ Yetman May 26 '18 at 10:49
  • @CJYetman LC_COLLATE=Icelandic_Iceland.1252;LC_CTYPE=Icelandic_Iceland.1252;LC_MONETARY=Icelandic_Iceland.1252;LC_NUMERIC=C;LC_TIME=Icelandic_Iceland.1252 – sindri_baldur May 26 '18 at 19:17
  • I downloaded the file from @CPak and renamed it to randóóm.sav. I used read_spss to load it - reads without error. Sys.getlocale() [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252" – DJV May 27 '18 at 17:50
  • maybe try `read_sav(enc2native('./äglæpath/randóóm.sav'))` – CJ Yetman May 28 '18 at 09:36
  • @CJYetman had already tried that. – sindri_baldur May 29 '18 at 08:22
  • @DJV I tried `Sys.setlocale("LC_ALL","English")` but I still get the same error. – sindri_baldur May 29 '18 at 08:23
  • Try setting locale to UTF-8 – CJ Yetman May 29 '18 at 08:35
  • @CJYetman, I wasn't aware encoding was a locale setting in R. Do you know how I can do that? – sindri_baldur May 29 '18 at 09:58
  • You can do `Sys.setlocale("LC_ALL", "en_US.UTF-8")` on macOS, but how the locale is referred to is likely different on Windows. That's based on an assumption that the problem is that the string that you're passing for the filepath gets mangled (i.e. `Failed to parse c:/temp/äglæpath.sav: Unable to open file`). Does `file.exists("randóóm.sav")` return `TRUE`? If so, then it's more likely a problem with `haven` not being able to read the file. The dev version has an `encoding` argument in `read_sav` which you can use to set the encoding of the file (if you can figure out what it is). – CJ Yetman May 29 '18 at 10:25
  • Yes, of course it returns TRUE. Will check out the dev version! – sindri_baldur May 29 '18 at 10:26
  • have you tried the `iconv` function? – user70960 May 29 '18 at 20:05
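
The checks suggested in the comments above (`Sys.getlocale()`, `file.exists()`, `enc2native()`) can be collected into one diagnostic. A minimal sketch using base R only; `describe_path` is a hypothetical name, and the output merely helps to judge whether the path string is mangled before it ever reaches haven:

```r
# Hypothetical diagnostic: report how R sees a path in the current session.
describe_path <- function(path) {
  list(
    locale        = Sys.getlocale("LC_CTYPE"),    # active character-type locale
    declared      = Encoding(path),               # "unknown", "latin1", "UTF-8" or "bytes"
    exists        = file.exists(path),            # can base R find the file?
    exists_native = file.exists(enc2native(path)),
    utf8_bytes    = charToRaw(enc2utf8(path))     # raw bytes after conversion to UTF-8
  )
}

# Intended usage:
# str(describe_path("./äglæpath/randóóm.sav"))
```

If `exists` is `TRUE` while the reader still reports `Unable to open file`, the path is fine from base R's point of view, which points at the reading package rather than at the locale.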

1 Answer


Unfortunately, I have been able to reproduce the problem on Windows 10 with both the standard (release) version of haven and the development version installed via devtools. This appears to be a known bug in haven: #371.

Recommended Workaround:

Move the file to a directory whose path and filename contain no non-ASCII characters (such as umlauts). Thus, your workaround works as stated.

> file.path(dataFilepath, dtaFilename)
[1] "äglæpath/randóóm.dta"

> dtaFilename <- gsub("[^-\\./a-zA-Z0-9[:space:]]", "", dtaFilename)
> bdatFilename <- gsub("[^-\\./a-zA-Z0-9[:space:]]", "", bdatFilename)
> savFilename <- gsub("[^-\\./a-zA-Z0-9[:space:]]", "", savFilename)
> dataFilepath <- gsub("[^-\\./a-zA-Z0-9[:space:]]", "", dataFilepath)

> file.path(dataFilepath, dtaFilename)
[1] "glpath/randm.dta"

> # Stata
> read_dta(dtaDest)
# A tibble: 150 x 5
   sepallength sepalwidth petallength petalwidth species
         <dbl>      <dbl>       <dbl>      <dbl> <chr>  
 1        5.10       3.5         1.40      0.200 setosa 
 2        4.90       3           1.40      0.200 setosa 
 3        4.70       3.20        1.30      0.200 setosa 
 4        4.60       3.10        1.5       0.200 setosa 
 5        5          3.60        1.40      0.200 setosa 
 6        5.40       3.90        1.70      0.400 setosa 
 7        4.60       3.40        1.40      0.300 setosa 
 8        5          3.40        1.5       0.200 setosa 
 9        4.40       2.90        1.40      0.200 setosa 
10        4.90       3.10        1.5       0.100 setosa 
# ... with 140 more rows
> 
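
One caveat with stripping characters as in the `gsub()` calls above: distinct names can collapse into the same sanitized name. The sketch below reuses the same regular expression; `sanitize_path` is a hypothetical helper name and the second path is made up purely for illustration.

```r
# Same pattern as above: keep only -, \, ., /, ASCII letters, digits and whitespace.
sanitize_path <- function(x) gsub("[^-\\./a-zA-Z0-9[:space:]]", "", x)

# The second path is a made-up example to show a collision.
paths <- c("äglæpath/randóóm.dta", "äglæpath/randääm.dta")
clean <- sanitize_path(paths)

# Both entries are now identical, so copying both files would overwrite one of them.
collision <- anyDuplicated(clean) > 0
if (collision) {
  # make.unique() appends ".1", ".2", ... to repeated entries; note that the
  # suffix lands after the file extension, so adapt this if the extension matters.
  clean <- make.unique(clean)
}
```

Checking for duplicates before renaming or copying is cheap and avoids silently losing a file.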

GitHub Bug #371

Read_*() does not work for special characters in file path #371 https://github.com/tidyverse/haven/issues/371

The problem code is in haven/src/DfReader.cpp, lines 594-612, around the df_parse_dta functions (the error below is raised from df_parse_dta_file()).

Code to Reproduce

require(haven)
require(stringi)

dtaURL  <- "https://github.com/tidyverse/haven/blob/master/inst/examples/iris.dta?raw=true"
bdatURL <- "https://github.com/tidyverse/haven/blob/master/inst/examples/iris.sas7bdat?raw=true"
savURL  <- "https://github.com/tidyverse/haven/blob/master/inst/examples/iris.sav?raw=true"

dtaFilename   <- "randóóm.dta"
bdatFilename <- "randóóm.bdata"
savFilename   <- "randóóm.sav"

dataFilepath      <- "äglæpath"

if (!dir.exists(dataFilepath)) {
  dir.create(file.path(dataFilepath), showWarnings = TRUE)
}

dtaDest = file.path(dataFilepath, dtaFilename)
bdatDest = file.path(dataFilepath, bdatFilename )
savDest = file.path(dataFilepath, savFilename )

download.file(dtaURL, destfile = dtaDest, method = "wget", mode = "wb")
download.file(bdatURL, destfile = bdatDest, method = "wget", mode = "wb")
download.file(savURL, destfile = savDest, method = "wget", mode = "wb")


# Stata
read_dta(dtaDest)

# SAS
read_sas(bdatDest)

# SPSS
read_sav(savDest)
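
Because the first failing call stops a script, it can help to wrap each reader in `tryCatch()` so that all three formats are reported in one run. A minimal sketch; `try_readers` is a hypothetical helper, and the haven calls are shown only as commented usage so the block runs without the package:

```r
# Hypothetical helper: call each reader on its path and collect either "ok"
# or the error message, instead of stopping at the first failure.
try_readers <- function(readers, paths) {
  stopifnot(length(readers) == length(paths))
  res <- vapply(seq_along(readers), function(i) {
    tryCatch({
      readers[[i]](paths[[i]])
      "ok"
    }, error = function(e) conditionMessage(e))
  }, character(1))
  stats::setNames(res, names(readers))
}

# Intended usage with the objects defined above (requires haven):
# try_readers(
#   list(stata = haven::read_dta, sas = haven::read_sas, spss = haven::read_sav),
#   c(dtaDest, bdatDest, savDest)
# )
```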

Console Output

> require(haven)
> require(stringi)
> dtaURL  <- "https://github.com/tidyverse/haven/blob/master/inst/examples/iris.dta?raw=true"
> bdatURL <- "https://github.com/tidyverse/haven/blob/master/inst/examples/iris.sas7bdat?raw=true"
> savURL  <- "https://github.com/tidyverse/haven/blob/master/inst/examples/iris.sav?raw=true"
> dtaFilename   <- "randóóm.dta"
> bdatFilename <- "randóóm.bdata"
> savFilename   <- "randóóm.sav"
> dataFilepath      <- "äglæpath"
> if (!dir.exists(dataFilepath)) {
+   dir.create(file.path(dataFilepath), showWarnings = TRUE)
+ }
> dtaDest = file.path(dataFilepath, dtaFilename)
> bdatDest = file.path(dataFilepath, bdatFilename )
> savDest = file.path(dataFilepath, savFilename )
> download.file(dtaURL, destfile = dtaDest, method = "wget", mode = "wb")
--2018-05-29 15:56:59--  https://github.com/tidyverse/haven/blob/master/inst/examples/iris.dta?raw=true
Resolving github.com (github.com)... 192.30.255.113, 192.30.255.112
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/tidyverse/haven/raw/master/inst/examples/iris.dta [following]
--2018-05-29 15:56:59--  https://github.com/tidyverse/haven/raw/master/inst/examples/iris.dta
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/tidyverse/haven/master/inst/examples/iris.dta [following]
--2018-05-29 15:56:59--  https://raw.githubusercontent.com/tidyverse/haven/master/inst/examples/iris.dta
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.52.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.52.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8213 (8.0K) [application/octet-stream]
Saving to: '\344gl\346path/rand\363\363m.dta'

     0K ........                                              100% 1.56M=0.005s

2018-05-29 15:56:59 (1.56 MB/s) - '\344gl\346path/rand\363\363m.dta' saved [8213/8213]

> download.file(bdatURL, destfile = bdatDest, method = "wget", mode = "wb")
--2018-05-29 15:56:59--  https://github.com/tidyverse/haven/blob/master/inst/examples/iris.sas7bdat?raw=true
Resolving github.com (github.com)... 192.30.255.113, 192.30.255.112
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/tidyverse/haven/raw/master/inst/examples/iris.sas7bdat [following]
--2018-05-29 15:56:59--  https://github.com/tidyverse/haven/raw/master/inst/examples/iris.sas7bdat
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/tidyverse/haven/master/inst/examples/iris.sas7bdat [following]
--2018-05-29 15:56:59--  https://raw.githubusercontent.com/tidyverse/haven/master/inst/examples/iris.sas7bdat
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.52.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.52.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 131072 (128K) [application/octet-stream]
Saving to: '\344gl\346path/rand\363\363m.bdata'

     0K .......... .......... .......... .......... .......... 39% 4.05M 0s
    50K .......... .......... .......... .......... .......... 78% 19.7M 0s
   100K .......... .......... ........                        100% 19.3M=0.02s

2018-05-29 15:57:00 (7.83 MB/s) - '\344gl\346path/rand\363\363m.bdata' saved [131072/131072]

> download.file(savURL, destfile = savDest, method = "wget", mode = "wb")
--2018-05-29 15:57:01--  https://github.com/tidyverse/haven/blob/master/inst/examples/iris.sav?raw=true
Resolving github.com (github.com)... 192.30.255.113, 192.30.255.112
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/tidyverse/haven/raw/master/inst/examples/iris.sav [following]
--2018-05-29 15:57:01--  https://github.com/tidyverse/haven/raw/master/inst/examples/iris.sav
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/tidyverse/haven/master/inst/examples/iris.sav [following]
--2018-05-29 15:57:01--  https://raw.githubusercontent.com/tidyverse/haven/master/inst/examples/iris.sav
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.52.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.52.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6690 (6.5K) [application/octet-stream]
Saving to: '\344gl\346path/rand\363\363m.sav'

     0K ......                                                100% 3.09M=0.002s

2018-05-29 15:57:01 (3.09 MB/s) - '\344gl\346path/rand\363\363m.sav' saved [6690/6690]

> # Stata
> read_dta(dtaDest)
Error in df_parse_dta_file(spec, encoding) : 
  Failed to parse <...>/äglæpath/randóóm.dta: Unable to open file.
Technophobe01
  • pretty sure you need `?raw=true` at the end of those URLs to get the file – CJ Yetman May 29 '18 at 23:01
  • can you try this: `devtools::install_github("tidyverse/haven"); library(haven); url <- "https://github.com/tidyverse/haven/blob/master/inst/examples/iris.sav?raw=true"; download.file(url, destfile = "äglæpath/randóóm.sav", method = "wget", mode = "wb"); read_sav("äglæpath/randóóm.sav", encoding = "utf-8")` – CJ Yetman May 29 '18 at 23:11
  • @CJYetman I tried the devtools version and confirmed the bug still exists previously as part of debugging the problem. The output is `> # Stata > read_dta(dtaDest, encoding = "utf-8") Error in df_parse_dta_file(spec, encoding) : Failed to parse .../äglæpath/randóóm.dta: Unable to open file.` – Technophobe01 May 29 '18 at 23:19
  • what do you mean by "confirmed the bug still exists previously"? that the bug exists in the current release version, or that it also exists in the current dev version? – CJ Yetman May 29 '18 at 23:33
  • I confirmed it exists in the current version and the current dev version. That a bug has been filed on both the read and save haven functions. Looking at haven source code I think there is a patch to be applied to the function that addresses this. i.e. Update the function in preparation for a downstream bug fix. – Technophobe01 May 29 '18 at 23:43
  • ok, thanks. It wasn't completely clear from your answer/post that you tried the dev version as well – CJ Yetman May 29 '18 at 23:45
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/172026/discussion-between-technophobe01-and-cj-yetman). – Technophobe01 May 30 '18 at 00:30
  • Nice you confirmed the problem and to be closer to the source of the problem but the workaround is essentially the same as I had suggested in my original post. – sindri_baldur May 30 '18 at 08:11
  • Snoram - I wasn't doing it for the points; I was doing it to help. In general my approach would be to rename the files and place them in a data directory. Separately, I would recommend you go on GitHub and note you have this bug, contact @hadley, and then look at the haven source (I have) to figure out a patch. (My original intent) – Technophobe01 May 30 '18 at 13:55
  • I added a link to this question on the Github issue entry. Given that I have limited time and c++ skills... I guess I will leave it at that for now. Thanks for looking into this! – sindri_baldur May 30 '18 at 14:10