
TL;DR

I have a snippet of text

str <- '"foo\\dar embedded \\\"quote\\\""'
# cat(str, '\n') # gives
# "foo\dar embedded \"quote\""
# i.e. as if the above had been written to a CSV with quoting turned on.

I want to end up with the string:

str <- 'foo\\dar embedded "quote"'
# cat(str, '\n') # gives
# foo\dar embedded "quote"

essentially removing one "layer" of quoting. How may I do this?

(My initial attempt was eval(parse(text=str)), which works unless the string contains a backslash sequence such as \dar that is not a valid R escape, in which case you get the error "\d is an unrecognized escape in character string ...".)

Gory details (optional)

The reason my strings are quoted once-too-many times is I kludged some data processing -- I wrote str (well, a dataframe in my case) to a table with quoting enabled, but forgot that many of the columns in my dataframe had embedded newlines with embedded quotes (i.e. forgot to escape/remove them).

It turns out that when I read.table a file with multiple columns in the same row that have embedded newlines and embedded quotes (or something like that), the function fails (fair enough).

I had since closed my R session so my only access to my data was through my munged CSV. So I wrote some spaghetti code to simply readLines my CSV and split everything up to reconstruct my dataframe again. However, since all my character columns were quoted in the CSV, I have a few columns in my restored dataframe that are still quoted that I want to unquote.

Messy, I know. I'll remember to save an original version of the data next time (save, saveRDS).


For those interested, the header row and three rows of my CSV are shown below (all the characters are ASCII):

"quote";"id";"date";"author";"context"
"< mwk> I tried to fix the bug I mentioned, but I accidentally ascended the character I started for testing... hoped she'd die soon and I could get to coding, but alas I was wrong";"< mwk> I tried to fix the bug I mentioned, but I accidentally ascended the character I started for testing... hoped she'd die soon and I could get to coding, but alas I was wrong";"February 28, 2013";"nhqdb";"nhqdb"
"< intx14> \"A gush of water hits the air elemental on the central core!\"
< intx14> What is this, a weather forecast?";"< intx14> \"A gush of water hits the air elemental on the central core!\"
< intx14> What is this, a weather forecast?";"February 28, 2013";"nhqdb";"nhqdb"
"< bcode> n - a spherical amulet.  You are lucky!  Full moon tonight.
< bcode> That must be a sign - I'll put it on! What could possibly go wrong...
< oracle\devnull> DIED : bcode2 (Wiz-Elf-Mal-Cha) 0 points, killed by strangulation on pcs1.nethack.devnull.net";"< bcode> n - a spherical amulet.  You are lucky!  Full moon tonight.
< bcode> That must be a sign - I'll put it on! What could possibly go wrong...
< oracle\devnull> DIED : bcode2 (Wiz-Elf-Mal-Cha) 0 points, killed by strangulation on pcs1.nethack.devnull.net";"February 28, 2013";"nhqdb";"nhqdb"

The first two columns of each row are the same, being the quote (the first row has no embedded newlines in the quote; the second and third do). Separator is ';'.

> read.table('test.csv', sep=';', header=T)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 1 did not have 5 elements
# same error with allowEscapes=T
mathematical.coffee

2 Answers


Use regular expressions:

str <- gsub('^"|"$', '', gsub('\\\"', '"', str, fixed = TRUE))
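
As a quick check, the call above can be wrapped in a small helper and run against the example string from the question. This is only a sketch: the name unquote_once is made up for illustration, and it assumes (as in the question) that only the embedded double quotes were backslash-escaped, not the backslashes themselves.

```r
# Hypothetical helper (the name is an assumption, not part of the answer):
# strips one layer of quoting from each element of a character vector.
unquote_once <- function(x) {
  # Step 1: turn the two-character sequence \" into a plain " (literal match).
  x <- gsub('\\"', '"', x, fixed = TRUE)
  # Step 2: drop the enclosing quote at the very start and very end, if present.
  gsub('^"|"$', '', x)
}

str <- '"foo\\dar embedded \\\"quote\\\""'
res <- unquote_once(str)
cat(res, '\n')

# Compare against the target string from the question.
identical(res, 'foo\\dar embedded "quote"')
```

Because gsub is vectorised, the same helper could be applied to a whole character column, e.g. df$quote <- unquote_once(df$quote), where df stands in for the restored data frame.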
Robert Krzyzanowski

[EDIT 3: the OP has posted three separate versions of this - two of them irreproducible, interspersed with complaining. Due to this timewasting behavior and several people downvoting, I'm leaving the original answer to version 2 of the question.]

EDIT 1: My solution to the second version of the OP's question was this: txt <- read.csv('escaped.csv', header=T, allowEscapes=T, sep=';')

EDIT 2: We now get a third version. Finally some reproducible code after 36 minutes asking and waiting. Due to the behavior of the OP and other posters I'm not inclined to waste more time on this. I'm going to complain about both of your behavior on MSO. Downvote yourselves silly.

ORIGINAL: gsub is the ugly way.

Use read.csv(..., allowEscapes=TRUE, quote=..., encoding=...) arguments. See the manpage, section on Encoding

If you want actual code, you need to give us a full line or two of your CSV file.

See also SO: "How to detect the right encoding for read.csv?"

Quoting the relevant part of your question:

The reason my strings are quoted once-too-many times is I kludged some data processing -- I wrote str (well, a dataframe in my case) to a table with quoting enabled, but forgot that many of the columns in my dataframe had embedded newlines within quotes (i.e. forgot to escape/remove them).

It turns out that when I read.table a file with multiple columns in the same row that have embedded newlines within quotes, the function fails (fair enough).
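
One possible way to re-read such a file is sketched below, under stated assumptions rather than as a confirmed fix. It assumes the file was written with write.table's default qmethod = "escape", so embedded quotes appear as \" while backslashes in the data are left alone. With a non-default sep, the reader expects embedded quotes to be doubled instead, so the sketch rewrites \" as "" before parsing. The helper name read_escaped_csv and the tiny test file are made up for illustration.

```r
# Hypothetical helper: re-read a ';'-separated file whose embedded quotes
# were written as \" instead of being doubled.
read_escaped_csv <- function(path, sep = ";") {
  raw <- readLines(path, warn = FALSE)
  # Rewrite \" as "" so embedded quotes are in the doubled form.
  # Caveat: a field whose data ends in a literal backslash would be mangled.
  fixed <- gsub('\\"', '""', raw, fixed = TRUE)
  read.table(text = fixed, sep = sep, header = TRUE, quote = '"',
             comment.char = "", stringsAsFactors = FALSE)
}

# Tiny made-up file: one record with an embedded newline, embedded quotes
# and a literal backslash, loosely modelled on the rows in the question.
tmp <- tempfile(fileext = ".csv")
writeLines(c('"quote";"author"',
             '"< a> \\"hi\\"',
             '< b\\c> bye";"nhqdb"'), tmp)

df <- read_escaped_csv(tmp)
str(df)
```

Setting quote = '"' keeps the apostrophes in the data (she'd, I'll) from being treated as quote characters, and comment.char = "" stops a # inside a field from being read as the start of a comment.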

smci
  • encoding is not my issue here. – mathematical.coffee Apr 03 '14 at 02:31
  • Yes it is, although you don't know it. You can fix unwanted escaping. – smci Apr 03 '14 at 02:35
  • Please try with the updated snippet, in which case the `read.csv` solution does not work (I had to find a snippet that reproduced my problem). It appears the issue is with embedded newlines in quoted columns where there are embedded quotes as well within the quoted columns. – mathematical.coffee Apr 03 '14 at 02:52
  • All the earlier comments noting that you only finally gave us a reproducible testcase after 36mins and multiple requests to do so have been deleted. It's important to note this context, and why after lots of patience with the question, I'm not inclined to keep answering the latest revision. I object to the behavior that went on in this question, in the strongest terms. You might reasonably apologize both for wasting my time and complaining nonstop. – smci Apr 03 '14 at 05:15
  • I apologize that I did not provide the appropriate reproducible example in your timeframe, though my original question was specifically "how to unquote a string" and not "how to parse the original CSV properly", hence my original question was reproducible as it stood. (I only updated with the CSV snippet because you wanted to answer the question "how to parse the original CSV", not "how to unquote a string", and I agree I was sloppy with providing the reproducible example there due to not understanding why `read.csv` was failing. Sorry.) I also note that I did not delete your comments. – mathematical.coffee Apr 04 '14 at 23:20
  • But your root-cause was reading the CSV - precisely what I told you within 10 minutes of you posting this: **it breaks escaped Unicode, URL-escapes**. I know this because I solved it in Python two years ago, and I was helpfully offering you that information. "I don't care, give me hackish workaround" is not a gracious response. After all this, your question is *still* misstated, you need to go edit something in line 1 like: "I get bad strings because I wrote&read in a CSV file the wrong way". The offending string did not just create itself - you did! In future I'll just edit bad questions ASAP – smci Apr 04 '14 at 23:40
  • ...and not wait for the user's consent, while it rains downvotes. – smci Apr 04 '14 at 23:42
  • My file does not have any special unicode or URLs (`oracle\devnull` in the example is a literal backslash, not '\d'). I did mention that my strings got that way because I screwed up the CSV. I'm sorry that I even mentioned the CSV, I had only put it in for context (bandaid solution was all I was after). I do thank you for identifying the root cause of my problem. FWIW, I did try to research encodings, but I'm unaware of what encoding will deal with a CSV with embedded quotes and embedded newlines within quoted columns (the contents of those columns are all plain ASCII). – mathematical.coffee Apr 04 '14 at 23:51