Identify all instances of problematic quotation marks

Question

I have a large (mostly well-formed) string variable that I turn into lists of dictionaries. I split the string on newline characters and run list(eval(i)) on each line i. This works for the majority of lines, but whenever an exception is thrown, I add the malformed line to a failed_attempt array. I have been inspecting the failed cases for an hour now, and I believe they fail whenever there is an extra quotation mark that does not delimit a key or value of a dictionary. For example,

eval('''[{"question":"What does "AR" stand for?","category":"DFB","answers":["Assault Rifle","Army Rifle","Automatic Rifle","Armalite Rifle"],"sources":["https://www.npr.org/2018/02/28/588861820/a-brief-history-of-the-ar-15"]}]''')

This fails because there are quotation marks around "AR". If you replace them with single quotation marks, e.g.

eval('''[{"question":"What does 'AR' stand for?","category":"DFB","answers":["Assault Rifle","Army Rifle","Automatic Rifle","Armalite Rifle"],"sources":["https://www.npr.org/2018/02/28/588861820/a-brief-history-of-the-ar-15"]}]''')

It now succeeds.

Similarly:

eval('''[{"question":"Test Question, Test Question?","category":"DFB","answers":["2004","1930","1981","This has never occurred"],"sources":[""SOWELL: Exploding myths""]}]''')

This fails due to the extra quotes around "SOWELL: Exploding myths", but again succeeds if you replace them with single quotes.

So I need a way to identify quotes that appear anywhere other than as delimiters of the dictionary's keys and values (the keys being question, category, answers, and sources) and replace them with single quotes. I'm not sure of the right way to do this.
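The setup described above can be sketched roughly as follows. This is a minimal sketch, not the asker's actual code: the function name `parse_lines` and the two shortened sample lines are assumptions made for illustration, and `ast.literal_eval` is swapped in for `eval` because it only accepts literals. It raises the same SyntaxError on a stray quote.

```python
import ast


def parse_lines(blob):
    """Parse each newline-delimited literal; collect the lines that fail."""
    parsed, failed_attempt = [], []
    for line in blob.splitlines():
        if not line.strip():
            continue  # skip empty lines
        try:
            parsed.append(ast.literal_eval(line))
        except (SyntaxError, ValueError):
            failed_attempt.append(line)
    return parsed, failed_attempt


# Hypothetical, shortened sample lines based on the first example.
good = '''[{"question":"What does 'AR' stand for?","category":"DFB"}]'''
bad = '''[{"question":"What does "AR" stand for?","category":"DFB"}]'''

parsed, failed_attempt = parse_lines(good + "\n" + bad)
print(len(parsed), "parsed,", len(failed_attempt), "failed")
```

Only the line with the embedded double quotes ends up in `failed_attempt`.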

@Wiktor's submission nearly does the trick, but will fail on the following:

example = '''[{"question":"Which of the following is NOT considered to be "interstate commerce" by the Supreme Court, and this cannot be regulated by Congress?","category":"DFB","answers":["ANSWER 1","ANSWER 2","ANSWER 3","All of these are considered "Interstate Commerce""],"sources":["SOURCE 1","SOURCE 2","SOURCE 3"]}]'''
re.sub(r'("\w+":[[{]*")(.*?)("(?:,|]*}))', lambda x: "{}{}{}".format(x.group(1),x.group(2).replace('"', "'"),x.group(3)), example)


Out[170]: '[{"question":"Which of the following is NOT considered to be \'interstate commerce\' by the Supreme Court, and this cannot be regulated by Congress?","category":"DFB","answers":["ANSWER 1","ANSWER 2","ANSWER 3","All of these are considered "Interstate Commerce""],"sources":["SOURCE 1","SOURCE 2","SOURCE 3"]}]'

Notice that the second set of double quotation marks on "Interstate Commerce" in the answers is not replaced.
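One way to see why the later list element is missed is to print what the regex actually captures. The sketch below uses the same pattern as above, except that the `[` inside the character class is escaped (`[\[{]`) to avoid Python's "possible nested set" FutureWarning; the variable names `pattern` and `values` are assumptions for illustration.

```python
import re

example = '''[{"question":"Which of the following is NOT considered to be "interstate commerce" by the Supreme Court, and this cannot be regulated by Congress?","category":"DFB","answers":["ANSWER 1","ANSWER 2","ANSWER 3","All of these are considered "Interstate Commerce""],"sources":["SOURCE 1","SOURCE 2","SOURCE 3"]}]'''

# Group 1 must start with a "key": prefix, so a match can only begin at a key.
pattern = re.compile(r'("\w+":[\[{]*")(.*?)("(?:,|]*}))')

# Group 2 is the only text in which quotes get replaced.
values = [m.group(2) for m in pattern.finditer(example)]
for value in values:
    print(repr(value))
```

For the two list-valued keys, group 2 stops at the first `",`, so it contains only the first element (`ANSWER 1`, `SOURCE 1`). The remaining elements are never preceded by a `"key":` prefix, so the quotes inside them are never examined.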

I don't know if it's possible to solve this problem in an unambiguous way. Consider the malformed dict literal `{"":"":""}`. You can make it a valid literal by replacing one pair of quotation marks, but there are two ways to do this, either `{"':'":""}` or `{"":"':'"}`. So whatever literal repairing algorithm you come up with, you can't be totally sure that the result is what the original author intended before their data became malformed. — Kevin, Oct 25 '19 at 13:54
Try `re.sub(r'("\w+":[[{]*")(.*?)("(?:,|]*}))', lambda x: "{}{}{}".format(x.group(1),x.group(2).replace('"', "'"),x.group(3)), text)`. Well, it is not actually what you may use, but it shows a way if you choose regex. — Wiktor Stribiżew, Oct 25 '19 at 13:54
The problem is in constructing the string to pass to eval. There are two solutions I can think of. When constructing the string by adding a value, check the value, and if it contains one type of quote, ensure you wrap it in the other, although this must fail if a value contains both types of quote. Or, much better, avoid eval completely and construct the list directly, rather than building a string that looks like the representation of a list and then evaluating it. — barny, Oct 25 '19 at 14:02
@Kevin, point taken, but for those cases, I can manually inspect them and drop them as they contain no relevant content. — Parseltongue, Oct 25 '19 at 14:11
@WiktorStribiżew your regex nearly does the trick, but fails on some cases. Thanks! I made an edit to the body to show an example — Parseltongue, Oct 25 '19 at 14:22
@Parseltongue just a simple question: are the keys fixed words like `question, category, answers, sources`, and is the `answers` value always a list? — Charif DZ, Oct 25 '19 at 14:39
@CharifDZ yes. Those four will always be the keys. And "answers" and "sources" will always be a list. — Parseltongue, Oct 25 '19 at 14:40
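Kevin's ambiguity point from the comments can be checked directly. This is a small sketch using `ast.literal_eval` (an assumption; `eval` behaves the same way here), with illustrative variable names:

```python
import ast

broken = '{"":"":""}'
repair_a = '''{"':'":""}'''  # first candidate repair
repair_b = '''{"":"':'"}'''  # second candidate repair

try:
    ast.literal_eval(broken)
    broken_parses = True
except SyntaxError:
    broken_parses = False

dict_a = ast.literal_eval(repair_a)
dict_b = ast.literal_eval(repair_b)

print(broken_parses)     # the malformed literal does not parse
print(dict_a == dict_b)  # the two repairs parse, but to different dicts
```

Both repairs are valid literals, yet one puts `':'` in the key and the other puts it in the value, so no repair rule can recover the original intent for certain.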

score 1 · Answer 1 · answered Oct 25 '19 at 15:09

Rather than converting the values extracted from this monster string back into a string representation of a list and then using eval(), take the values you already have in variables and append them to the list.

Or construct a dict from the values rather than creating a string representation of a dictionary and then evaluating it.
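As a rough illustration of that idea, and assuming you control the code that produces each line (which the question does not confirm), the dict can be built directly and serialized with the standard `json` module, which escapes embedded quotes for you. The `record` contents are taken from the question's first example; the variable names are illustrative.

```python
import json

# Build the data structure directly instead of gluing a string together.
record = {
    "question": 'What does "AR" stand for?',
    "category": "DFB",
    "answers": ["Assault Rifle", "Army Rifle", "Automatic Rifle", "Armalite Rifle"],
    "sources": ["https://www.npr.org/2018/02/28/588861820/a-brief-history-of-the-ar-15"],
}

line = json.dumps([record])   # inner quotes are written as \"
restored = json.loads(line)   # parses back without eval

print(line)
print(restored == [record])
```

With this approach the embedded quotes around AR survive the round trip unchanged, so nothing needs to be repaired afterwards.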

It doesn't help that you haven't put any code in your question, so these answers are sketchy. If you put a minimal reproducible example (https://stackoverflow.com/help/minimal-reproducible-example) in your question, with some very minimal data, a good example that doesn't cause an exception in eval() and a bad example that recreates the problem, then I should be able to better suggest how to apply my answer.

Your code must be doing something a bit like this:

import traceback

sourcesentences = [
     'this is no problem'
     ,"he said 'That is no problem'" 
     ,'''he said "It's a great day"''' 
]

# this is doomed if there is a double quote in the sentence
for sentence in sourcesentences:
    words = sentence.split()
    myliststring="[\""+"\",\"".join(words)+"\"]"    
    print( f"The sentence is >{sentence}<" )
    print( f"my string representation of the sentence is >{myliststring}<" )
    try:
        mylistfromstring = eval(myliststring)
        print( f"my list is >{mylistfromstring}<" )
    except SyntaxError as e:
        print( f"eval failed with SyntaxError on >{myliststring}<")
        traceback.print_exc()
    print()

This produces a SyntaxError on the third test sentence, the only one that contains double quotes.

Now let's try escaping characters in the variable before wrapping them in quotation marks:

# this adapts to a quote within the string
def safequote(s):
    if '"' in s:
        s = s.replace( '"','\\"' )
    return s

for sentence in sourcesentences:
    print( f"The sentence is >{sentence}<" )
    words = [safequote(s) for s in sentence.split()]
    myliststring="[\""+"\",\"".join(words)+"\"]"    
    print( f"my string representation of the sentence is >{myliststring}<" )
    try:
        mylistfromstring = eval(myliststring)
        print( f"my list is >{mylistfromstring}<" )
    except SyntaxError as e:
        print( f"eval failed with SyntaxError on >{myliststring}<")
        traceback.print_exc()
    print()

This works, but is there a better way?

Isn't it a lot simpler to avoid eval altogether? That means never constructing a string representation of the list, which in turn means no problems with quotation marks in the text:

for sentence in sourcesentences:
    print( f"The sentence is >{sentence}<" )
    words = sentence.split()
    print( f"my list is >{words}<" )
    print()

Hey Barny. Thanks so much for your hard work on this! I did give three (reproducible) examples that can be converted from a "bad" case to a "good" case by removing inappropriate quotation marks that are embedded inside quotation marks. My entire string is just a massive "\n"-delimited string of lists of dictionaries. Just imagine appending the first and second example next to each other with a newline in between. — Parseltongue, Oct 25 '19 at 15:13
The issue with the "escape" strategy is that there are many quotes that I don't want to escape because they correctly define the keys or the values of the dictionary. For example, I don't want to escape the quotation marks surrounding the words: `sources`, `answers`, `category`, and `question`, because you need those quotation marks to define the keys of a dictionary. Similarly, I have quotation marks around each of the values inside the `answers` list, because those define the unique string values of the list. — Parseltongue, Oct 25 '19 at 15:14
Your example `example = '''[{"question":"Which of the...` clearly isn't a well-formed piece of Python code, because it gives a syntax error. — barny, Oct 25 '19 at 15:15
I just successfully ran the code twice. What version of Python are you using? — Parseltongue, Oct 25 '19 at 15:17
Python 3.7.2 - I catch the SyntaxError exception and print the backtrace — barny, Oct 25 '19 at 15:17
You're getting a SyntaxError merely defining a string variable? That's weird. I'm on Python 3.6.5, and this works fine: ```example = '''[{"question":"Which of the following is NOT considered to be "interstate commerce" by the Supreme Court, and this cannot be regulated by Congress?","category":"DFB","answers":["ANSWER 1","ANSWER 2","ANSWER 3","All of these are considered "Interstate Commerce""],"sources":["SOURCE 1","SOURCE 2","SOURCE 3"]}]'''``` — Parseltongue, Oct 25 '19 at 15:19
Yes that would be weird if it were what is happening. SyntaxError is from the eval parsing the string representation of the list. — barny, Oct 25 '19 at 19:35

Charif DZ · Accepted Answer · 2019-10-25T15:29:16.050

Try this. I know it will work for the question and category key values, and I hope I didn't forget any case for the list values:

import re


def escape_quotes(match):
    """Escape stray quotes in the value captured by the second group."""
    # match any quote except the list delimiters `["`, `","` and `"]`
    RE_ESCAPE_QUOTES_IN_LIST = re.compile(r'(?<!\[)(?<!",)"(?!,"|\])')

    def escape_quote_in_string(string):
        # keep the outer quotes, turn every inner double quote into a single quote
        return '"{}"'.format(string[1:-1].replace('"', "'"))

    key, value = match.groups()
    # this reliably fixes the values of these keys
    if any(e in key for e in ('question', 'category')):
        value = escape_quote_in_string(value)
    if any(e in key for e in ('answers', 'sources')):
        # keep only `["`, `","` and `"]`; replace any other quote
        value = RE_ESCAPE_QUOTES_IN_LIST.sub("'", value)

    return f'{key}{value}'


# test cases
exps = ['''[{"question":"What does "AR" stand for?"}]''',
        '''[{"sources":[""SOWE"LL: Ex"ploding myths""]}]''',
        '''[{"question":"Test ", Test" Que"sti"on?","sources":[""SOWELL: Ex""ploding myths""]}]''']

# extract each key/value pair of the expression; this is easy because the keys are fixed
key = '(?:"(?:question|category|answers|sources)":)'
RE_KEY_VALUE = re.compile(rf'({key})(.+?)\s*(?=,\s*{key}|}})', re.S)

# test all cases
for exp in exps:
    # escape normal quotes
    exp = RE_KEY_VALUE.sub(escape_quotes, exp)
    print(eval(exp))

# [{'question': "What does 'AR' stand for?"}]
# [{'sources': ["'SOWE'LL: Ex'ploding myths'"]}]
# [{'question': "Test ', Test' Que'sti'on?", 'sources': ["'SOWELL: Ex''ploding myths'"]}]

edited Oct 25 '19 at 15:29

answered Oct 25 '19 at 15:22


Sorry about the mess in the `escape_quote_in_string` method; I fixed that. I tried this on other cases and it still works, but if you find a case that does not work, it will be easy to fix the code to handle it. – Charif DZ Oct 25 '19 at 15:31
Thanks so much! Testing this now, but do you anticipate this successfully handling quotes around the values of the sources and answers lists? All dictionaries will have an answers array that looks like: ```"answers": ["A", "B", "C", "Don't pick "D""]``` I'm worried this might not capture the double quotes around the letter "D". – Parseltongue Oct 25 '19 at 15:33
Just checked. This works fabulously! Thanks so much! – Parseltongue Oct 25 '19 at 15:36
@Parseltongue The only quotes that will not be escaped are `["`, `","` and `"]`, so in `"""]` the first quotes will be escaped and the last one will not, because it's followed by `]`. A very nice question, glad I helped you ^^ – Charif DZ Oct 25 '19 at 15:41
Can't tell you how grateful I am! You saved me easily two days of work, manually deleting the duplicate quotes. – Parseltongue Oct 25 '19 at 15:43
Thank you ^^, just one small change needs to be made: ignore whitespace before the value of a key. Add `\s*` before the second group: `'({key})\s*(.+?)\s*(?=,\s*{key}|}})'`. – Charif DZ Oct 25 '19 at 16:08
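The list-delimiter rule discussed in these comments can be tried in isolation. The sketch below reuses the lookarounds from the answer on a compact version of the `answers` example from the comments; the constant name `LIST_DELIMITER_SAFE_QUOTE` and the two sample strings are assumptions for illustration.

```python
import re

# Same lookarounds as in the answer: leave `["`, `","` and `"]` alone.
LIST_DELIMITER_SAFE_QUOTE = re.compile(r'(?<!\[)(?<!",)"(?!,"|\])')

compact = '''["A","B","C","Don't pick "D""]'''  # no spaces after the commas
spaced = '''["A", "B"]'''                       # spaces after the commas

fixed_compact = LIST_DELIMITER_SAFE_QUOTE.sub("'", compact)
fixed_spaced = LIST_DELIMITER_SAFE_QUOTE.sub("'", spaced)

print(fixed_compact)
print(fixed_spaced)
```

In the compact form only the quotes around D are replaced. In the spaced form the pattern as written no longer recognizes `", "` as a delimiter and rewrites those quotes too, so data with spaces between list items would need `\s*` allowances similar to the one suggested for the key/value regex.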

score 0 · Answer 3 · answered Oct 25 '19 at 13:57

If your text is stored in a variable, say text, you can use re.sub():

re.sub(r'(\s")|("\s)', ' ', text)

answered Oct 25 '19 at 13:57

zipa


This doesn't actually appear to make the modification. For example ```test = '''[{"question":"Test Question, Test Question?","category":"DFB","answers":["2004","1930","1981","This has never occurred"],"sources":[""SOWELL: Exploding myths""]}]'''``` ```re.sub('(\s")|("\s)', ' ', test)``` ```Out[153]: '[{"question":"Test Question, Test Question?","category":"DFB","answers":["2004","1930","1981","This has never occurred"],"sources":[""SOWELL: Exploding myths""]}]'``` – Parseltongue Oct 25 '19 at 14:15
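A quick check shows why nothing changes in that test: the pattern only matches a double quote that touches whitespace, and none of the quotes in the sample do. The variable names below are illustrative.

```python
import re

test = '''[{"question":"Test Question, Test Question?","category":"DFB","answers":["2004","1930","1981","This has never occurred"],"sources":[""SOWELL: Exploding myths""]}]'''

pattern = r'(\s")|("\s)'

match = re.search(pattern, test)        # None when no quote touches whitespace
unchanged = re.sub(pattern, ' ', test) == test

print(match)
print(unchanged)
```

Every space in the sample sits between two words, so the substitution has nothing to replace.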
