I am new to Python and am trying to extract the substrings between single quotes. How can I do this with a regex?

Example input:

 text = "[(u'apple',), (u'banana',)]"

I want to extract apple and banana as list items, like ['apple', 'banana'].

Bhargav Rao
mgokhanbakal
  • Why do you want to do this? This smells like an [XY problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). – Kevin Mar 19 '15 at 19:00
  • Pre-emptive note to potential answerers: if you give a solution using regex, make sure that it works on tricky strings like `"[(u'this string contains\' an escaped quote mark and\\ an escaped slash',)]"` – Kevin Mar 19 '15 at 19:01
  • You can try a non-greedy regex, `'.*?'`, but it does not handle the cases Kevin mentioned. It does work on the sample input you provided, though. – Bhargav Rao Mar 19 '15 at 19:08

3 Answers

Alternatively, you can use ast.literal_eval and then extract the first item of each tuple with a list comprehension:

from ast import literal_eval

text = "[(u'apple',), (u'banana',)]"

literal_eval(text)
Out[3]: [(u'apple',), (u'banana',)]

[t[0] for t in literal_eval(text)]
Out[4]: [u'apple', u'banana']
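For what it's worth, this approach also seems to cope with the tricky string Kevin posted in the comments, because the parsing is done by Python's own literal parser rather than by a pattern. Below is a minimal sketch, assuming the input is always a literal list of one-element tuples; the helper name extract_first_items is made up for illustration and is not part of the original answer.

```python
from ast import literal_eval


def extract_first_items(text):
    """Parse a literal list of tuples and return the first item of each tuple.

    Hypothetical helper: assumes `text` is a valid Python literal.
    """
    return [t[0] for t in literal_eval(text)]


simple = "[(u'apple',), (u'banana',)]"
# Raw string, so the backslashes reach literal_eval exactly as written.
tricky = r"[(u'this string contains\' an escaped quote mark and\\ an escaped slash',)]"

print(extract_first_items(simple))     # ['apple', 'banana'] on Python 3
print(extract_first_items(tricky)[0])  # the unescaped string content
```

Note that literal_eval raises an exception (ValueError or SyntaxError) when the text is not a valid literal, so it only fits inputs that really are Python literals.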
Anzel
import re

text = "[(u'apple',), (u'banana',)]"

# Capture whatever sits between (u' and ',) for each tuple.
print(re.findall(r"\(u'(.*?)',\)", text))
# ['apple', 'banana']

text = "[(u'this string contains\' an escaped quote mark and\\ an escaped slash',)]"
print(re.findall(r"\(u'(.*?)',\)", text)[0])
# this string contains' an escaped quote mark and\ an escaped slash
Padraic Cunningham

In the general case, to extract any characters between single quotes, the most efficient regex approach is:

re.findall(r"'([^']*)'", text) # to also extract empty values
re.findall(r"'([^']+)'", text) # to only extract non-empty values

See the regex demo.

Details

  • ' - a single quote (no need to escape inside a double quote string literal)
  • ([^']*) - a capturing group that captures any 0+ (or 1+ if you use + quantifier) chars other than ' (the [^...] is a negated character class that matches any chars other than those specified in the class)
  • ' - a closing single quote.

Note that re.findall only returns captured substrings if capturing groups are specified in the pattern:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
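To make the quoted behaviour concrete, here is a small sketch comparing zero, one, and two capturing groups on the sample input. The two-group pattern is only an illustration and is not something the question needs.

```python
import re

text = "[(u'apple',), (u'banana',)]"

# No capturing group: findall returns the whole match, quotes included.
print(re.findall(r"'[^']*'", text))       # ["'apple'", "'banana'"]

# One capturing group: findall returns only the group contents.
print(re.findall(r"'([^']*)'", text))     # ['apple', 'banana']

# Two capturing groups: findall returns a list of tuples.
print(re.findall(r"(u)'([^']*)'", text))  # [('u', 'apple'), ('u', 'banana')]
```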

Python demo:

import re
text = "[(u'apple',), (u'banana',)]"
print(re.findall(r"'([^']*)'", text))
# => ['apple', 'banana']

Escaped quote support

If you need to support escaped quotes (so as to match abc\'def in 'abc\'def'), you will need a regex like

re.findall(r"'([^'\\]*(?:\\.[^'\\]*)*)'", text, re.DOTALL) # in case the text contains only "valid" pairs of quotes
re.findall(r"(?<!\\)(?:\\\\)*'([^'\\]*(?:\\.[^'\\]*)*)'", text, re.DOTALL) # if your text is too messed up and there can be "wild" single quotes out there

See regex variation 1 and regex variation 2 demos.
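As a rough illustration of how the three patterns differ, the sketch below runs them on two small inputs: one with an escaped quote inside a quoted string, and one with a stray escaped quote outside of any pair. The constant names and the two test strings are my own, chosen for illustration only.

```python
import re

# Pattern names are hypothetical labels for the regexes discussed above.
SIMPLE = r"'([^']*)'"
ESCAPE_AWARE = r"'([^'\\]*(?:\\.[^'\\]*)*)'"
ESCAPE_AWARE_STRICT = r"(?<!\\)(?:\\\\)*'([^'\\]*(?:\\.[^'\\]*)*)'"

# An escaped quote inside a quoted string.
escaped = r"'abc\'def'"
print(re.findall(SIMPLE, escaped))                   # stops at the escaped quote
print(re.findall(ESCAPE_AWARE, escaped, re.DOTALL))  # keeps abc\'def together

# A "wild" escaped quote that does not open a quoted string.
stray = r"\'abc 'def'"
print(re.findall(ESCAPE_AWARE, stray, re.DOTALL))         # treats \' as an opener
print(re.findall(ESCAPE_AWARE_STRICT, stray, re.DOTALL))  # skips it, finds def
```

The strict variant costs a lookbehind at every position, so it is probably only worth using when stray escaped quotes can actually occur in the text.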

Pattern details

  • (?<!\\) - a negative lookbehind that fails the match if there is a backslash immediately to the left of the current position
  • (?:\\\\)* - 0 or more consecutive double backslashes (since these are not escaping the neighboring character)
  • ' - an open '
  • ([^'\\]*(?:\\.[^'\\]*)*) - Group 1 (what will be returned by re.findall), matching:
    • [^'\\]* - 0 or more chars other than ' and \
    • (?: - start of a non-capturing group that matches
      • \\. - any escaped char (a backslash and any char including line breaks due to the re.DOTALL modifier)
      • [^'\\]* - 0 or more chars other than ' and \
    • )* - ... zero or more times
  • ' - a closing '.
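The same breakdown can be written directly into the pattern with re.VERBOSE, which ignores whitespace outside character classes and allows # comments. This is only a sketch of an equivalent spelling; the names QUOTED and sample are made up here.

```python
import re

# Same pattern as above, spelled out with inline comments (re.VERBOSE).
QUOTED = re.compile(r"""
    (?<!\\)          # no backslash immediately to the left
    (?:\\\\)*        # zero or more pairs of backslashes
    '                # opening quote
    (                # group 1: the content returned by findall
        [^'\\]*      #   chars other than quote and backslash
        (?:
            \\.      #   a backslash plus any char (DOTALL: newlines too)
            [^'\\]*  #   more chars other than quote and backslash
        )*           #   ... zero or more times
    )
    '                # closing quote
""", re.VERBOSE | re.DOTALL)

sample = r"[(u'apple',), (u'abc\'def',)] \\'abc''def' \\\'abc   'abc\\\\\'def'"
for item in QUOTED.findall(sample):
    print(item)
```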

See another Python demo:

import re
text = r"[(u'apple',), (u'banana',)] [(u'apple',), (u'banana',), (u'abc\'def',)] \\'abc''def' \\\'abc   'abc\\\\\'def'"
print(re.findall(r"(?<!\\)(?:\\\\)*'([^'\\]*(?:\\.[^'\\]*)*)'", text))
# => apple, banana, apple, banana, abc\'def, abc, def, abc\\\\\'def
Wiktor Stribiżew