1

I'm trying to extract a simple sentence from a string delimited with a # character.

str = "#text text text \n text#"

with this pattern

pattern = '#(.+)#'

Now, the funny thing is that the regular expression doesn't match when the string contains a newline character:

out = re.findall(pattern, str) # out contains empty []

But if I remove \n from the string, it works fine. Any idea how to fix this?

phant0m
Zed
  • Careful: quantifiers in regular expressions are greedy by default. A string like `"#text text \n text##"` will be matched with the second `#` included. Use Dima's solution to avoid that, or use the non-greedy variant: `'#(.+?)#'` with `re.DOTALL`. – Dec 12 '12 at 15:16
  • @Evert http://stackoverflow.com/questions/13842633/python-regular-expression-fails-if-newline-included#comment19053779_13842679 ;) – phant0m Dec 12 '12 at 15:20
  • @phant0m I don't get your point. That answer still has the greediness caveat. –  Dec 12 '12 at 15:43
  • @Evert How so? It can't match any `#`s in between the two delimiting `#`s, which essentially makes it non-greedy. – phant0m Dec 12 '12 at 15:49
  • You know, I am glad I haven't had use for string matching yet. Regular expressions look like black magic to me. – arynaq Dec 12 '12 at 16:18
  • @phant0m Have you tried? `>>> import re; re.findall('#(.+)#', "#text text \n text##", re.DOTALL)` results in `['text text \n text#']`, matching the second #. –  Dec 12 '12 at 17:56
  • @Evert I linked to this regex: `#([^#]+)#`. Sorry for the confusion. Nevermind. – phant0m Dec 12 '12 at 17:57
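
The greediness point debated in these comments can be checked directly. This is a minimal sketch using the test string from the comments; the variable names (`text`, `greedy`, `lazy`, `negated`) are made up for illustration.

```python
import re

# Test string from the comments: note the doubled '#' at the end.
text = "#text text \n text##"

# Greedy: '.+' runs to the LAST '#', so the first trailing '#' is captured.
greedy = re.findall(r'#(.+)#', text, re.DOTALL)

# Non-greedy: '.+?' stops at the FIRST '#' it can end on.
lazy = re.findall(r'#(.+?)#', text, re.DOTALL)

# Negated class: '[^#]+' can never cross a '#', and needs no DOTALL.
negated = re.findall(r'#([^#]+)#', text)

print(greedy)   # ['text text \n text#']
print(lazy)     # ['text text \n text']
print(negated)  # ['text text \n text']
```

On this input the non-greedy and negated-class patterns agree; only the greedy one swallows the extra `#`.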

4 Answers

6

Also pass the re.DOTALL flag, which makes the . match truly everything.

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
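
To see what the flag changes in isolation, here is a small sketch on a made-up three-character string (`sample` is an illustrative name, not from the question):

```python
import re

sample = "a\nb"

# Default: '.' skips the newline.
without_flag = re.findall(r'.', sample)

# With re.DOTALL: '.' matches the newline as well.
with_flag = re.findall(r'.', sample, re.DOTALL)

print(without_flag)  # ['a', 'b']
print(with_flag)     # ['a', '\n', 'b']
```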

phant0m
5

Use re.DOTALL if you want . to match newlines as well:

>>> import re
>>> my_str = "#text text text \n text#"
>>> out = re.findall('#(.+)#', my_str, re.DOTALL)
>>> out
['text text text \n text']

Also, it's not a good idea to use built-in names as your variable names. Use my_str instead of str.

Rohit Jain
2

Try this regex "#([^#]+)#"

It will match everything between a pair of delimiters, newlines included, because a negated character class such as [^#] also matches \n.
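
A quick sketch of how this behaves, assuming the string from the question; the second string (`two_fields`) is a hypothetical input added only to show where the semantics differ from the greedy `.+` pattern:

```python
import re

my_str = "#text text text \n text#"

# No flag needed: [^#] matches any character except '#', including '\n'.
out = re.findall(r'#([^#]+)#', my_str)
print(out)  # ['text text text \n text']

# Hypothetical input with two delimited fields.
two_fields = "#one# and #two#"
per_field = re.findall(r'#([^#]+)#', two_fields)
greedy = re.findall(r'#(.+)#', two_fields)
print(per_field)  # ['one', 'two']
print(greedy)     # ['one# and #two']
```

Which result is wanted depends on whether `#` can legitimately appear inside the delimited text.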

StoryTeller - Unslander Monica
  • This does not have the same semantics. This will stop the match at the first `#`, instead of the last `#`, which is probably what the OP intended. So +1 for that. – phant0m Dec 12 '12 at 15:13
0

Add the DOTALL flag to your compile or match.
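
For example, a minimal sketch of both options, assuming the string and pattern from the question (variable names are illustrative):

```python
import re

my_str = "#text text text \n text#"

# Option 1: pass the flag when compiling, then reuse the compiled pattern.
compiled = re.compile(r'#(.+)#', re.DOTALL)
from_compiled = compiled.findall(my_str)

# Option 2: pass the flag directly to the matching function.
match = re.match(r'#(.+)#', my_str, re.DOTALL)
from_match = match.group(1) if match else None

print(from_compiled)  # ['text text text \n text']
print(repr(from_match))  # 'text text text \n text'
```

Note that `re.match` only matches at the start of the string, which works here because the string begins with `#`.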

Ignacio Vazquez-Abrams