
I know that for parsing I should ideally remove all spaces and line breaks, but I was just doing this as a quick fix for something I was trying, and I can't figure out why it's not working. I have wrapped different areas of text in my document with markers like "####1" and am trying to parse based on them, but it's just not working no matter what I try. I think I am using multiline correctly. Any advice is appreciated.

This returns no results at all:

string='
####1
ttteest
####1
ttttteeeestt

####2   

ttest
####2'

import re
pattern = '.*?####(.*?)####'
returnmatch = re.compile(pattern, re.MULTILINE).findall(string)
return returnmatch
Rick
  • 1
    It won't run period because you're not using multi-line string symbols `'''` or `"""` – Nick T Aug 20 '10 at 20:13
  • OK, I missed this concept completely then; I will dig through the re documentation to find where it mentions this. Thanks – Rick Aug 20 '10 at 20:15
  • 3
    Your assignment to `string` is a syntax error. Did you mean to use `'''`? – msw Aug 20 '10 at 20:15
  • No, I'm new to Python, so I didn't know about the multiline string delimiter – Rick Aug 20 '10 at 20:20

2 Answers


re.MULTILINE doesn't make . match a line break; it only changes ^ and $ so that they also match at the start and end of each line.

re.M re.MULTILINE

When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.

re.S (or re.DOTALL) is the flag that makes . match newlines as well.

Source

http://docs.python.org/
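
A small sketch of the difference between the two flags. The sample text and variable names are made up for illustration and are not taken from the question.

```python
import re

# Hypothetical sample text, made up for this illustration.
text = "a1\nb2\nc3"

# Without flags, ^ only matches at the very start of the string.
no_flag = re.findall(r"^\w", text)

# re.MULTILINE lets ^ match at the start of every line as well.
multiline = re.findall(r"^\w", text, re.MULTILINE)

# By default . stops at a newline, so .+ only covers the first line.
dot_default = re.match(r".+", text).group()

# re.DOTALL lets . match newlines too, so .+ covers the whole string.
dot_all = re.match(r".+", text, re.DOTALL).group()

print(no_flag)            # ['a']
print(multiline)          # ['a', 'b', 'c']
print(repr(dot_default))  # 'a1'
print(repr(dot_all))      # 'a1\nb2\nc3'
```

The two flags are independent and can be combined with `|` when a pattern needs both behaviours.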

Colin Hebert

Try re.findall(r"####(.*?)\s(.*?)\s####", string, re.DOTALL) (works with re.compile too, of course).

This regexp will return tuples containing the number of the section and the section content.

For your example, this will return [('1', 'ttteest'), ('2', ' \n\nttest')].

(BTW: your example won't run as posted; for multiline strings, use ''' or """.)
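
Putting the pieces together, a runnable version might look like the sketch below. It assumes the sample text from the question with the trailing spaces after the first ####2 left out, so the whitespace in the second result differs slightly from the output quoted above. The names `text` and `section_re` are illustrative.

```python
import re

# Triple quotes allow a string literal to span several lines.
# Assumption: same sample as the question, minus trailing spaces after "####2".
text = """
####1
ttteest
####1
ttttteeeestt

####2

ttest
####2"""

# re.DOTALL lets . cross line breaks; group 1 is the marker number,
# group 2 is the text between the opening and closing marker.
section_re = re.compile(r"####(.*?)\s(.*?)\s####", re.DOTALL)

sections = section_re.findall(text)
print(sections)  # [('1', 'ttteest'), ('2', '\nttest')]
```

Calling `.strip()` on the second element of each tuple would remove the leftover blank lines if only the section text is needed.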

leoluk