python multiline regular expressions

Question

How do I extract all the characters (including newline characters) up to the first occurrence of a given sequence of characters? For example, with the following input:

input text:

"shantaram is an amazing novel.
It is one of the best novels i have read.
the novel is written by gregory david roberts.
He is an australian"

The sequence is "the". I want to extract the text from "shantaram" up to the first occurrence of "the", which is on the second line.

The output must be:

shantaram is an amazing novel.
It is one of the

I have been trying all morning. I can write the expression to extract all characters until it encounters a specific character but here if I use an expression like:

re.search("shantaram[\s\S]*the", string)

It doesn't stop at the first "the" on the second line.

"Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results" — zero323, Sep 22 '13 at 11:10
I have been trying since morning. I can write an expression that extracts all characters until it encounters a specific character. But here, if I use an expression like re.search("shantaram[\s\S]*the", string), it doesn't work, because "the" is itself matched by [\s\S] and the extraction doesn't stop there. — AKASH, Sep 22 '13 at 11:16

Chris Seymour · Answer 1 · 2013-09-22T11:18:23.667

You want to use the DOTALL flag so that '.' matches across newlines. From docs.python.org:

re.DOTALL

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.

Demo:

In [1]: import re

In [2]: s="""shantaram is an amazing novel.
It is one of the best novels i have read.
the novel is written by gregory david roberts.
He is an australian"""

In [3]: print(re.findall(r'^.*?the', s, re.DOTALL)[0])
shantaram is an amazing novel.
It is one of the
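
For Python 3, the same idea can be sketched with re.search and a raw-string pattern. This is a minimal sketch, assuming the match should start at the literal word "shantaram" as in the question; the names text, match and result are illustrative only.

```python
import re

text = """shantaram is an amazing novel.
It is one of the best novels i have read.
the novel is written by gregory david roberts.
He is an australian"""

# re.DOTALL lets '.' match newlines; the lazy '.*?' stops at the first "the".
match = re.search(r"shantaram.*?the", text, re.DOTALL)

# re.search returns None when nothing matches, so guard before calling group().
result = match.group(0) if match else None
print(result)
```

Without re.DOTALL the same pattern finds nothing in this text, because the first line contains no "the" and '.' cannot cross the newline.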

score 5 · Answer 2 · answered Sep 22 '13 at 11:49

Use this regex:

re.search("shantaram[\s\S]*?the", string)

instead of

re.search("shantaram[\s\S]*the", string)

The only difference is the '?'. Adding '?' after a quantifier (e.g. *?, +?) makes it non-greedy, so it matches as few characters as possible instead of the longest possible string.
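
A small side-by-side comparison may make the difference concrete. This is only a sketch on the sample text from the question; the names text, greedy and lazy are made up for illustration.

```python
import re

text = """shantaram is an amazing novel.
It is one of the best novels i have read.
the novel is written by gregory david roberts.
He is an australian"""

# Greedy: [\s\S]* grabs as much as possible, then backtracks to the LAST "the".
greedy = re.search(r"shantaram[\s\S]*the", text).group(0)

# Lazy: [\s\S]*? grabs as little as possible, stopping at the FIRST "the".
lazy = re.search(r"shantaram[\s\S]*?the", text).group(0)

print(repr(greedy))
print(repr(lazy))
```

Since [\s\S] already matches newlines, no flag is needed here. The greedy version does cross lines; it simply runs on to the last "the" in the text, which is at the start of the third line.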

lancif

score 1 · Answer 3 · answered Sep 22 '13 at 11:24

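A solution not using regex (it compares whole space-separated words, so it assumes the stop word appears in the text as a separate word):

from itertools import takewhile
def upto(a_string, stop):
    # Split on spaces only, so newlines stay inside the tokens and survive the join.
    words = takewhile(lambda word: word != stop, a_string.split(" "))
    return " ".join(words) + " " + stop  # re-append stop, as in the expected output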

rlms
