11

How do I extract all the characters (including newline characters) up to the first occurrence of a given sequence of words? For example, with the following input:

input text:

"shantaram is an amazing novel.
It is one of the best novels i have read.
the novel is written by gregory david roberts.
He is an australian"

The sequence is the word "the". I want to extract the text from "shantaram" up to the first occurrence of "the", which is on the second line.

The output must be:

shantaram is an amazing novel.
It is one of the

I have been trying all morning. I can write an expression that extracts all characters until it encounters a specific character, but here, if I use an expression like:

re.search("shantaram[\s\S]*the", string)

It doesn't match across newlines.

Chris Seymour
AKASH
  • Have you tried anything? – Rohit Jain Sep 22 '13 at 11:10
  • "Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results" – zero323 Sep 22 '13 at 11:10
  • I have been trying since this morning. I can write an expression that extracts all characters until it encounters a specific character. But here, if I use an expression like re.search("shantaram[\s\S]*the", string), it doesn't work, because "the" is itself matched by [\s\S] and the extraction does not stop there. – AKASH Sep 22 '13 at 11:16

3 Answers

26

You want to use the DOTALL option to match across newlines. From docs.python.org:

re.DOTALL

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.

Demo:

In [1]: import re

In [2]: s="""shantaram is an amazing novel.
It is one of the best novels i have read.
the novel is written by gregory david roberts.
He is an australian"""

In [3]: print(re.findall('^.*?the', s, re.DOTALL)[0])
shantaram is an amazing novel.
It is one of the
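
The same idea can be sketched with re.search, which returns None when nothing matches (indexing the result of findall with [0] raises IndexError in that case). The helper name extract_until and its parameters are hypothetical, chosen only for illustration:

```python
import re

def extract_until(text, start, stop):
    """Return the text from `start` through the first following `stop`, or None."""
    # re.escape keeps both words literal; '.*?' is non-greedy;
    # re.DOTALL lets '.' match newline characters as well.
    pattern = re.escape(start) + r".*?" + re.escape(stop)
    match = re.search(pattern, text, re.DOTALL)
    return match.group(0) if match else None

s = """shantaram is an amazing novel.
It is one of the best novels i have read.
the novel is written by gregory david roberts.
He is an australian"""

print(extract_until(s, "shantaram", "the"))
```

With the sample text this prints the first line followed by "It is one of the".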
Chris Seymour
5

Use this regex,

re.search("shantaram[\s\S]*?the", string)

instead of

re.search("shantaram[\s\S]*the", string)

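The only difference is the '?'. Appending '?' to a quantifier (e.g. *?, +?) makes it non-greedy: it matches as few characters as possible, so the match stops at the first "the" instead of the last one.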
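
A small comparison, using the sample text from the question, may make the difference concrete. The variable names greedy and lazy are only illustrative:

```python
import re

s = """shantaram is an amazing novel.
It is one of the best novels i have read.
the novel is written by gregory david roberts.
He is an australian"""

# Greedy: [\s\S]* grabs as much as it can, then backs up to the LAST "the".
greedy = re.search(r"shantaram[\s\S]*the", s).group(0)

# Non-greedy: [\s\S]*? grows one character at a time and stops at the FIRST "the".
lazy = re.search(r"shantaram[\s\S]*?the", s).group(0)

print(repr(greedy))
print(repr(lazy))
```

Both patterns match across newlines, because [\s\S] matches any character; only the stopping point differs.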

lancif
1

A solution not using regex:

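from itertools import takewhile
def upto(a_string, stop):
    # Split on spaces only, so newlines stay attached to their words.
    words = a_string.split(" ")
    # Take words until the stop word, alone or preceded by a newline (stop word excluded).
    return " ".join(takewhile(lambda x: x != stop and x != "\n{}".format(stop), words))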
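
If the stop word should be included in the result, as in the expected output of the question, another regex-free option is str.partition. This is a minimal sketch with a hypothetical helper name, upto_inclusive. Note that it matches "the" as a plain substring, so it would also stop inside a word such as "other":

```python
def upto_inclusive(text, stop):
    """Return text up to and including the first occurrence of `stop`, or None."""
    # partition splits at the first occurrence; sep is '' when stop is absent.
    head, sep, _tail = text.partition(stop)
    return head + sep if sep else None

s = """shantaram is an amazing novel.
It is one of the best novels i have read.
the novel is written by gregory david roberts.
He is an australian"""

print(upto_inclusive(s, "the"))
```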
rlms