How do I ensure that re.findall() stops at the right place?

Question

Here is the code I have:

a='<title>aaa</title><title>aaa2</title><title>aaa3</title>'
import re
re.findall(r'<(title)>(.*)<(/title)>', a)

The result is:

[('title', 'aaa</title><title>aaa2</title><title>aaa3', '/title')]

If I ever designed a crawler to get me titles of web sites, I might end up with something like this rather than a title for the web site.

My question is, how do I limit findall to a single <title></title>?

You can use BeautifulSoup to parse HTML instead of Regex – Achrome Jul 20 '13 at 19:16
http://stackoverflow.com/a/1732454/193892 – Prof. Falken Aug 21 '13 at 06:54

score 13 · Answer 1 · Jon Clements · 2013-07-21T01:00:35.507

Use re.search instead of re.findall if you only want one match:

>>> s = '<title>aaa</title><title>aaa2</title><title>aaa3</title>'
>>> import re
>>> re.search('<title>(.*?)</title>', s).group(1)
'aaa'
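
One caveat with the snippet above: re.search returns None when nothing matches, so calling .group(1) on the result raises AttributeError for input that has no title. A minimal sketch of a guarded version follows; first_title is a hypothetical helper name used only for illustration, not something from the answer.

```python
import re

def first_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    # Non-greedy .*? stops at the first closing tag it can reach.
    match = re.search(r'<title>(.*?)</title>', html)
    # re.search returns None on no match, so check before calling .group().
    return match.group(1) if match else None

s = '<title>aaa</title><title>aaa2</title><title>aaa3</title>'
print(first_title(s))                        # aaa
print(first_title('<p>no title here</p>'))   # None
```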

If you want every title, keep findall but make the pattern non-greedy (i.e. use .*? instead of .*):

print(re.findall(r'<title>(.*?)</title>', s))
# ['aaa', 'aaa2', 'aaa3']

But really consider using BeautifulSoup or lxml or similar to parse HTML.
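
As a rough illustration of the parser-based approach, here is a sketch that uses the standard library's html.parser rather than BeautifulSoup or lxml, so it runs without installing anything. This is a swapped-in stand-in for the libraries named above, and the class name TitleCollector is an assumption made for this example.

```python
from html.parser import HTMLParser

class TitleCollector(HTMLParser):
    """Collect the text content of every <title> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased from HTMLParser.
        if tag == 'title':
            self._in_title = True
            self.titles.append('')

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current title.
        if self._in_title:
            self.titles[-1] += data

parser = TitleCollector()
parser.feed('<title>aaa</title><title>aaa2</title><title>aaa3</title>')
parser.close()
print(parser.titles)  # ['aaa', 'aaa2', 'aaa3']
```

Unlike the regex, this keeps working when the tag is written as <TITLE> or carries attributes.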

edited Jul 21 '13 at 01:00

answered Jul 20 '13 at 19:16


It's true that using regexen to parse HTML or XML is usually a bad idea. – Chip Camden Jul 20 '13 at 23:33

score 5 · Answer 2 · answered Jul 20 '13 at 19:21

Use a non-greedy search instead:

r'<(title)>(.*?)<(/title)>'

The question mark after the * makes the quantifier non-greedy: it matches as few characters as possible, so each match ends at the first </title> it reaches. findall() will then return one result per title.
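
To see the difference side by side, here is a small sketch run against the string from the question. The last two lines go slightly beyond the question: they assume a title might span several lines, in which case . needs the re.DOTALL flag to match the newline.

```python
import re

a = '<title>aaa</title><title>aaa2</title><title>aaa3</title>'

# Greedy: .* grabs as much as it can, so the match ends at the LAST </title>.
greedy = re.findall(r'<title>(.*)</title>', a)
print(greedy)  # ['aaa</title><title>aaa2</title><title>aaa3']

# Non-greedy: .*? grabs as little as it can, so each match ends at the
# FIRST </title> that follows the opening tag.
lazy = re.findall(r'<title>(.*?)</title>', a)
print(lazy)    # ['aaa', 'aaa2', 'aaa3']

# Assumption: a title containing a newline. By default . does not match '\n'.
multiline = '<title>line one\nline two</title>'
no_flag = re.findall(r'<title>(.*?)</title>', multiline)
with_flag = re.findall(r'<title>(.*?)</title>', multiline, flags=re.DOTALL)
print(no_flag)    # []
print(with_flag)  # ['line one\nline two']
```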

score 2 · Answer 3 · answered Jul 20 '13 at 19:16

re.findall(r'<(title)>(.*?)<(/title)>', a)

Add a ? after the *, so it will be non-greedy.

zhangyangyu

score 1 · Answer 4 · answered May 21 '14 at 08:55

This will be much easier with the BeautifulSoup module.

Codengine
