Double quotes inside single quotes inside an re expression (python)

Question

I am new to Python. I was going through a repository on GitHub, and I saw the following line of code to extract all URLs from a webpage. I understand regular expressions and capture groups, but I don't understand why there are extra double quotation marks enclosed within the single quotation marks.

links = re.findall('"((http|ftp)s?://.*?)"', html)

That is, how is it different from the following code?

links = re.findall('((http|ftp)s?://.*?)', html)

I tried experimenting and saw that only the first one matches the URL syntax correctly, while the second one doesn't. But I don't understand why.

Any help is appreciated.

Thank you.

Try it out at http://pythex.org/. Or just make some test strings and try it out in the interpreter. — wwii, Jul 10 '16 at 19:48
Best tool to see what the pattern does is [regex101.com](http://regex101.com). — Wiktor Stribiżew, Jul 10 '16 at 19:58
I can't find the duplicate of my question. Can you please point me to the exact link? @WiktorStribiżew — nilanjanaLodh, Jul 10 '16 at 20:03

score 1 · Accepted Answer · answered Jul 10 '16 at 19:52

The double quotes are part of the regex. They ensure that the pattern only matches a URL that is actually surrounded by double quotes; so foo bar http://whatever.com wouldn't match, but <a href="http://whatever.com"> will.
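A quick way to see the difference is to run both patterns on the same string. The sketch below uses a made-up `html` value built from the two examples above. It also shows why the second pattern looks broken: with nothing after the lazy `.*?`, the quantifier is free to match zero characters, so every match stops right after `://`. Because the pattern has two capture groups, `re.findall` returns a tuple per match.

```python
import re

# Sample input (an assumption for illustration): one bare URL, one quoted URL.
html = 'foo bar http://whatever.com <a href="http://whatever.com">'

# With literal double quotes: the lazy .*? has to stretch to the closing quote.
quoted = re.findall('"((http|ftp)s?://.*?)"', html)

# Without quotes: nothing follows .*?, so it matches the empty string.
unquoted = re.findall('((http|ftp)s?://.*?)', html)

print(quoted)    # [('http://whatever.com', 'http')]
print(unquoted)  # [('http://', 'http'), ('http://', 'http')]
```

The single quotes are only Python's string delimiters; the double quotes inside them are ordinary characters that the regex must find in the text.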

Note this is a really fragile way of doing things, though, since single quotes are also valid in HTML but wouldn't match the regex.
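To illustrate the fragility, here is a small sketch on a made-up `html` string that mixes both quoting styles. The second pattern is a hypothetical variant, not something from the original repository: it captures the opening quote and uses the backreference `\1` to require the same closing quote. It is still a regex heuristic and would miss unquoted attribute values.

```python
import re

# Sample input (an assumption): one single-quoted and one double-quoted href.
html = "<a href='http://whatever.com'>single</a> <a href=\"http://example.org\">double</a>"

# The original pattern only sees the double-quoted URL.
original = re.findall('"((http|ftp)s?://.*?)"', html)
original_urls = [m[0] for m in original]
print(original_urls)  # ['http://example.org']

# Hypothetical variant: group 1 is the opening quote, \1 demands the same one to close.
either_quote = re.compile(r'''(["'])((?:http|ftp)s?://.*?)\1''')
variant_urls = [m.group(2) for m in either_quote.finditer(html)]
print(variant_urls)  # ['http://whatever.com', 'http://example.org']
```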

Daniel Roseman

Thanks a lot. This answered my question :) – nilanjanaLodh Jul 10 '16 at 20:04
