
I am new to Python. I was going through a repository on GitHub, and I saw the following line of code to extract all URLs from a webpage. I understand regular expressions and capture groups, but I don't understand why there are extra double quotation marks enclosed within the single quotation marks.

links = re.findall('"((http|ftp)s?://.*?)"', html)

That is, how is it different from the following code?

links = re.findall('((http|ftp)s?://.*?)', html)

I tried experimenting and saw that only the first one matches complete URLs correctly, while the second one doesn't. But I don't understand why.

Any help is appreciated.

Thank you.

nilanjanaLodh

1 Answer


The double quotes are part of the regex. They ensure that the pattern only matches a URL that is actually surrounded by double quotes; so foo bar http://whatever.com wouldn't match, but <a href="http://whatever.com"> will.
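To see why the version without the quotes falls short, here is a small sketch you can run. The `html` string is made up for illustration. Without a closing `"` to stop at, the lazy `.*?` is allowed to match zero characters, so the second pattern captures nothing after `://`. Note also that because the pattern has two capture groups, `re.findall` returns tuples rather than plain strings.

```python
import re

# Made-up sample input: one bare URL and one quoted URL in an href.
html = 'foo bar http://whatever.com <a href="http://whatever.com">link</a>'

# With the quotes: the lazy .*? must expand until it reaches the closing ".
quoted = re.findall('"((http|ftp)s?://.*?)"', html)

# Without the quotes: nothing follows .*?, so it matches the empty string
# and each match ends right after "://".
unquoted = re.findall('((http|ftp)s?://.*?)', html)

print(quoted)    # [('http://whatever.com', 'http')]
print(unquoted)  # [('http://', 'http'), ('http://', 'http')]
```

The first element of each tuple is the outer group (the URL) and the second is the inner `(http|ftp)` group.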

Note this is a really fragile way of doing things, though, since single quotes are also valid in HTML but wouldn't match the regex.
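As a rough illustration of that fragility, the sketch below uses a made-up `html` string with one single-quoted and one double-quoted attribute. The `either` pattern is my own variation, not something from the repository in question: it captures the opening quote and uses the backreference `\1` to require the same quote at the end. It is still only a sketch and would miss unquoted attribute values, so a real HTML parser is likely the safer choice.

```python
import re

# Made-up sample input with both quoting styles (hypothetical domains).
html = "<a href='http://single.example'>a</a> <a href=\"https://double.example\">b</a>"

# The original pattern only sees the double-quoted URL.
double_only = re.findall('"((http|ftp)s?://.*?)"', html)

# Variation: group 1 captures the opening quote (either kind), and \1
# requires the matching closing quote. (?:...) keeps http|ftp non-capturing.
either = re.findall(r'''(["'])((?:http|ftp)s?://.*?)\1''', html)
urls = [url for _quote, url in either]

print(double_only)  # [('https://double.example', 'http')]
print(urls)         # ['http://single.example', 'https://double.example']
```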

Daniel Roseman