
I was studying an example in a web scraping textbook. The function of the web scraper is to get the external links on a webpage.

I rewrote the function in a simpler form that I can understand, but one line containing a regular expression keeps confusing me. The whole function is written below.

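import re
from urllib.parse import urlparse
from urllib.request import urlopen

from bs4 import BeautifulSoup

url = "http://oreilly.com"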

url_parse = urlparse(url)

external_links = set()

def scrape_external(url):
    html = urlopen(url)
    bsObj = BeautifulSoup(html.read(), "lxml")
    linkParse = url_parse.netloc
    # the line below is the one I need clarified
    externalLinks = bsObj.findAll("a",{"href": re.compile("^(http|www)((?!"+linkParse+").)*$")})
    for i in externalLinks:
        if "href" in i.attrs:
            link = i.attrs['href']
            external_links.add(link)
        
    print(external_links)

scrape_external(url)

From my own understanding, that regular expression means "only match http or www when it's not followed by the home URL". But I need a more in-depth explanation of how the whole thing works and the logic behind it. I know the meaning of the individual symbols, but I have trouble putting the whole thing together, particularly the "*" and "$" symbols.

For instance, why do I need to put the dollar sign at the end, and why does it make so much difference in my results when I remove it?

This is my first question on here and I'm still very new to Python. Thanks.

  • ^ asserts position at start of a line i.e. starts with, * Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy), $ asserts position at the end of a line i.e. ends with. Moving, for example, $ changes the meaning. – QHarr Nov 17 '18 at 15:44
  • That's awful code, you want urljoin to convert links not regex, also your url_parse is out of scope, it doesn't do what you think it does. – pguardiario Nov 17 '18 at 22:38

1 Answer


Here's a regexr link that explains the syntax.

Understanding the Weird Symbols

^ matches the beginning of the string.

$ matches the end of the string.

Using both anchors makes a big difference: the pattern must then match the entire string, instead of just a substring.
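To make the anchoring concrete, here is a minimal sketch using plain re. The sample strings and the two pattern names are made up for illustration and are not taken from the question.

```python
import re

# Hypothetical sample string, chosen only to illustrate anchoring.
text = "see http://example.com today"

unanchored = re.compile(r"http")
anchored = re.compile(r"^http.*$")

# search() scans the whole string, so the unanchored pattern finds a substring.
print(unanchored.search(text).group(0))

# ^ pins the match to position 0, where "see" sits, so nothing is found.
print(anchored.search(text))

# A string that really starts with http is matched from start to end.
print(anchored.search("http://example.com").group(0))
```

The first call finds http in the middle of the string, the second returns None, and the third returns the whole URL.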

(http|www) matches http OR www

(?! ... ) is what's called a negative lookahead. It "specifies a group that can not match after the main expression (if it matches the result is discarded)".

For example, t(?!s) matches the first t in streets but not the second t because the lookahead found an s after that second t.
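The streets example can be checked directly. This is just a small sketch; the variable names are my own.

```python
import re

pattern = re.compile(r"t(?!s)")
word = "streets"

# finditer() reports every position where the pattern matches.
positions = [m.start() for m in pattern.finditer(word)]
print(positions)

# The lookahead only peeks: the character after the t is not part of the match.
print(pattern.search(word).group(0))
```

Only index 1 (the first t) is reported, and the matched text is just t, because a lookahead checks what follows without consuming it.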

. will match any character (alphanumeric, symbols, except the newline '\n').

* will match 0 or more instances of the preceding group, here ((?! ... ).): the negative lookahead followed by any single character.

[I believe] linkParse turns out to be oreilly.com.
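A quick way to check that, assuming the same URL as in the question (the second URL is a made-up variation):

```python
from urllib.parse import urlparse

# netloc is the host part of the URL (plus the port, if one is given).
link_parse = urlparse("http://oreilly.com").netloc
print(link_parse)

# Hypothetical variation: a www prefix is part of netloc too.
print(urlparse("http://www.oreilly.com/path").netloc)
```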

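((?!(oreilly.com)).)* will keep matching one character at a time, as long as oreilly.com does not begin at the current position. Note that the dots in oreilly.com are unescaped, so each one matches any character, not just a literal dot.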
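One detail worth seeing in action: because linkParse is pasted into the pattern as-is, the . in oreilly.com is a wildcard. A sketch of the difference re.escape makes follows; the test URL is hypothetical, and strict is a variation of my own, not the textbook's pattern.

```python
import re

domain = "oreilly.com"

# Pattern built the way the question builds it: the dot is a wildcard.
loose = re.compile("^(http|www)((?!" + domain + ").)*$")

# Same pattern with the domain escaped, so the dot only matches a literal dot.
strict = re.compile("^(http|www)((?!" + re.escape(domain) + ").)*$")

# Hypothetical URL that contains "oreillyXcom" but not "oreilly.com".
url = "http://example.org/oreillyXcom"

print(loose.match(url))            # rejected: the wildcard dot matched "X"
print(strict.match(url).group(0))  # accepted: no literal "oreilly.com" inside
```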

Testing the Regex

So after parsing the regex, looking at the context, and trying it in IDLE, we can observe that the regex matches external links:

>>> import re
>>> r = re.compile("^(http|www)((?!oreilly.com).)*$")
>>> m = r.match('https://www.google.com')
>>> m
<re.Match object; span=(0, 22), match='https://www.google.com'>
>>> m = r.match('https://www.google.com/food')
>>> m
<re.Match object; span=(0, 27), match='https://www.google.com/food'>
>>> m = r.match('https://oreilly.com/tests')
>>> m
>>> type(m)
<class 'NoneType'>
>>> m = r.match('https://oreally.com')
>>> m
<re.Match object; span=(0, 19), match='https://oreally.com'>

The regex will not match any link containing oreilly.com, so it only returns external links. However, it also fails to match external links that merely happen to contain oreilly.com. For example:

>>> m = r.match('https://www.google.com/search?q=memes+oreilly.com')
>>> m
>>> type(m)
<class 'NoneType'>

So one might question the extent to which it matches external links.
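If that limitation matters, one alternative (a different technique from the textbook's regex, along the lines of the urljoin suggestion in the comments) is to compare hosts instead of searching the whole string. This is only a sketch: is_external is a hypothetical helper name, and it treats any different host, including subdomains such as www.oreilly.com, as external.

```python
from urllib.parse import urljoin, urlparse


def is_external(link, base_url):
    """Return True when link, resolved against base_url, points at another host."""
    # urljoin turns relative links such as "/tests" into absolute URLs.
    absolute = urljoin(base_url, link)
    return urlparse(absolute).netloc != urlparse(base_url).netloc


base = "http://oreilly.com"
print(is_external("https://www.google.com/search?q=memes+oreilly.com", base))
print(is_external("https://oreilly.com/tests", base))
print(is_external("/tests", base))
```

Here the Google search link counts as external even though its query string mentions oreilly.com, while the two oreilly.com links do not.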

I'm not sure how BeautifulSoup parses the regex, but I'm guessing that it may be similar.

The Dollar Sign $ at the End

You were also wondering about the dollar sign at the end. Here's an example of an internal link being unintentionally matched when it is left out.

>>> r = re.compile("^(http|www)((?!oreilly.com).)*")
>>> m = r.match('https://oreilly.com/tests')
>>> m
<re.Match object; span=(0, 8), match='https://'>

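Why? Because the regex matched https://: (http|www) matched http, and ((?!oreilly.com).) then matched the four characters s, :, /, / before the lookahead found oreilly.com and stopped the repetition. Remember, * means "match 0 or more instances of [an expression]", so without the dollar sign the regex is free to stop early and report that prefix as a match. Now you see why the dollar sign is crucial: it forces the entire string to be matched.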
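As a side note, with plain re you can get the same "whole string" behaviour without the trailing $ by calling fullmatch() instead of match(). A small sketch reusing the patterns above, with variable names of my own:

```python
import re

internal = "https://oreilly.com/tests"
without_dollar = re.compile("^(http|www)((?!oreilly.com).)*")
with_dollar = re.compile("^(http|www)((?!oreilly.com).)*$")

# match() is satisfied with a prefix, so the internal link slips through.
prefix = without_dollar.match(internal)
print(prefix.group(0), prefix.end())

# With $ the prefix is not enough, so the internal link is rejected.
print(with_dollar.match(internal))

# fullmatch() demands that the whole string matches, even without $.
print(without_dollar.fullmatch(internal))
```

The first print shows the prefix https:// ending at index 8, and the other two print None.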

TrebledJ
  • Thanks for your answer, particularly on the "$" symbol. I will still have to study further how it relates to strings and substrings, but your answer pointed me in the right direction. – Abraham Michael Nov 17 '18 at 16:21
  • Thanks a lot for your further answers – Abraham Michael Nov 17 '18 at 18:21
  • No problem. Sometimes it's less about understanding what the symbols mean and more about what they do in the context of the situation. Glad that you found this useful. – TrebledJ Nov 17 '18 at 18:24