I was studying an example in a web scraping textbook. The purpose of the web scraper is to collect the external links on a webpage.
I rewrote the function in a simpler form that I can understand, but one line containing a regular expression keeps confusing me. The whole function is written below.
import re
from urllib.parse import urlparse
from urllib.request import urlopen

from bs4 import BeautifulSoup

url = "http://oreilly.com"
url_parse = urlparse(url)
external_links = set()

def scrape_external(url):
    html = urlopen(url)
    bsObj = BeautifulSoup(html.read(), "lxml")
    linkParse = url_parse.netloc  # domain of the start page, e.g. "oreilly.com"
    # this is the line I need some clarity on
    externalLinks = bsObj.findAll("a", {"href": re.compile("^(http|www)((?!"+linkParse+").)*$")})
    for i in externalLinks:
        if "href" in i.attrs:
            link = i.attrs['href']
            external_links.add(link)
    print(external_links)

scrape_external(url)
From my own understanding, that regular expression means "only match http or www when it is not followed by the home URL". But I need a more in-depth explanation of how the whole thing works and the logic behind it. I know the meaning of the individual symbols, but I have trouble putting the whole thing together, particularly the "*" and "$" symbols.
For instance, why do I need to put the dollar sign at the end, and why does it make so much difference in my results when I remove it?
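The difference can be reproduced without any network access. The following is only a minimal sketch: the hrefs are made-up examples, "oreilly.com" stands in for url_parse.netloc, and pattern.search() is used on the assumption that this is how BeautifulSoup applies a compiled regex to an attribute value.

```python
import re

# Assumption: "oreilly.com" stands in for url_parse.netloc from the function above.
netloc = "oreilly.com"

# Same pattern as in the question, once with and once without the trailing "$".
with_anchor = re.compile("^(http|www)((?!" + netloc + ").)*$")
without_anchor = re.compile("^(http|www)((?!" + netloc + ").)*")

# Hypothetical href values, invented purely for illustration.
hrefs = [
    "http://example.com/page",     # external link
    "http://oreilly.com/catalog",  # internal link (contains the home domain)
    "/about",                      # relative link
]

def matching(pattern, candidates):
    """Return the candidates in which the pattern finds a match."""
    return [h for h in candidates if pattern.search(h)]

external_only = matching(with_anchor, hrefs)
too_many = matching(without_anchor, hrefs)

print(external_only)  # only the external link
print(too_many)       # the external link AND the internal one

# Without "$" the internal link still matches, but only on its prefix:
print(without_anchor.search("http://oreilly.com/catalog").group(0))
```

In this sketch, "((?!oreilly.com).)*" consumes one character at a time, and only while the text at the current position does not start with the home domain. With "$", that run of characters is required to reach the end of the string, so any href containing the domain fails. Without "$", the run may simply stop just before the domain, so the pattern is satisfied by the "http://" prefix alone and the internal link slips through.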
This is my first question on here and I'm still very new to Python. Thanks.