from bs4 import BeautifulSoup
import urllib.request
import re

def getLinks(url):
    html_page = urllib.request.urlopen(url)
    soup = BeautifulSoup(html_page, "html.parser")
    links = []
    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))
    return links

anchors = getLinks("http://madisonmemorial.org/")
for anchor in anchors:
    happens = urllib.request.urlopen(anchor)
    if happens.getcode() == "404":
        print(happens.getcode())

# Click on links and return responses
countMe = len(anchors)
for anchor in anchors:
    i = getLinks(anchor)
    happens = urllib.request.urlopen(i, timeout = 2)
    if happens.getcode() == "404":
        print(happens.getcode())
    countMe += len(i)
print(countMe)
So I really have no idea what to say when it comes to this... I thought setting up a web scraper would be simple, but it's turning into a real challenge. The second for loop (the first one that iterates with for anchor in anchors) is working just fine and returning the codes; it's the last for loop that is giving me the issue, specifically the line that reads:
happens = urllib.request.urlopen(i, timeout = 2)
Why is the program timing out on the above line, but not on the exact same line in the for loop above? And when it times out, it times out dozens of times.
I've looked at this question, but that doesn't really help because it's about building a networking app; I did get my try-except syntax and logic down from that question, though. I've also looked at this question, but it didn't really help because it wasn't applicable to the issue, and I looked at this SO question that was trying to accomplish something slightly different.