from bs4 import BeautifulSoup
import urllib.request
import re

def getLinks(url):
    html_page = urllib.request.urlopen(url)
    soup = BeautifulSoup(html_page, "html.parser")
    links = []

    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))
    return links

anchors = getLinks("http://madisonmemorial.org/")
for anchor in anchors:
    happens = urllib.request.urlopen(anchor)
    if happens.getcode() == "404":
        print(happens.getcode())
# Click on links and return responses
countMe = len(anchors)
for anchor in anchors:
    i = getLinks(anchor)
    happens = urllib.request.urlopen(i, timeout = 2)
    if happens.getcode() == "404":
        print(happens.getcode())
        countMe += len(i)

print(countMe)

So I really have no idea what to say when it comes to this... I thought setting up a web scraper was going to be simple, but it's turning into a real challenge. The second for loop (the first one that iterates with `for anchor in anchors`) is working just fine and returning the codes; it's the last for loop that is giving me the issue, specifically the line that reads:

happens = urllib.request.urlopen(i, timeout = 2)

Why is the program timing out on that line, but not on the nearly identical line in the loop above? And when it times out, it times out dozens of times.
I've looked at this question, but it doesn't really help because it deals with building a networking app; I did get my try/except syntax and logic down from that question, though. I've also looked at this question, but it didn't really help me because it wasn't applicable to my issue, and I looked at this SO question, which was trying to accomplish something slightly different.

Adam McGurk
  • I suggest using `requests` instead of `urllib`. It is a cleaner implementation and easier to use. When you put `timeout=2` you're only giving the page 2 seconds to fully load. Depending on how fast your connection and the server are, that can quickly become a problem. Try increasing the timeout to 45 or so. – darksky Aug 16 '17 at 01:38
  • @darksky I tried requests and it gave me about 30 lines or so of errors, but I will increase my timeout. Editing to say that's why I went with `urllib`: it was the thing I could get to work – Adam McGurk Aug 16 '17 at 01:38
  • You realise that `i` is a list? – cs95 Aug 16 '17 at 01:45
  • @cᴏʟᴅsᴘᴇᴇᴅ I do, and that's part of the error message that made me set the attribute timeout, but I guess I don't know if setting a timeout is the right way to fix that – Adam McGurk Aug 16 '17 at 01:46
  • Hmm... why don't you iterate over each link in a loop instead? – cs95 Aug 16 '17 at 01:47
  • In plain English, what are you trying to do? It seems you are trying to find the count of dead-links? – darksky Aug 16 '17 at 01:54
  • @darksky yes, I am trying to count every link on the website (that code isn't here, but I've already done that and once I've done this second part, I'm going to re-implement that code) and then I'm trying to find every dead link on the website and count those as well – Adam McGurk Aug 16 '17 at 01:56
  • @cᴏʟᴅsᴘᴇᴇᴅ uh...I thought that's what I was doing in the last loop. Because the way the logic lays out to me is it gets every link on the website and then checks every link for the code, and if the error code is 404, it just prints the error code (I'll implement the counting and the other logging later once I just get past this hump) – Adam McGurk Aug 16 '17 at 01:58
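
Pulling the comment suggestions together, below is a minimal sketch of what the last loop could look like if each URL in the list returned by `getLinks()` is requested one at a time, with a `try`/`except` around `urlopen()`. It reuses `getLinks()` and `anchors` from the question; the 10-second timeout, the separate dead-link counter, and the skip-on-error behaviour are assumptions, not part of the original code. Note that `urlopen()` raises `HTTPError` for a 404 response instead of returning it, so the status check has to happen in the except branch.

import socket
import urllib.error
import urllib.request

# Assumes getLinks() and anchors from the question above.
countMe = len(anchors)
deadLinks = 0

for anchor in anchors:
    sublinks = getLinks(anchor)              # getLinks() returns a *list* of URLs
    countMe += len(sublinks)
    for link in sublinks:                    # request each URL individually
        try:
            urllib.request.urlopen(link, timeout=10)
        except urllib.error.HTTPError as e:  # urlopen raises HTTPError for 4xx/5xx
            if e.code == 404:
                print(404, link)
                deadLinks += 1
        except (urllib.error.URLError, socket.timeout):
            pass                             # unreachable or slow host; skip it

print(countMe, deadLinks)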

1 Answer


The code below will do what you need. Note that you can recursively follow links; you will need to specify how deep you want the recursion to go.

import requests
import re
from bs4 import BeautifulSoup

def getLinks(url):
    response = requests.get(url)
    if response.status_code != 200: return []

    html_page = response.content
    soup = BeautifulSoup(html_page, "html.parser")
    links = []

    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))

    # remove duplicates
    links = list(set(links))

    return links

def count_dead_links(url, recursion_depth=0):
    count = 0

    for link in getLinks(url):
        response = requests.get(link)
        if response.status_code == 404:
            count += 1
        elif recursion_depth > 0:
            # follow working links and count their dead links as well
            count += count_dead_links(link, recursion_depth - 1)

    return count

# returns count of dead links on the page
print(count_dead_links("http://madisonmemorial.org/"))

# returns count of dead links on the page plus all the dead links 
# on all the pages that result after following links that work.
print(count_dead_links("http://madisonmemorial.org/", 1))
darksky
  • So I got two errors when I ran that script: `A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond` and `"Failed to establish a new connection: %s" % e Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond...` These were the same errors I got when I tried to use the requests module. – Adam McGurk Aug 16 '17 at 02:48
  • @AdamMcGurk a very low `timeout` will result in connection errors. Use multithreading if you want to speed up the execution of the script. – t.m.adam Aug 16 '17 at 03:47
  • @AdamMcGurk you will want to wrap `requests.get()` with a `try`-`except` and catch `ConnectionError` and `ReadTimeout`. – darksky Aug 16 '17 at 07:47
  • More importantly, if you go deeper into the link tree than depth=1, you might encounter cycles: page A links to page B, page B links back to page A, and you are stuck in infinite recursion. The solution is to keep a list of pages you've already visited and not follow those links again. – darksky Aug 16 '17 at 07:49
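
Putting the last few comments together, one possible variant of `count_dead_links()` might look like the sketch below. It reuses `getLinks()` from the answer; the function name, the `visited` set, the 45-second timeout, and the choice to skip (rather than count) unreachable hosts are assumptions, not something posted in the answer.

import requests

def count_dead_links_safe(url, recursion_depth=0, visited=None, timeout=45):
    # Variant of count_dead_links() that folds in the comment suggestions:
    # a timeout, a try/except around requests.get(), and a `visited` set
    # so that link cycles (A -> B -> A) are not followed forever.
    if visited is None:
        visited = set()

    count = 0
    for link in getLinks(url):
        if link in visited:
            continue                      # already checked; avoids cycles
        visited.add(link)

        try:
            response = requests.get(link, timeout=timeout)
        except (requests.exceptions.ConnectionError,
                requests.exceptions.ReadTimeout):
            continue                      # unreachable host; not counted as a 404

        if response.status_code == 404:
            count += 1
        elif recursion_depth > 0:
            count += count_dead_links_safe(link, recursion_depth - 1,
                                           visited, timeout)

    return count

print(count_dead_links_safe("http://madisonmemorial.org/", 1))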