urllib2 http status doesn't work with some links

Question

I have been playing around with this module (urllib2) for a while now. Recently I managed to make a simple HTTP status checker that checks the received status code of each URL in a given list and removes the URL if it doesn't return a 200 OK code.

The code is the following one:

import urllib2

for p in urllist:
    req = urllib2.Request(p)
    try:
        resp = urllib2.urlopen(req)
    except urllib2.HTTPError as e:
        # The server answered with an error status code
        if e.code == 404:
            print str(p) + " returns a 404 error (Not Found). This URL will be removed from the list"
            urllist.remove(p)
        elif e.code == 400 or e.code == 401 or e.code == 403:
            print str(p) + " returns a 400 error (Bad Request) or a 401/403 error (Unauthorized/Forbidden). This URL will be removed from the list"
            urllist.remove(p)
        elif e.code == 408:
            print str(p) + " returned a 408 error (Request Timeout). This URL may or may not be available soon, so it will be kept in the list"
        elif e.code == 429:
            print str(p) + " returned a 429 error (Too Many Requests). The script may have reached a request limit; abort and try again later"
        elif 500 <= e.code <= 511:
            print str(p) + " returned a 5xx error (server error). Servers may be unavailable at the moment. Please abort and try again later"
        elif 410 <= e.code <= 451 or e.code > 511:
            print str(p) + " has returned an unspecified HTTP error. This URL will be removed from the list"
            urllist.remove(p)

    except urllib2.URLError as e:
        # No HTTP response at all (DNS failure, refused connection, ...)
        print str(p) + " returned an unspecified error. This URL will be removed from the list"
        urllist.remove(p)
    else:
        # 200
        body = resp.read()
        print str(p) + " returns a 200 status code (OK). This URL exists."

The original code comes from this post

I'm testing this with bit.ly URLs, which are simple and not that tedious to put into a list. Most of them return one HTTP status code or another, as expected, but some of them take about 3 times longer to be accepted or removed by the script. One example is bitly/1da2, which pops up a warning when entered.

I've checked with various generated lists of links, and the only problem this script has is with URLs that show this warning. It tries to get the HTTP status code for around 2 minutes (I haven't timed it) and then jumps to the next URL in the list without removing that link from the list.

I think this can be solved within the URLError part of this script but I'm not sure.
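For reference, here is a minimal sketch of how a long wait like this could be bounded. It assumes the delay is the request hanging, which is not confirmed for the warning-page links. `urlopen` accepts a `timeout` argument. A timeout can surface either wrapped in a `URLError` or as a bare `socket.timeout`, so both are handled. The function name `check_status`, the `opener` parameter, and the 10-second default are hypothetical choices for illustration. The import fallback is only there so the same sketch runs on Python 3.

```python
import socket

try:  # Python 2, as used in the question
    from urllib2 import urlopen, HTTPError, URLError
except ImportError:  # Python 3 location of the same API
    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError


def check_status(url, timeout=10, opener=urlopen):
    """Return a (label, detail) pair for one URL.

    `opener` is a hypothetical hook so the function can be tried
    without touching the network; by default it is urlopen.
    """
    try:
        resp = opener(url, timeout=timeout)
    except HTTPError as e:
        # The server answered, but with a 4xx/5xx status code.
        return ("http_error", e.code)
    except URLError as e:
        # No usable HTTP response; a connect timeout is reported here.
        if isinstance(e.reason, socket.timeout):
            return ("timeout", None)
        return ("url_error", str(e.reason))
    except socket.timeout:
        # The connection opened, but the response did not arrive in time.
        return ("timeout", None)
    else:
        code = resp.getcode()
        resp.close()
        return ("ok", code)
```

With something like this, a URL that hangs would come back as `("timeout", None)` after roughly `timeout` seconds, and the caller can decide whether to keep or drop it.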

It looks like you are trying to modify the list while you are still iterating over it? Also, why not just use [`requests`](http://docs.python-requests.org/en/master/user/quickstart/)? — G_M, Nov 02 '18 at 21:13
Requests clogs the entire thing at the third URL and just times out. Also, isn't removing as I go better than iterating through the whole list just to approve/discard one single URL? — James, Nov 02 '18 at 22:20
https://stackoverflow.com/questions/1207406/how-to-remove-items-from-a-list-while-iterating — G_M, Nov 02 '18 at 22:25
https://stackoverflow.com/questions/44864393/modify-a-list-while-iterating — G_M, Nov 02 '18 at 22:25
https://stackoverflow.com/questions/10812272/modifying-a-list-while-iterating-over-it-why-not — G_M, Nov 02 '18 at 22:25
Just changed the code to add the approved values to a new list instead of discarding the bad ones from the list that the code is iterating over. Thanks for the links anyway, I'll keep them in case I need to do this another time. — James, Nov 03 '18 at 11:55
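The point raised in these comments can be illustrated with a small sketch. Removing items from a list while a `for` loop walks it shifts the remaining items left, so the element right after each removed one is never visited. Building a new list avoids that. The helper names and the `is_ok` predicate below are hypothetical stand-ins for the status check, not code from the question.

```python
def remove_in_place(urls, is_ok):
    """Pattern from the question: remove while iterating (skips items)."""
    for url in urls:
        if not is_ok(url):
            urls.remove(url)  # shifts later items; the next one is skipped
    return urls


def keep_matching(urls, is_ok):
    """Alternative: collect approved URLs into a new list."""
    approved = []
    for url in urls:  # the input list is never modified
        if is_ok(url):
            approved.append(url)
    return approved


# Stand-in predicate for illustration only.
is_ok = lambda u: u.startswith("good")

print(remove_in_place(["bad1", "bad2", "good"], is_ok))  # ['bad2', 'good']
print(keep_matching(["bad1", "bad2", "good"], is_ok))    # ['good']
```

In the first call, "bad2" survives because it moved into the slot the loop had already passed.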
