
Is it possible to have Selenium crawl a TLD and incrementally export a list of any 404s found?

I'm stuck on a Windows machine for a few hours and want to run some tests before heading back to the comfort of *nix...

ljs.dev
  • How are you running your tests? I do something similar with NUnit by exporting test data to a SQL Server database as it occurs, but if you're not into Windows/MS/.NET I could only give you this as a conceptual answer. – Izzy Jun 14 '13 at 08:04
  • Tests are running via Python, building on the unittest library. It does execute WebDriver tests on a Windows box and can leverage a DB for exporting test data. Please post your solution as an answer; so long as it manages to crawl a site and flag 404s, that would fit the bill. – ljs.dev Jun 14 '13 at 23:30
  • Just did a quick search, would **[this](https://github.com/cmwslw/selenium-crawler)** work for you? Sounds like it will go through and get a list for you, and you can write a little code around getting the 404s. It does require the links to be exposed, though. Are you looking for something more like `wget -r`? – kgdesouz Jun 16 '13 at 03:15
  • Thanks, `wget -r` may suffice, though selenium-crawler would better suit the requirement of using Selenium. I'm not able to find an example of using it, but as you said, writing some code to handle the 404s would work. – ljs.dev Jun 16 '13 at 03:39
  • Selenium doesn't seem like the best fit for this. Is the site running off IIS? – Arran Jun 16 '13 at 09:56
  • This is a duplicate of http://stackoverflow.com/questions/6509628/webdriver-get-http-response-code. Selenium does not support HTTP response codes. Also, it's probably easier, safer and faster to do this with urllib2 or httplib2, unless you really need Selenium for a specific purpose, of course. – Ruben Jun 20 '13 at 05:40
  • Another alternative: http://home.snafu.de/tilman/xenulink.html – Squiggs. Jun 20 '13 at 15:46

1 Answer


I don't know Python very well, nor any of its commonly used libraries, but I'd probably do something like this (using C# code for the example, but the concept should apply):

// WARNING! Untested code here. May not completely work, and
// is not guaranteed to even compile.
using System.Collections.Generic;
using System.Collections.ObjectModel;
using System.Net;
using OpenQA.Selenium;

// Assume "driver" is a validly instantiated WebDriver instance
// (browser used is irrelevant). This API is driver.get() in Python,
// I think.
driver.Url = "http://my.top.level.domain/";

// Get all the links on the page and loop through them,
// grabbing the href attribute of each link along the way.
// (Python would be driver.find_elements_by_tag_name)
List<string> linkUrls = new List<string>();
ReadOnlyCollection<IWebElement> links = driver.FindElements(By.TagName("a"));
foreach(IWebElement link in links)
{
    // Nice side effect of getting the href attribute using GetAttribute()
    // is that it returns the full URL, not relative ones.
    linkUrls.Add(link.GetAttribute("href"));
}

// Now that we have all of the link hrefs, we can test to
// see if they're valid.
List<string> validUrls = new List<string>();
List<string> invalidUrls = new List<string>();
foreach(string linkUrl in linkUrls)
{
    HttpWebRequest request = WebRequest.Create(linkUrl) as HttpWebRequest;
    request.Method = "GET";

    // For actual .NET code, you'd want to wrap this in a try-catch:
    // GetResponse() throws a WebException for 4xx/5xx responses, and the
    // exception's Response property holds the HttpWebResponse (and thus
    // the 404 status code). For Python, you would use whatever HTTP
    // request library is common.

    // Note also that this is an extremely naive algorithm for determining
    // validity. You could just as easily check specifically for the
    // NotFound (404) status code.
    HttpWebResponse response = request.GetResponse() as HttpWebResponse;
    if (response.StatusCode == HttpStatusCode.OK)
    {
        validUrls.Add(linkUrl);
    }
    else
    {
        invalidUrls.Add(linkUrl);
    }
}

foreach(string invalidUrl in invalidUrls)
{
    // Here is where you'd log out your invalid URLs
}

At this point, you have a list of valid and invalid URLs. You could wrap this all up into a method that you could pass your TLD URL into, and call it recursively with each of the valid URLs. The key bit here is that you're not using Selenium to actually determine the validity of the links. And you wouldn't want to "click" on the links to navigate to the next page, if you're truly doing a recursive crawl. Rather, you'd want to navigate directly to the links found on the page.
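
For the Python/unittest setup mentioned in the question, a rough equivalent might look like the following. This is an untested sketch: it assumes the `requests` library for the HTTP checks (urllib2 or httplib2, as suggested in the comments, would work just as well), the Selenium Python bindings for the element lookups, and an already-created `driver`.

```python
# Untested sketch: the same idea as the C# above, in Python.
# Assumes "driver" is an already-instantiated WebDriver (e.g. webdriver.Firefox()).
import requests

def check_links(driver, url):
    """Load `url`, collect every <a href>, and return (valid, invalid) lists."""
    driver.get(url)

    # get_attribute("href") returns the absolute URL, not a relative one.
    link_urls = [a.get_attribute("href")
                 for a in driver.find_elements_by_tag_name("a")
                 if a.get_attribute("href")]

    valid_urls, invalid_urls = [], []
    for link_url in link_urls:
        try:
            # The validity check itself does not need Selenium at all.
            status = requests.get(link_url, timeout=10).status_code
        except requests.RequestException:
            invalid_urls.append(link_url)
            continue
        (valid_urls if status == 200 else invalid_urls).append(link_url)

    return valid_urls, invalid_urls

valid, invalid = check_links(driver, "http://my.top.level.domain/")
for bad in invalid:
    print(bad)  # or append to a CSV / database for incremental export
```

From there, recursing is just a matter of calling `check_links` again on each entry in `valid` that belongs to your domain, keeping a set of URLs you've already visited so you don't loop.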

There are other approaches you might take, like running everything through a proxy, and capturing the response codes that way. It depends a little on how you expect to structure your solution.
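
As one hedged illustration of the proxy idea, the BrowserMob Proxy Python bindings can route the browser's traffic through a local proxy, capture a HAR, and let you read the status codes from it. A rough sketch, assuming the `browsermob-proxy` package, Firefox, and a BrowserMob Proxy binary at a placeholder path:

```python
# Rough sketch of the proxy approach; assumes the browsermob-proxy
# Python bindings and a BrowserMob Proxy install at the placeholder path.
from browsermobproxy import Server
from selenium import webdriver

server = Server("/path/to/browsermob-proxy")  # placeholder path
server.start()
proxy = server.create_proxy()

# Point the browser at the capturing proxy.
profile = webdriver.FirefoxProfile()
profile.set_proxy(proxy.selenium_proxy())
driver = webdriver.Firefox(firefox_profile=profile)

proxy.new_har("crawl")
driver.get("http://my.top.level.domain/")

# Every request the browser made, with its HTTP status, is in the HAR.
for entry in proxy.har["log"]["entries"]:
    if entry["response"]["status"] == 404:
        print(entry["request"]["url"])

driver.quit()
server.stop()
```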

JimEvans