72

I came across an interview question, "If you were designing a web crawler, how would you avoid getting into infinite loops?", and I am trying to answer it.

How does it all begin? Say Google started with some hub pages, hundreds of them (how these hub pages were found in the first place is a different sub-question). As Google follows links from a page and so on, does it keep a hash table to make sure it doesn't revisit pages it has already crawled?

What if the same page has two names (URLs), say now that we have URL shorteners and the like?

I have taken Google as an example. Google doesn't reveal how its web crawler algorithms and page ranking work, but any guesses?

xyz

10 Answers

84

If you want a detailed answer, take a look at section 3.8 of this paper, which describes the URL-seen test of a modern scraper:

In the course of extracting links, any Web crawler will encounter multiple links to the same document. To avoid downloading and processing a document multiple times, a URL-seen test must be performed on each extracted link before adding it to the URL frontier. (An alternative design would be to instead perform the URL-seen test when the URL is removed from the frontier, but this approach would result in a much larger frontier.)

To perform the URL-seen test, we store all of the URLs seen by Mercator in canonical form in a large table called the URL set. Again, there are too many entries for them all to fit in memory, so like the document fingerprint set, the URL set is stored mostly on disk.

To save space, we do not store the textual representation of each URL in the URL set, but rather a fixed-sized checksum. Unlike the fingerprints presented to the content-seen test’s document fingerprint set, the stream of URLs tested against the URL set has a non-trivial amount of locality. To reduce the number of operations on the backing disk file, we therefore keep an in-memory cache of popular URLs. The intuition for this cache is that links to some URLs are quite common, so caching the popular ones in memory will lead to a high in-memory hit rate.

In fact, using an in-memory cache of 2^18 entries and the LRU-like clock replacement policy, we achieve an overall hit rate on the in-memory cache of 66.2%, and a hit rate of 9.5% on the table of recently-added URLs, for a net hit rate of 75.7%. Moreover, of the 24.3% of requests that miss in both the cache of popular URLs and the table of recently-added URLs, about 1/3 produce hits on the buffer in our random access file implementation, which also resides in user-space. The net result of all this buffering is that each membership test we perform on the URL set results in an average of 0.16 seek and 0.17 read kernel calls (some fraction of which are served out of the kernel’s file system buffers). So, each URL set membership test induces one-sixth as many kernel calls as a membership test on the document fingerprint set. These savings are purely due to the amount of URL locality (i.e., repetition of popular URLs) inherent in the stream of URLs encountered during a crawl.

Basically, they hash all of the URLs with a hashing function that makes collisions between distinct URLs extremely unlikely, and thanks to the locality of URLs it becomes very cheap to check whether a URL has already been seen. Google even open-sourced their hashing function: CityHash
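
Here is a minimal sketch of such a URL-seen test (a simplification, not Mercator's actual implementation: SHA-1 stands in for the fixed-size checksum, a plain LRU dict for the clock-style cache, and an ordinary set for the disk-resident URL set):

import hashlib
from collections import OrderedDict
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url: str) -> str:
    """Very simplified canonical form: lowercase scheme/host, drop the fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

class UrlSeenTest:
    def __init__(self, cache_size=2**18):
        self.cache = OrderedDict()   # in-memory LRU cache of popular URL checksums
        self.cache_size = cache_size
        self.url_set = set()         # stand-in for the mostly-on-disk URL set

    def seen(self, url: str) -> bool:
        checksum = hashlib.sha1(canonicalize(url).encode()).digest()  # fixed-size checksum
        if checksum in self.cache:   # cache hit: a popular URL
            self.cache.move_to_end(checksum)
            return True
        hit = checksum in self.url_set   # cache miss: consult the (disk) URL set
        if not hit:
            self.url_set.add(checksum)
        self.cache[checksum] = None      # remember it in the cache either way
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)   # evict the least-recently-used entry
        return hit

A URL would only be added to the frontier when seen(url) returns False.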

WARNING!
They might also be talking about bot traps!!! A bot trap is a section of a page that keeps generating new links with unique URLs and you will essentially get trapped in an "infinite loop" by following the links that are being served by that page. This is not exactly a loop, because a loop would be the result of visiting the same URL, but it's an infinite chain of URLs which you should avoid crawling.

Update 12/13/2012 - the day after the world was supposed to end :)

Per Fr0zenFyr's comment: if one uses the AOPIC algorithm for selecting pages, then it's fairly easy to avoid bot-traps of the infinite loop kind. Here is a summary of how AOPIC works:

  1. Get a set of N seed pages.
  2. Allocate X amount of credit to each page, such that each page has X/N credit (i.e. equal amount of credit) before crawling has started.
  3. Select the page P with the highest amount of credit (or, if all pages have the same amount of credit, crawl a random page).
  4. Crawl page P (let's say that P had 100 credits when it was crawled).
  5. Extract all the links from page P (let's say there are 10 of them).
  6. Set the credits of P to 0.
  7. Take a 10% "tax" and allocate it to a Lambda page.
  8. Allocate an equal share of credit to each link found on page P, out of P's original credit minus the tax: (100 (P's credits) - 10 (the 10% tax)) / 10 (links) = 9 credits per link.
  9. Repeat from step 3.

Since the Lambda page continuously collects tax, eventually it will be the page with the largest amount of credit and we'll have to "crawl" it. I say "crawl" in quotes, because we don't actually make an HTTP request for the Lambda page, we just take its credits and distribute them equally to all of the pages in our database.

Since bot traps only give credit to their internal links and rarely get credit from the outside, they will continually leak credit (through taxation) to the Lambda page. The Lambda page distributes those credits evenly across all of the pages in the database, and with each cycle the bot-trap page loses more and more credit, until it has so little that it almost never gets crawled again. This will not happen with good pages, because they often get credit from backlinks found on other pages. This also results in a dynamic page rank, and you will notice that any time you take a snapshot of your database and order the pages by the amount of credit they have, they will most likely be ordered roughly according to their true page rank.
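
Here is a rough sketch of that credit bookkeeping (my own simplification, not the AOPIC paper's implementation; `fetch_links` is a hypothetical function that downloads a page and returns the URLs it links to):

LAMBDA = "__lambda__"   # placeholder "page" that accumulates the tax
TAX_RATE = 0.10

def aopic_crawl(seeds, steps, fetch_links):
    total = 100.0
    credits = {url: total / len(seeds) for url in seeds}   # steps 1-2: equal credit
    credits[LAMBDA] = 0.0
    for _ in range(steps):
        page = max(credits, key=credits.get)               # step 3: richest page
        budget = credits[page]
        credits[page] = 0.0                                # step 6
        if page == LAMBDA:
            # "crawl" the Lambda page: no HTTP request, just redistribute its credit evenly
            share = budget / (len(credits) - 1)
            for url in credits:
                if url != LAMBDA:
                    credits[url] += share
            continue
        links = fetch_links(page)                          # steps 4-5: crawl and extract links
        tax = budget * TAX_RATE
        credits[LAMBDA] += tax                             # step 7: pay the tax
        if links:
            share = (budget - tax) / len(links)            # step 8: split the remainder
            for url in links:
                credits[url] = credits.get(url, 0.0) + share
        # (in this sketch, a page with no out-links simply drops the remainder)
    return credits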

This only avoids bot traps of the infinite-loop kind, but there are many other bot traps you should watch out for, and there are ways to get around them too.

Kiril
  • Excellent explanation. I had the same question in mind about loops (was answered above) and bot traps (still searching for a nice way to get around). I'd have given an additional +1 for CityHash, if SO allowed me. Cheers ;) – Fr0zenFyr Dec 14 '12 at 04:47
  • @Fr0zenFyr You don't have to worry about bot traps of the infinite-loop kind, especially if you use the [AOPIC](http://www2003.org/cdrom/papers/refereed/p007/p7-abiteboul.html) algorithm for selecting URLs to crawl. I'll update my answer with a bit more detail. – Kiril Dec 14 '12 at 05:57
  • @Fr0zenFyr So the best way to avoid bot traps is to crawl politely, otherwise you'll have to take a look at [all the ways you can get trapped](http://stackoverflow.com/questions/8404775/how-to-identify-web-crawler/8405803#8405803) and work around them. I.e. you basically have to implement a browser, use proxies, and imitate multiple browsers by switching user-agents (in accordance with the [browser usage statistics](http://gs.statcounter.com/)) – Kiril Dec 14 '12 at 06:30
  • My current model completely follows robots.txt, no-follow etc. and doesn't do aggressive crawling. Thanks for the update on your post, I'll try your suggestion on AOPIC. By the way, the Mayan calendar judgement day is 21 Dec 2012 [rolling eyes].. ;) – Fr0zenFyr Dec 14 '12 at 07:04
  • @Fr0zenFyr ROFL, that's how much I follow the end of the world stuff I guess :) – Kiril Dec 14 '12 at 16:02
  • Dumb question maybe, but what is the Lambda page that tax is allocated to? It seems to be introduced without explanation. – Casper Jan 28 '14 at 19:20
  • @Casper It's just a placeholder to accumulate tax credits. It doesn't represent a real page; think of it as a sentinel in the data structure that causes all of the tax accumulated credit to be redistributed over the other "real" pages when it is crawled. – Filipe Gonçalves Jul 26 '15 at 18:28
  • @Kiril Isn't distributing the tax place holder's credits to all the urls collected difficult to scale? Suddenly you need to make billions of updates! How would you scale it? – raju Jun 29 '20 at 23:20
  • @raju that doesn't happen on every cycle, it only happens once you "crawl" the Lambda. "Crawling" the Lambda shouldn't happen very often and you can do it asynchronously. It doesn't need to happen in real-time, it just needs to happen eventually. – Kiril Jun 30 '20 at 16:40
7

While everybody here has already suggested how to create your web crawler, here is how Google ranks pages.

Google gives each page a rank based on the number of backlinks (how many links on other websites point to a specific website/page). This is called the relevance score. It is based on the idea that if many other pages link to a page, it's probably an important page.

Each site/page is viewed as a node in a graph. Links to other pages are directed edges. The in-degree of a vertex is the number of incoming edges. Nodes with a higher number of incoming edges are ranked higher.

Here's how the PageRank is determined. Suppose that page Pj has Lj links. If one of those links is to page Pi, then Pj will pass on 1/Lj of its importance to Pi. The importance ranking of Pi is then the sum of all the contributions made by pages linking to it. So if we denote the set of pages linking to Pi by Bi, then we have this formula:

Importance(Pi) = sum( Importance(Pj)/Lj ) over all pages Pj in Bi (the pages that link to Pi)

The ranks are placed in a matrix called hyperlink matrix: H[i,j]

An entry H[i,j] in this matrix is 1/Lj if page Pj links to page Pi, and 0 otherwise. Another property of this matrix is that if we sum all the entries in a column we get 1.

Now we need to find an eigenvector of this matrix, call it I (with eigenvalue 1), such that:

I = H*I

Now we start iterating: H·I, H·(H·I) = H²·I, H³·I, ..., H^k·I, until the solution converges, i.e. we get pretty much the same vector at step k and step k+1 (this is the power iteration method).

Now whatever is left in the I vector is the importance of each page.

For a simple class homework example see http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture3/lecture3.html
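
As an illustration, here is a toy power-iteration sketch of the idea above (deliberately simplified: no damping factor and no handling of dangling nodes, so it is only meant to show the mechanics):

def pagerank(links, iterations=50):
    """links: dict mapping each page to the list of pages it links to.
    Every page that appears as a link target should also appear as a key."""
    pages = list(links)
    importance = {p: 1.0 / len(pages) for p in pages}   # start with equal importance
    for _ in range(iterations):
        nxt = {p: 0.0 for p in pages}
        for p, outgoing in links.items():
            if outgoing:
                share = importance[p] / len(outgoing)   # Pj passes 1/Lj of its importance
                for q in outgoing:
                    nxt[q] += share
        importance = nxt                                # I_{k+1} = H * I_k
    return importance

# tiny example: B and C both link to A, A links to B
print(pagerank({"A": ["B"], "B": ["A"], "C": ["A"]}))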

As for solving the duplicate issue in your interview question, do a checksum on the entire page and use either that or a hash of the checksum as your key in a map to keep track of visited pages.
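
For example (a rough sketch; SHA-256 is just one reasonable choice of checksum):

import hashlib

visited_by_content = {}   # checksum -> first URL seen with that content

def already_seen(url, page_html):
    key = hashlib.sha256(page_html.encode("utf-8")).hexdigest()  # checksum of the whole page
    if key in visited_by_content:
        return True            # the same content was already crawled under another URL
    visited_by_content[key] = url
    return False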

Adrian
  • Checksum could be different if the page spits out dynamic content. – edocetirwi Apr 23 '13 at 22:49
  • @edocetirwi good point, I guess you'd have to look for something else or combine it with the URL in some meaningful way – Adrian Apr 24 '13 at 13:32
  • oh, so you just integrate over the `hyperlink matrix`, which has the dimensions `every-webpage-on-the-internet` x `every-webpage-on-the-internet`. Easy?!? How does one do that exactly (given it's a _very_ sparse matrix)? – CpILL Feb 18 '18 at 20:12
  • @CpILL you're 7 years late, but there are smart ways to multiply large matrices without blowing up; If you want a production ready solution I am willing to accept payment – Adrian Feb 27 '18 at 00:05
  • @Adrian I'm sure you are... but I've noticed it is mostly developers on Stackoverflow and we like to do it ourselves, that is why we're here! :D – CpILL May 04 '18 at 04:07
1

Depends on how deep their question was intended to be. If they were just trying to avoid following the same links back and forth, then hashing the URLs would be sufficient.

What about content that has literally thousands of URLs that lead to the same content? For example, a query-string parameter that doesn't affect anything but can have an infinite number of variations. I suppose you could hash the contents of the page as well and compare URLs to see if they are similar, to catch content that is identified by multiple URLs. See, for example, the bot traps mentioned in @Lirik's post.
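
One partial mitigation is to normalize URLs before hashing them, e.g. dropping query parameters that are known not to change the content. A rough sketch (the parameter names below are assumptions, not a definitive list):

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

IGNORED_PARAMS = {"sessionid", "utm_source", "utm_medium", "utm_campaign", "ref"}  # assumed irrelevant

def normalize(url: str) -> str:
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in IGNORED_PARAMS]
    kept.sort()                                  # parameter order should not matter
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

# both of these normalize to the same URL, so they hash to the same key
print(normalize("http://example.com/page?id=7&utm_source=feed"))
print(normalize("http://example.com/page?id=7"))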

mellamokb
  • This takes me to another question I have had: how do we hash the whole content of a page? Such pages are, say, at least two pages long. What kind of hash functions are able to hash two pages to a single value? What is the typical size of such a hash output? – xyz Apr 30 '11 at 10:02
0

The crawler keeps a URL pool that contains all the URLs to be crawled. To avoid an "infinite loop", the basic idea is to check the existence of each URL before adding it to the pool.

However, this is not easy to implement when the system has scaled to a certain level. The naive approach is to keep all the URLs in a hashset and check the existence of each new URL there. This won't work when there are too many URLs to fit into memory.

There are a couple of solutions here. For instance, instead of storing all the URLs in memory, we should keep them on disk. To save space, a URL hash should be used instead of the raw URL. It's also worth noting that we should keep the canonical form of the URL rather than the original one. So if the URL is shortened by services like bit.ly, it's better to get the final URL. To speed up the checking process, a cache layer can be built. Or you can see it as a distributed cache system, which is a separate topic.
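
For example, resolving a shortened URL to its final target before hashing it could look roughly like this (a sketch only; a production crawler would also handle redirect loops, timeouts and error codes more carefully):

import hashlib
import urllib.request

def canonical_key(url: str) -> str:
    """Follow redirects (e.g. bit.ly) and hash the final URL instead of the original."""
    with urllib.request.urlopen(url, timeout=10) as response:
        final_url = response.geturl()            # URL after all redirects
    return hashlib.md5(final_url.encode("utf-8")).hexdigest()

seen = set()   # in practice this would be disk-backed, with an in-memory cache in front

def should_crawl(url: str) -> bool:
    key = canonical_key(url)
    if key in seen:
        return False
    seen.add(key)
    return True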

The post Build a Web Crawler has a detailed analysis of this problem.

Mark
0

I also needed a crawler and couldn't find one that fit my requirements, so I developed a basic crawler library that implements the simple requirements while covering almost all the principles of a crawler. You can check the DotnetCrawler GitHub repo, which implements Downloader-Processor-Pipeline modules with a default implementation that uses Entity Framework Core to save data into SQL Server.

https://github.com/mehmetozkaya/DotnetCrawler

0

A web crawler is a computer program used to collect key values (HREF links, image links, metadata, etc.) from a given website URL. It is designed to follow the HREF links it has already fetched from previous URLs, so the crawler can jump from one website to another. It is usually called a web spider or web bot. This mechanism acts as the backbone of a web search engine.

Please find the full source code on my tech blog: http://www.algonuts.info/how-to-built-a-simple-web-crawler-in-php.html

<?php
class webCrawler
{
    public $siteURL;
    public $error;
    public $linkBuffer;   // collected links, grouped by type

    function __construct()
    {
        $this->siteURL = "";
        $this->error = "";
        // initialize the result buckets so the in_array() checks below work on empty arrays
        $this->linkBuffer = array("domain" => array(), "image" => array(), "document" => array(), "unknown" => array());
    }

    function parser()   
    {
        // Note: these globals (tag state machines, extension lists and cURL options)
        // are defined in the full source linked above; they are omitted from this snippet.
        global $hrefTag,$hrefTagCountStart,$hrefTagCountFinal,$hrefTagLengthStart,$hrefTagLengthFinal,$hrefTagPointer;
        global $imgTag,$imgTagCountStart,$imgTagCountFinal,$imgTagLengthStart,$imgTagLengthFinal,$imgTagPointer;
        global $Url_Extensions,$Document_Extensions,$Image_Extensions,$crawlOptions;

        $dotCount = 0;
        $slashCount = 0;
        $singleSlashCount = 0;
        $doubleSlashCount = 0;
        $parentDirectoryCount = 0;

        $linkBuffer = array();

        if(($url = trim($this->siteURL)) != "")
        {
            $crawlURL = rtrim($url,"/");
            if(($directoryURL = dirname($crawlURL)) == "http:")
            {   $directoryURL = $crawlURL;  }
            $urlParser = preg_split("/\//",$crawlURL);

            //-- Curl Start --
            $curlObject = curl_init($crawlURL);
            curl_setopt_array($curlObject,$crawlOptions);
            $webPageContent = curl_exec($curlObject);
            $errorNumber = curl_errno($curlObject);
            curl_close($curlObject);
            //-- Curl End --

            if($errorNumber == 0)
            {
                $webPageCounter = 0;
                $webPageLength = strlen($webPageContent);
                while($webPageCounter < $webPageLength)
                {
                    $character = $webPageContent[$webPageCounter];
                    if($character == "")
                    {   
                        $webPageCounter++;  
                        continue;
                    }
                    $character = strtolower($character);
                    //-- Href Filter Start --
                    if($hrefTagPointer[$hrefTagLengthStart] == $character)
                    {
                        $hrefTagLengthStart++;
                        if($hrefTagLengthStart == $hrefTagLengthFinal)
                        {
                            $hrefTagCountStart++;
                            if($hrefTagCountStart == $hrefTagCountFinal)
                            {
                                if($hrefURL != "")
                                {
                                    if($parentDirectoryCount >= 1 || $singleSlashCount >= 1 || $doubleSlashCount >= 1)
                                    {
                                        if($doubleSlashCount >= 1)
                                        {   $hrefURL = "http://".$hrefURL;  }
                                        else if($parentDirectoryCount >= 1)
                                        {
                                            $tempData = 0;
                                            $tempString = "";
                                            $tempTotal = count($urlParser) - $parentDirectoryCount;
                                            while($tempData < $tempTotal)
                                            {
                                                $tempString .= $urlParser[$tempData]."/";
                                                $tempData++;
                                            }
                                            $hrefURL = $tempString."".$hrefURL;
                                        }
                                        else if($singleSlashCount >= 1)
                                        {   $hrefURL = $urlParser[0]."/".$urlParser[1]."/".$urlParser[2]."/".$hrefURL;  }
                                    }
                                    $host = "";
                                    $hrefURL = urldecode($hrefURL);
                                    $hrefURL = rtrim($hrefURL,"/");
                                    if(filter_var($hrefURL,FILTER_VALIDATE_URL) == true)
                                    {   
                                        $dump = parse_url($hrefURL);
                                        if(isset($dump["host"]))
                                        {   $host = trim(strtolower($dump["host"]));    }
                                    }
                                    else
                                    {
                                        $hrefURL = $directoryURL."/".$hrefURL;
                                        if(filter_var($hrefURL,FILTER_VALIDATE_URL) == true)
                                        {   
                                            $dump = parse_url($hrefURL);    
                                            if(isset($dump["host"]))
                                            {   $host = trim(strtolower($dump["host"]));    }
                                        }
                                    }
                                    if($host != "")
                                    {
                                        $extension = pathinfo($hrefURL,PATHINFO_EXTENSION);
                                        if($extension != "")
                                        {
                                            $tempBuffer ="";
                                            $extensionlength = strlen($extension);
                                            for($tempData = 0; $tempData < $extensionlength; $tempData++)
                                            {
                                                if($extension[$tempData] != "?")
                                                {   
                                                    $tempBuffer = $tempBuffer.$extension[$tempData];
                                                    continue;
                                                }
                                                else
                                                {
                                                    $extension = trim($tempBuffer);
                                                    break;
                                                }
                                            }
                                            if(in_array($extension,$Url_Extensions))
                                            {   $type = "domain";   }
                                            else if(in_array($extension,$Image_Extensions))
                                            {   $type = "image";    }
                                            else if(in_array($extension,$Document_Extensions))
                                            {   $type = "document"; }
                                            else
                                            {   $type = "unknown";  }
                                        }
                                        else
                                        {   $type = "domain";   }

                                        if($hrefURL != "")
                                        {
                                            if($type == "domain" && !in_array($hrefURL,$this->linkBuffer["domain"]))
                                            {   $this->linkBuffer["domain"][] = $hrefURL;   }
                                            if($type == "image" && !in_array($hrefURL,$this->linkBuffer["image"]))
                                            {   $this->linkBuffer["image"][] = $hrefURL;    }
                                            if($type == "document" && !in_array($hrefURL,$this->linkBuffer["document"]))
                                            {   $this->linkBuffer["document"][] = $hrefURL; }
                                            if($type == "unknown" && !in_array($hrefURL,$this->linkBuffer["unknown"]))
                                            {   $this->linkBuffer["unknown"][] = $hrefURL;  }
                                        }
                                    }
                                }
                                $hrefTagCountStart = 0;
                            }
                            if($hrefTagCountStart == 3)
                            {
                                $hrefURL = "";
                                $dotCount = 0;
                                $slashCount = 0;
                                $singleSlashCount = 0;
                                $doubleSlashCount = 0;
                                $parentDirectoryCount = 0;
                                $webPageCounter++;
                                while($webPageCounter < $webPageLength)
                                {
                                    $character = $webPageContent[$webPageCounter];
                                    if($character == "")
                                    {   
                                        $webPageCounter++;  
                                        continue;
                                    }
                                    if($character == "\"" || $character == "'")
                                    {
                                        $webPageCounter++;
                                        while($webPageCounter < $webPageLength)
                                        {
                                            $character = $webPageContent[$webPageCounter];
                                            if($character == "")
                                            {   
                                                $webPageCounter++;  
                                                continue;
                                            }
                                            if($character == "\"" || $character == "'" || $character == "#")
                                            {   
                                                $webPageCounter--;  
                                                break;  
                                            }
                                            else if($hrefURL != "")
                                            {   $hrefURL .= $character; }
                                            else if($character == "." || $character == "/")
                                            {
                                                if($character == ".")
                                                {
                                                    $dotCount++;
                                                    $slashCount = 0;
                                                }
                                                else if($character == "/")
                                                {
                                                    $slashCount++;
                                                    if($dotCount == 2 && $slashCount == 1)
                                                    $parentDirectoryCount++;
                                                    else if($dotCount == 0 && $slashCount == 1)
                                                    $singleSlashCount++;
                                                    else if($dotCount == 0 && $slashCount == 2)
                                                    $doubleSlashCount++;
                                                    $dotCount = 0;
                                                }
                                            }
                                            else
                                            {   $hrefURL .= $character; }
                                            $webPageCounter++;
                                        }
                                        break;
                                    }
                                    $webPageCounter++;
                                }
                            }
                            $hrefTagLengthStart = 0;
                            $hrefTagLengthFinal = strlen($hrefTag[$hrefTagCountStart]);
                            $hrefTagPointer =& $hrefTag[$hrefTagCountStart];
                        }
                    }
                    else
                    {   $hrefTagLengthStart = 0;    }
                    //-- Href Filter End --
                    //-- Image Filter Start --
                    if($imgTagPointer[$imgTagLengthStart] == $character)
                    {
                        $imgTagLengthStart++;
                        if($imgTagLengthStart == $imgTagLengthFinal)
                        {
                            $imgTagCountStart++;
                            if($imgTagCountStart == $imgTagCountFinal)
                            {
                                if($imgURL != "")
                                {
                                    if($parentDirectoryCount >= 1 || $singleSlashCount >= 1 || $doubleSlashCount >= 1)
                                    {
                                        if($doubleSlashCount >= 1)
                                        {   $imgURL = "http://".$imgURL;    }
                                        else if($parentDirectoryCount >= 1)
                                        {
                                            $tempData = 0;
                                            $tempString = "";
                                            $tempTotal = count($urlParser) - $parentDirectoryCount;
                                            while($tempData < $tempTotal)
                                            {
                                                $tempString .= $urlParser[$tempData]."/";
                                                $tempData++;
                                            }
                                            $imgURL = $tempString."".$imgURL;
                                        }
                                        else if($singleSlashCount >= 1)
                                        {   $imgURL = $urlParser[0]."/".$urlParser[1]."/".$urlParser[2]."/".$imgURL;    }
                                    }
                                    $host = "";
                                    $imgURL = urldecode($imgURL);
                                    $imgURL = rtrim($imgURL,"/");
                                    if(filter_var($imgURL,FILTER_VALIDATE_URL) == true)
                                    {   
                                        $dump = parse_url($imgURL); 
                                        $host = trim(strtolower($dump["host"]));
                                    }
                                    else
                                    {
                                        $imgURL = $directoryURL."/".$imgURL;
                                        if(filter_var($imgURL,FILTER_VALIDATE_URL) == true)
                                        {   
                                            $dump = parse_url($imgURL); 
                                            $host = trim(strtolower($dump["host"]));
                                        }   
                                    }
                                    if($host != "")
                                    {
                                        $extension = pathinfo($imgURL,PATHINFO_EXTENSION);
                                        if($extension != "")
                                        {
                                            $tempBuffer ="";
                                            $extensionlength = strlen($extension);
                                            for($tempData = 0; $tempData < $extensionlength; $tempData++)
                                            {
                                                if($extension[$tempData] != "?")
                                                {   
                                                    $tempBuffer = $tempBuffer.$extension[$tempData];
                                                    continue;
                                                }
                                                else
                                                {
                                                    $extension = trim($tempBuffer);
                                                    break;
                                                }
                                            }
                                            if(in_array($extension,$Url_Extensions))
                                            {   $type = "domain";   }
                                            else if(in_array($extension,$Image_Extensions))
                                            {   $type = "image";    }
                                            else if(in_array($extension,$Document_Extensions))
                                            {   $type = "document"; }
                                            else
                                            {   $type = "unknown";  }
                                        }
                                        else
                                        {   $type = "domain";   }

                                        if($imgURL != "")
                                        {
                                            if($type == "domain" && !in_array($imgURL,$this->linkBuffer["domain"]))
                                            {   $this->linkBuffer["domain"][] = $imgURL;    }
                                            if($type == "image" && !in_array($imgURL,$this->linkBuffer["image"]))
                                            {   $this->linkBuffer["image"][] = $imgURL; }
                                            if($type == "document" && !in_array($imgURL,$this->linkBuffer["document"]))
                                            {   $this->linkBuffer["document"][] = $imgURL;  }
                                            if($type == "unknown" && !in_array($imgURL,$this->linkBuffer["unknown"]))
                                            {   $this->linkBuffer["unknown"][] = $imgURL;   }
                                        }
                                    }
                                }
                                $imgTagCountStart = 0;
                            }
                            if($imgTagCountStart == 3)
                            {
                                $imgURL = "";
                                $dotCount = 0;
                                $slashCount = 0;
                                $singleSlashCount = 0;
                                $doubleSlashCount = 0;
                                $parentDirectoryCount = 0;
                                $webPageCounter++;
                                while($webPageCounter < $webPageLength)
                                {
                                    $character = $webPageContent[$webPageCounter];
                                    if($character == "")
                                    {   
                                        $webPageCounter++;  
                                        continue;
                                    }
                                    if($character == "\"" || $character == "'")
                                    {
                                        $webPageCounter++;
                                        while($webPageCounter < $webPageLength)
                                        {
                                            $character = $webPageContent[$webPageCounter];
                                            if($character == "")
                                            {   
                                                $webPageCounter++;  
                                                continue;
                                            }
                                            if($character == "\"" || $character == "'" || $character == "#")
                                            {   
                                                $webPageCounter--;  
                                                break;  
                                            }
                                            else if($imgURL != "")
                                            {   $imgURL .= $character;  }
                                            else if($character == "." || $character == "/")
                                            {
                                                if($character == ".")
                                                {
                                                    $dotCount++;
                                                    $slashCount = 0;
                                                }
                                                else if($character == "/")
                                                {
                                                    $slashCount++;
                                                    if($dotCount == 2 && $slashCount == 1)
                                                    $parentDirectoryCount++;
                                                    else if($dotCount == 0 && $slashCount == 1)
                                                    $singleSlashCount++;
                                                    else if($dotCount == 0 && $slashCount == 2)
                                                    $doubleSlashCount++;
                                                    $dotCount = 0;
                                                }
                                            }
                                            else
                                            {   $imgURL .= $character;  }
                                            $webPageCounter++;
                                        }
                                        break;
                                    }
                                    $webPageCounter++;
                                }
                            }
                            $imgTagLengthStart = 0;
                            $imgTagLengthFinal = strlen($imgTag[$imgTagCountStart]);
                            $imgTagPointer =& $imgTag[$imgTagCountStart];
                        }
                    }
                    else
                    {   $imgTagLengthStart = 0; }
                    //-- Image Filter End --
                    $webPageCounter++;
                }
            }
            else
            {   $this->error = "Unable to proceed, permission denied";  }
        }
        else
        {   $this->error = "Please enter url";  }

        if($this->error != "")
        {   $this->linkBuffer["error"] = $this->error;  }

        return $this->linkBuffer;
    }   
}
?>
0

You'd have to have some sort of hash table to store the results in; you'd just have to check it before each page load.
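
Something along these lines (a trivial sketch; `load_page` is a hypothetical fetch function):

visited = set()   # the "hash table" of URLs we have already loaded

def crawl(url, load_page):
    if url in visited:        # check before each page load
        return None
    visited.add(url)
    return load_page(url)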

chuchu
0

The problem here is not crawling duplicated URLs, which is solved by an index using a hash obtained from the URLs. The problem is crawling DUPLICATED CONTENT. Each URL of a "crawler trap" is different (year, day, SessionID...).

There is no "perfect" solution... but you can use some of these strategies:

• Keep a field with the level at which the URL sits inside the website. For each cycle of getting URLs from a page, increase the level. It will be like a tree. You can stop crawling at a certain level, like 10 (I think Google uses this).

• You can try to create a kind of hash which can be compared to find similar documents, since you can't compare against every document in your database. There is SimHash from Google, but I could not find any implementation to use, so I created my own. My hash counts low- and high-frequency characters inside the HTML code and generates a 20-byte hash, which is compared against a small cache of the last crawled pages inside an AVL tree using a nearest-neighbour search with some tolerance (about 2). You can't use any reference to character locations in this hash. After "recognizing" the trap, you can record the URL pattern of the duplicate content and start to ignore pages with that too (a rough sketch of this idea follows after this list).

• Like Google, you can create a ranking for each website and "trust" some more than others.
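
Here is a rough illustration of the second idea, a character-frequency fingerprint compared with a tolerance (this is not the author's actual hash; the character set and thresholds are arbitrary assumptions):

from collections import Counter

FINGERPRINT_CHARS = "<>/=\"' abcdefghijklmnopqrst"   # tracked characters, an arbitrary choice

def fingerprint(html: str):
    counts = Counter(html.lower())
    total = max(len(html), 1)
    # fraction of each tracked character, quantized to 0-255 (position-independent)
    return [min(255, int(256 * counts[c] / total)) for c in FINGERPRINT_CHARS]

def looks_duplicate(fp, recent_fps, tolerance=2):
    """Near-neighbour check against a small cache of recently crawled page fingerprints."""
    for other in recent_fps:
        if sum(abs(a - b) for a, b in zip(fp, other)) <= tolerance:
            return True
    return False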

lexmooze
-1

This is a web crawler example, which can be used to collect MAC addresses for MAC spoofing.

#!/usr/bin/env python

import sys
import os
import urlparse
import urllib
from bs4 import BeautifulSoup

def mac_addr_str(f_data):
    global fptr
    global mac_list
    word_array = f_data.split(" ")

    for word in word_array:
        if len(word) == 17 and ':' in word[2] and ':' in word[5] and ':' in word[8] and ':' in word[11] and ':' in word[14]:
            if word not in mac_list:
                mac_list.append(word)
                fptr.writelines(word +"\n")
                print word



url = "http://stackoverflow.com/questions/tagged/mac-address"

url_list = [url]
visited = [url]
pwd = os.getcwd();
pwd = pwd + "/internet_mac.txt";

fptr = open(pwd, "a")
mac_list = []

while len(url_list) > 0:
    try:
        htmltext = urllib.urlopen(url_list[0]).read()
    except:
        # the URL failed to load; drop it and move on to the next one
        url_list.pop(0)
        continue
    mac_addr_str(htmltext)
    soup = BeautifulSoup(htmltext)
    url_list.pop(0)
    for tag in soup.findAll('a',href=True):
        tag['href'] = urlparse.urljoin(url,tag['href'])
        if url in tag['href'] and tag['href'] not in visited:
            url_list.append(tag['href'])
            visited.append(tag['href'])

Change the url to crawl more sites... good luck!

-1

Well, the web is basically a directed graph, so you can construct a graph out of the URLs and then do a BFS or DFS traversal while marking the visited nodes so you don't visit the same page twice.
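
For example, a minimal BFS sketch (`get_links` is a hypothetical function that returns the out-links of a page):

from collections import deque

def bfs_crawl(seed, get_links):
    visited = {seed}                 # mark nodes so we never visit a page twice
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        for link in get_links(url):  # directed edges out of this node
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return visited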

Pepe
  • But how do you construct the graph in the first place? If we don't want duplicate nodes, i.e. we want only one node per URL, then you again need a way to detect and discard a duplicate while constructing the graph itself. – xyz Apr 29 '11 at 17:00
  • @learnerforever hmmm yes, that is true... I have honestly only written a simple crawler that handled about 100 links, so actually going into each page wasn't a huge issue. But yes, I can see the problems arising when you apply this to the entire web. Lirik's paper seems worthwhile though... – Pepe Apr 29 '11 at 17:13