
I have been given a task to crawl/parse and index the available books on many library web pages. I usually use HTML Agility Pack and C# to parse website content. One of them is the following:

http://bibliotek.kristianstad.se/pls/bookit/pkg_www_misc.print_index?in_language_id=en_GB

If you search for * (all books), it returns many lists of books, paginated at 10 books per page.

Typical web crawlers that I have found fail on this website. I have also tried to write my own crawler, which would go through all the links on the page and generate POST/GET variables to dynamically generate results. I haven't been able to get this working either, mostly due to 404 errors that I get (although I am certain that the generated links are correct).

The site relies on JavaScript to generate content and uses a mix of GET and POST variable submission.
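For reference, a stripped-down sketch of the kind of Agility Pack code I start from (not my actual crawler): it only ever sees the static HTML, which is why the script-generated result lists never show up.

    using System;
    using HtmlAgilityPack;

    class HapSketch
    {
        static void Main()
        {
            // HtmlWeb fetches and parses the raw HTML only - no script execution.
            var web = new HtmlWeb();
            HtmlDocument doc = web.Load("http://bibliotek.kristianstad.se/pls/bookit/pkg_www_misc.print_index?in_language_id=en_GB");

            // Grab every anchor that has an href; the dynamically generated ones are missing.
            var links = doc.DocumentNode.SelectNodes("//a[@href]");
            if (links == null) return;

            foreach (HtmlNode link in links)
            {
                Console.WriteLine(link.GetAttributeValue("href", string.Empty));
            }
        }
    }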

user496607

5 Answers


I'm going out on a limb, but try observing the JavaScript GETs and POSTs with Fiddler and then you can base your crawling off of those requests. Fiddler has FiddlerCore, which you can put in your own C# project. Using this, you could monitor requests made in the WebBrowser control and then save them for crawling or whatever, later.
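For example, once Fiddler shows you the exact POST the page makes, you can replay it from C# with a plain HttpWebRequest. A rough sketch (the URL and form fields below are placeholders - copy the real ones from the captured session):

    using System;
    using System.IO;
    using System.Net;
    using System.Text;

    class ReplaySketch
    {
        static void Main()
        {
            // Placeholders - substitute the URL, form data and headers seen in Fiddler.
            string url = "http://bibliotek.kristianstad.se/pls/bookit/some_search_procedure";
            string postData = "in_search_text=*&in_page_no=1";

            var request = (HttpWebRequest)WebRequest.Create(url);
            request.Method = "POST";
            request.ContentType = "application/x-www-form-urlencoded";
            request.CookieContainer = new CookieContainer();   // keep any session cookies the site sets

            byte[] body = Encoding.UTF8.GetBytes(postData);
            using (Stream stream = request.GetRequestStream())
            {
                stream.Write(body, 0, body.Length);
            }

            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                string html = reader.ReadToEnd();   // hand this off to HTML Agility Pack as usual
                Console.WriteLine(html.Length);
            }
        }
    }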

Going down the C# JavaScript interpreter route sounds like the 'more correct' way of doing this, but I wager it will be much harder and fraught with errors and bugs unless you have the simplest of cases.

Good luck.

Chad
  • Also, there is a plugin for Fiddler that will generate HttpWebRequest code to reproduce a particular request. Very handy for screen scraping sites that depend on certain headers. (And apparently you wrote it - Jesus, what a coincidence!) – Josh Dec 22 '10 at 00:28
  • @Josh - what a coincidence, right ;-) I happen to be a Fiddler fan. It comes in handy way more often than I would have ever anticipated. Hopefully, the OP can make use of it too. Obligatory link to the plugin - http://www.chadsowald.com/software/fiddler-extension-request-to-code – Chad Dec 22 '10 at 13:46
  • I think this will solve my problems. I found many variables that I had not seen when analyzing the page's scripts. Awesome tool indeed; I wish I had known about it earlier. – user496607 Dec 22 '10 at 19:51

FWIW, the C# WebBrowser control is very, very slow. It also doesn't support more than two simultaneous requests.

Using SHDocVw is faster, but is also semaphore limited.

Faster still is using MSHTML. Working code here: https://svn.arachnode.net/svn/arachnodenet/trunk/Renderer/HtmlRenderer.cs (username/password: Public). It doesn't have the request/rendering limitations that the other two have when run out of process.

This is headless, so none of the controls are rendered. (Faster).
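If you just want a feel for headless MSHTML parsing, a rough sketch against the mshtml COM interop looks like this (the HTML string is hard-coded for illustration; in a real crawler it would come from an HTTP response, and the interop details should be checked against your "Microsoft HTML Object Library" reference):

    using System;
    using mshtml;   // COM reference: Microsoft HTML Object Library

    class MshtmlSketch
    {
        [STAThread]
        static void Main()
        {
            // Hard-coded markup for the sketch; normally this is the downloaded page source.
            string html = "<html><body><a href='page1.html'>One</a> <a href='page2.html'>Two</a></body></html>";

            // Parse the markup entirely in memory - no WebBrowser control, nothing rendered.
            IHTMLDocument2 doc = (IHTMLDocument2)new HTMLDocument();
            doc.write(html);
            doc.close();

            // Walk the anchor collection and pull out the hrefs.
            IHTMLElementCollection links = doc.links;
            for (int i = 0; i < links.length; i++)
            {
                IHTMLElement link = (IHTMLElement)links.item(i, i);
                Console.WriteLine(link.getAttribute("href", 0));
            }
        }
    }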

Thanks, Mike

arachnode.net

If you use the WebBrowser control in a Windows Forms application to open the page then you should be able to access the DOM through the HtmlDocument. That would work for the HTML links.

As for the links that are generated through JavaScript, you might look at the ObjectForScripting property, which should allow you to interface with the HTML page through JavaScript. The rest then becomes a JavaScript problem, but it should (in theory) be solvable. I haven't tried this so I can't say.
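As a sketch of the first approach (the WebBrowser control plus HtmlDocument - I haven't run this against that particular site, so treat it as a starting point):

    using System;
    using System.Windows.Forms;

    class CrawlerForm : Form
    {
        private readonly WebBrowser browser = new WebBrowser();

        [STAThread]
        static void Main()
        {
            Application.Run(new CrawlerForm());
        }

        public CrawlerForm()
        {
            browser.ScriptErrorsSuppressed = true;            // the site is script-heavy
            browser.DocumentCompleted += OnDocumentCompleted; // the DOM is only ready after this fires
            Controls.Add(browser);
            browser.Navigate("http://bibliotek.kristianstad.se/pls/bookit/pkg_www_misc.print_index?in_language_id=en_GB");
        }

        private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
        {
            // HtmlDocument exposes the parsed DOM, including anchors added by script by this point.
            foreach (HtmlElement link in browser.Document.Links)
            {
                Console.WriteLine(link.GetAttribute("href"));
            }
        }
    }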

Steve Wortham

AbotX does JavaScript rendering for you. It's not free, though.

sjdirect

If the site generates content with JavaScript, then you are out of luck. You need a full JavaScript engine usable in C# so that you can actually execute the scripts and capture the output they generate.

Take a look at this question: Embedding JavaScript engine into .NET -- but know that it will take "serious" effort to do what you need.
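Just to illustrate what an embedded engine buys you, here is a tiny sketch using Jint (one such .NET JavaScript engine, picked purely as an example):

    using System;
    using Jint;   // example engine; any embeddable JavaScript engine for .NET would do

    class ScriptSketch
    {
        static void Main()
        {
            // Run a snippet the way a page script might build a link, then read the value back from C#.
            var engine = new Engine();
            engine.SetValue("baseUrl", "http://example.com/pkg_www_misc.print_index");   // placeholder
            engine.Execute("var link = baseUrl + '?in_page_no=' + (1 + 1);");
            Console.WriteLine(engine.GetValue("link").AsString());
        }
    }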

Jon