
I have been given a task to crawl/parse and index the available books on many library web pages. I usually use HTML Agility Pack and C# to parse website content. One of them is the following:

http://bibliotek.kristianstad.se/pls/bookit/pkg_www_misc.print_index?in_language_id=en_GB

If you search for * (all books), it returns many lists of books, paginated at 10 books per page.

Typical web crawlers that I have found fail on this website. I have also tried to write my own crawler, which would go through all the links on the page and generate POST/GET variables to dynamically generate results. I haven't been able to get this working either, mostly due to 404 errors that I get (although I am certain that the generated links are correct).

The site relies on JavaScript to generate content and uses a mix of GET and POST variable submission.
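For reference, a stripped-down sketch of the kind of Agility Pack code I start from (not my actual crawler): it only ever sees the static HTML, which is why the script-generated result lists never show up.

    using System;
    using HtmlAgilityPack;

    class HapSketch
    {
        static void Main()
        {
            // HtmlWeb fetches and parses the raw HTML only - no script execution.
            var web = new HtmlWeb();
            HtmlDocument doc = web.Load("http://bibliotek.kristianstad.se/pls/bookit/pkg_www_misc.print_index?in_language_id=en_GB");

            // Grab every anchor that has an href; the dynamically generated ones are missing.
            var links = doc.DocumentNode.SelectNodes("//a[@href]");
            if (links == null) return;

            foreach (HtmlNode link in links)
            {
                Console.WriteLine(link.GetAttributeValue("href", string.Empty));
            }
        }
    }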

user496607

5 Answers


I'm going out on a limb, but try observing the JavaScript GETs and POSTs with Fiddler and then you can base your crawling off of those requests. Fiddler has FiddlerCore, which you can put in your own C# project. Using this, you could monitor requests made in the WebBrowser control and then save them for crawling or whatever, later.
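For example, once Fiddler shows you the exact POST the page makes, you can replay it from C# with a plain HttpWebRequest. A rough sketch (the URL and form fields below are placeholders - copy the real ones from the captured session):

    using System;
    using System.IO;
    using System.Net;
    using System.Text;

    class ReplaySketch
    {
        static void Main()
        {
            // Placeholders - substitute the URL, form data and headers seen in Fiddler.
            string url = "http://bibliotek.kristianstad.se/pls/bookit/some_search_procedure";
            string postData = "in_search_text=*&in_page_no=1";

            var request = (HttpWebRequest)WebRequest.Create(url);
            request.Method = "POST";
            request.ContentType = "application/x-www-form-urlencoded";
            request.CookieContainer = new CookieContainer();   // keep any session cookies the site sets

            byte[] body = Encoding.UTF8.GetBytes(postData);
            using (Stream stream = request.GetRequestStream())
            {
                stream.Write(body, 0, body.Length);
            }

            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                string html = reader.ReadToEnd();   // hand this off to HTML Agility Pack as usual
                Console.WriteLine(html.Length);
            }
        }
    }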

Going down the C# JavaScript interpreter route sounds like the 'more correct' way of doing this, but I wager it will be much harder and fraught with errors and bugs unless you have the simplest of cases.

Good luck.

Chad
  • Also, there is a plugin for Fiddler that will generate HttpWebRequest code to reproduce a particular request. Very handy for screen scraping sites that depend on certain headers. (And apparently you wrote it - Jesus, what a coincidence!) – Josh Dec 22 '10 at 00:28
  • @Josh - what a coincidence, right ;-) I happen to be a Fiddler fan. It comes in handy way more often than I would have ever anticipated. Hopefully, the OP can make use of it too. Obligatory link to the plugin - http://www.chadsowald.com/software/fiddler-extension-request-to-code – Chad Dec 22 '10 at 13:46
  • I think this will solve my problems. I found many variables that I had not seen when analyzing the page's scripts. Awesome tool indeed; I wish I had known about it earlier. – user496607 Dec 22 '10 at 19:51

FWIW, the C# WebBrowser control is very, very slow. It also doesn't support more than two simultaneous requests.

Using SHDocVw is faster, but is also semaphore limited.

Faster still is using MSHTML. Working code here: https://svn.arachnode.net/svn/arachnodenet/trunk/Renderer/HtmlRenderer.cs (username/password: Public). It doesn't have the request/rendering limitations that the other two have when run out of process.

This is headless, so none of the controls are rendered. (Faster).
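If you just want a feel for headless MSHTML parsing, a rough sketch against the mshtml COM interop looks like this (the HTML string is hard-coded for illustration; in a real crawler it would come from an HTTP response, and the interop details should be checked against your "Microsoft HTML Object Library" reference):

    using System;
    using mshtml;   // COM reference: Microsoft HTML Object Library

    class MshtmlSketch
    {
        [STAThread]
        static void Main()
        {
            // Hard-coded markup for the sketch; normally this is the downloaded page source.
            string html = "<html><body><a href='page1.html'>One</a> <a href='page2.html'>Two</a></body></html>";

            // Parse the markup entirely in memory - no WebBrowser control, nothing rendered.
            IHTMLDocument2 doc = (IHTMLDocument2)new HTMLDocument();
            doc.write(html);
            doc.close();

            // Walk the anchor collection and pull out the hrefs.
            IHTMLElementCollection links = doc.links;
            for (int i = 0; i < links.length; i++)
            {
                IHTMLElement link = (IHTMLElement)links.item(i, i);
                Console.WriteLine(link.getAttribute("href", 0));
            }
        }
    }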

Thanks, Mike

arachnode.net

If you use the WebBrowser control in a Windows Forms application to open the page then you should be able to access the DOM through the HtmlDocument. That would work for the HTML links.

As for the links that are generated through JavaScript, you might look at the ObjectForScripting property, which should allow you to interface with the HTML page through JavaScript. The rest then becomes a JavaScript problem, but it should (in theory) be solvable. I haven't tried this so I can't say.
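As a sketch of the first approach (the WebBrowser control plus HtmlDocument - I haven't run this against that particular site, so treat it as a starting point):

    using System;
    using System.Windows.Forms;

    class CrawlerForm : Form
    {
        private readonly WebBrowser browser = new WebBrowser();

        [STAThread]
        static void Main()
        {
            Application.Run(new CrawlerForm());
        }

        public CrawlerForm()
        {
            browser.ScriptErrorsSuppressed = true;            // the site is script-heavy
            browser.DocumentCompleted += OnDocumentCompleted; // the DOM is only ready after this fires
            Controls.Add(browser);
            browser.Navigate("http://bibliotek.kristianstad.se/pls/bookit/pkg_www_misc.print_index?in_language_id=en_GB");
        }

        private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
        {
            // HtmlDocument exposes the parsed DOM, including anchors added by script by this point.
            foreach (HtmlElement link in browser.Document.Links)
            {
                Console.WriteLine(link.GetAttribute("href"));
            }
        }
    }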

Steve Wortham

AbotX does JavaScript rendering for you. It's not free, though.

sjdirect

If the site generates content with JavaScript, then you are out of luck. You need a full JavaScript engine usable in C# so that you can actually execute the scripts and capture the output they generate.

Take a look at this question: Embedding JavaScript engine into .NET -- but know that it will take "serious" effort to do what you need.
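Just to illustrate what an embedded engine buys you, here is a tiny sketch using Jint (one such .NET JavaScript engine, picked purely as an example):

    using System;
    using Jint;   // example engine; any embeddable JavaScript engine for .NET would do

    class ScriptSketch
    {
        static void Main()
        {
            // Run a snippet the way a page script might build a link, then read the value back from C#.
            var engine = new Engine();
            engine.SetValue("baseUrl", "http://example.com/pkg_www_misc.print_index");   // placeholder
            engine.Execute("var link = baseUrl + '?in_page_no=' + (1 + 1);");
            Console.WriteLine(engine.GetValue("link").AsString());
        }
    }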

Jon