3

I am trying to save a web page (just as browsers do) along with all its content and formatting. I tried the WebClient and WebRequest examples, but they only download the HTML text and sometimes the JavaScript, not the CSS, images, etc. Is there an API for this in .NET, or a third-party library for .NET?

I think it must be possible, because many offline-reading applications show saved pages with the same formatting and styling. How is it done? Any ideas?

EDIT 1: Web pages can be parsed and saved using HtmlAgilityPack. But is there any way to separate the main article from other content such as ads and external links? Is there any way to tell relevant content from irrelevant content? (I am sorry if this question is not clear.)

Also, can anyone suggest how offline-reading applications (Read Later, Pocket, etc.) save a web page and format it?

Is there any way to do the same in C#?

Deeps
  • Maybe this [SO question](http://stackoverflow.com/questions/1263266/c-sharp-find-image-in-html-and-download-them) can help you – tsukimi Jul 17 '12 at 05:38

3 Answers

4

You can download the page text as HTML, then parse it to find the <link rel="stylesheet" type="text/css" href="..."> and <img src="..."/> elements, and download the resources referenced by their href or src attributes separately.

HtmlAgilityPack is a reliable and useful library for parsing HTML.
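
The steps above can be sketched as follows. This is a minimal, language-neutral illustration in Python using only the standard library, not HtmlAgilityPack itself; `ResourceCollector`, `collect_resources`, and the example.com addresses are hypothetical names chosen for the example. It only finds the resource URLs and resolves them against the page address; each one would still have to be downloaded separately.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collects absolute URLs of stylesheets, images and scripts in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # address the page was downloaded from
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        url = None
        if tag == "link":
            # Only <link rel="stylesheet" ...> elements point at CSS files.
            rel = (attributes.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                url = attributes.get("href")
        elif tag in ("img", "script"):
            url = attributes.get("src")
        if url:
            # Relative paths must be resolved against the page address.
            self.resources.append(urljoin(self.base_url, url))


def collect_resources(html, base_url):
    """Return the absolute URLs of the resources referenced by the page."""
    collector = ResourceCollector(base_url)
    collector.feed(html)
    return collector.resources
```

In C#, the same idea would be to select the elements with HtmlAgilityPack, read their href or src attributes, and resolve each value against the page address before downloading it.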

carla
Ria
  • Hello Ria, I tried HtmlAgilityPack and now I am able to download images and other linked files such as CSS and JS, but the page still doesn't look right: there is no formatting. I changed the href and src attributes to point to the local directory, but it had no effect. – Deeps Jul 18 '12 at 05:07
  • @Deeps: Hello Deeps, are you sure you inserted valid addresses for the local files: the `file:///` prefix, `/` instead of `\`, `%20` instead of spaces, and so on? – Ria Jul 18 '12 at 06:29
  • Yes, I have checked them; the paths are fine. I think the CSS and JS files are protected, so they get downloaded without any text in them. Is there any way to solve this? I tried setting a user agent on the web request that downloads the files, but nothing changed. – Deeps Jul 18 '12 at 06:56
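
The address checks mentioned in the comments above (the `file:///` prefix, forward slashes, `%20` for spaces) can be illustrated with a small sketch. It is written in Python with the standard library only; `to_file_url` and the example paths are hypothetical, and the function assumes it is given an absolute local path.

```python
from urllib.parse import quote


def to_file_url(local_path):
    """Turn an absolute local path into a file:/// address usable in href/src."""
    # Browsers expect forward slashes, even for Windows paths.
    normalized = local_path.replace("\\", "/")
    # A Windows path such as C:/... needs a leading slash after "file://".
    if not normalized.startswith("/"):
        normalized = "/" + normalized
    # Percent-encode spaces and other unsafe characters; keep "/" and ":".
    return "file://" + quote(normalized, safe="/:")
```

For example, `to_file_url(r"C:\Saved Pages\site\style.css")` gives `file:///C:/Saved%20Pages/site/style.css`.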
2

You could try saving the page as an MHT file. An .mht file bundles the web page and all of its references into a single compact file.

Stack Overflow topic about MHT via C#

Note: MHT was introduced by Microsoft, and not all browsers support the format. Opera is the only other popular browser that can save as MHT. Firefox users can use one of two add-ons to handle the format, Mozilla Archive Format and UnMHT; both can be installed and used to open and save complete web pages.
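
To show what such a bundle looks like inside: an MHT (MHTML) file is a MIME `multipart/related` message in which each part carries a `Content-Location` header with its original address. Below is a minimal sketch in Python using the standard `email` package; `build_mht` and its parameters are hypothetical names, the resources are assumed to be downloaded already, and how well a given browser opens the result has not been verified here.

```python
from email import encoders
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mht(page_url, html, resources):
    """Bundle a page and its resources into MHTML bytes.

    resources is a list of (url, content_type, data_bytes) tuples.
    """
    archive = MIMEMultipart("related", type="text/html")
    archive["Subject"] = page_url

    # The first part is the HTML document itself.
    root = MIMEText(html, "html", "utf-8")
    root["Content-Location"] = page_url
    archive.attach(root)

    # Every other part is a resource, stored under its original URL.
    for url, content_type, data in resources:
        maintype, subtype = content_type.split("/", 1)
        part = MIMEBase(maintype, subtype)
        part.set_payload(data)
        encoders.encode_base64(part)
        part["Content-Location"] = url
        archive.attach(part)

    return archive.as_bytes()
```

The returned bytes could then be written to a file with an .mht extension.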

Community
astro boy