
My C# .NET Core console application is a simple web crawler. On pages where the needed data is contained in the page source, I can access it. On pages where the data can be copied from the window and viewed in the browser's Page Inspector, but is NOT in the page source, I'm stuck.

[Image: Chrome Page Inspector showing the markup that needs to be downloaded]

Please provide code examples of how I can acquire this data.

My current capture code is below:

var htmlCode = string.Empty;
using (WebClient client = new WebClient()) // WebClient implements IDisposable
{
    // Download the page source without saving it to disk
    htmlCode = client.DownloadString("https://www.wedj.com/dj-photo-video.nsf/firstdance.html");
}

Using the above code, you receive only the raw source code, which does not contain the data.

The data, as seen in the browser's inspector, is hidden inside:

<div class="entry row">
Cyphryx
  • Please show the code you used to try to acquire this data. – Stefan Aug 08 '18 at 17:51
  • Questions seeking debugging help ("why isn't this code working?") must include the desired behavior, a specific problem or error and the shortest code necessary to reproduce it in the question itself. Questions without a clear problem statement are not useful to other readers. See: [mcve] – Stefan Aug 08 '18 at 17:52
  • Code added to question – Cyphryx Aug 08 '18 at 17:55
  • So, what's in the `htmlCode` variable? – Stefan Aug 08 '18 at 17:57
  • htmlCode now defined above – Cyphryx Aug 08 '18 at 18:02
  • Hi, no, I mean: after you call `DownloadString`, what's in the `htmlCode` variable? Doesn't it contain the page? – Stefan Aug 08 '18 at 18:03
  • It only contains the source code. I'll add an image of the difference between what's received via DownloadString versus what you can view in the inspector. – Cyphryx Aug 08 '18 at 18:04
  • htmlCode contents now displayed in question – Cyphryx Aug 08 '18 at 18:08
  • Maybe the site author tried to block crawlers. You might want to look at one of the JavaScript files. Maybe they are calling a public service to fetch the data. – Stefan Aug 08 '18 at 18:09
  • I'm a desktop app dev, not a web dev, hence my question: "I can browse the data in my browser, how do I do it with C#?" There's bound to be somebody who knows web dev well enough to explain how I can capture the same data the browser is capturing for its inspector. – Cyphryx Aug 08 '18 at 18:15
  • Ah, yes, but it's a lot of work. You see, your browser runs JavaScript. For example: a page loads and the source is there, but then some JavaScript interacts with the page, starts fetching data, and alters the page. Browsers handle that really well because that's what they're built for. If you want to build it in code... you're basically building your own browser... with a JavaScript interpreter. That's a lot of code. So... maybe you can try things with a browser plugin, or analyse the JavaScript to find the actual data source. Maybe there is a lib somewhere. But it's not easy. – Stefan Aug 08 '18 at 18:20
  • Many Web pages have active content. A WebBrowser can interpret and execute (mainly) client-side JavaScript code. Some data can be pushed by the server. In short, you can't scrape just any Web page with a non-responsive system. You could use the `WebBrowser` class. It works well enough in most cases, but it's difficult to set up correctly for IE11/Edge compatibility (you can find info on SO about that). A more evolved client may be needed. See [this question](https://stackoverflow.com/questions/790542/replacing-net-webbrowser-control-with-a-better-browser-like-chrome) for some directions. – Jimi Aug 12 '18 at 23:06
  • Am I the only one who can see where the data is? After you make a POST request like https://www1.gigbuilder.com/gbmusic.nsf/musiclist?open&list=getsong&unid=373119AB109D294386257DA700325738 it literally responds with the same HTML plus a special div with the class "fancybox-wrap fancybox-desktop fancybox-type-iframe fancybox-opened". That's where all the data is. – Johni Michels Aug 19 '18 at 22:05
  • Actually, if you replace the unid value in the request with the id given on each tr, you can navigate through this data. – Johni Michels Aug 19 '18 at 22:10

3 Answers


There are a few ways to implement what you need (considering a C# console application).

Maybe the easiest one is to use a tool that interacts with an instance of a real browser, e.g. Selenium (widely used for automated UI tests). So:

  1. Install the Selenium.WebDriver NuGet package
  2. Install a browser on the machine where your application will run (let's assume Chrome)
  3. Download the matching browser driver (chromedriver)
  4. Write something like:

    // using OpenQA.Selenium;
    // using OpenQA.Selenium.Chrome;
    IWebDriver driver = null;
    try
    {
        ChromeOptions options = new ChromeOptions();
        options.AddArguments("--incognito");

        driver = new ChromeDriver(options);
        // Implicit wait: FindElement polls up to 5s for elements created by JavaScript
        driver.Manage().Timeouts().ImplicitWait = TimeSpan.FromSeconds(5);
        driver.Url = "https://www.wedj.com/dj-photo-video.nsf/firstdance.html";

        var musicTable = driver.FindElement(By.Id("musicTable"));
        // interact with driver to get data from the page.
    }
    finally
    {
        driver?.Dispose();
    }
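
For example, once musicTable has been located, you can enumerate its rows with `musicTable.FindElements(By.TagName("tr"))` and read each row's `Text` property to pull the song data out of the page.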
    

Otherwise, you need to investigate a little more how the web page works. As far as I can see, the page loads a JavaScript file, https://www.wedj.com/dj-photo-video.nsf/musiclist.js, which is responsible for loading the list of music from the server. That script basically loads its data from the following URL: https://www.wedj.com/gbmusic.nsf/musicList?open&wedj=1&list=category_firstdance&count=100 (you can also open it in a browser). Excluding the wrapping "(" and ")", the result is JSON you can parse (e.g. with the Newtonsoft.Json package):

{
  "more": "yes",
  "title": "<h1>Most Requested Wedding First Dance Songs<\/h...",
  "event": "<table class='musicTable g6-table-all g6-small' id='musicTable' borde..."
}

The event property contains the data you need (you can use the HtmlAgilityPack NuGet package to parse the HTML fragment).
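
Putting that together, a rough sketch of the manual approach (the URL and the parenthesis-stripping come straight from above; the row layout of the table is an assumption, so inspect the real markup and adjust the XPath):

// NuGet: Newtonsoft.Json, HtmlAgilityPack
using System;
using System.Net;
using HtmlAgilityPack;
using Newtonsoft.Json.Linq;

class MusicListScraper
{
    static void Main()
    {
        string raw;
        using (var client = new WebClient())
        {
            raw = client.DownloadString(
                "https://www.wedj.com/gbmusic.nsf/musicList?open&wedj=1&list=category_firstdance&count=100");
        }

        // The payload is wrapped in "(" and ")", so strip them before parsing
        var json = JObject.Parse(raw.Trim().TrimStart('(').TrimEnd(')'));

        // The "event" property holds the rendered table markup
        var doc = new HtmlDocument();
        doc.LoadHtml((string)json["event"]);

        // Assumed structure: one song per <tr> inside #musicTable
        var rows = doc.DocumentNode.SelectNodes("//table[@id='musicTable']//tr");
        if (rows != null)
            foreach (var row in rows)
                Console.WriteLine(HtmlEntity.DeEntitize(row.InnerText.Trim()));
    }
}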


Pro Selenium:

  1. easy to interact with
  2. the behavior is the same as what you see in the browser

Cons Selenium:

  1. you need Chrome or another browser installed
  2. a browser is running while you interact with it
  3. the browser downloads the full page (images, HTML, JS, CSS, ...)

Pro manual:

  1. you load only what you need
  2. no dependencies on external programs (i.e. browsers)

Cons manual:

  1. you need to understand how HTML/JS work
  2. you need to parse the JSON/HTML manually

In this specific case, I prefer the second option.

Sierrodc

Read about the Selenium automation tool for C#. It will open every web page you want to scrape and can then, e.g., return the page source or perform actions on that page.

Generally this tool is not (AFAIK) meant for web crawlers, but it can be good at the beginning, especially if your .NET Core app is sitting on some virtual machine or in Docker.

But be careful: it may be risky to open untrusted pages in a browser.
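
If the app runs on a machine without a display, a minimal sketch (assuming the Selenium.WebDriver package and chromedriver setup from the answer above; Chrome's --headless flag keeps any window from opening):

using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class HeadlessCrawler
{
    static void Main()
    {
        var options = new ChromeOptions();
        // Flags commonly needed on VMs/containers; no visible window is opened
        options.AddArguments("--headless", "--no-sandbox", "--disable-gpu");

        using (IWebDriver driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl("https://www.wedj.com/dj-photo-video.nsf/firstdance.html");
            // PageSource reflects the DOM after the page's scripts have run
            Console.WriteLine(driver.PageSource);
        }
    }
}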

UbuntuCore

You might want to try Puppeteer Sharp. It allows you to get the current HTML state of the page after JavaScript has run.

// using PuppeteerSharp;
// Download a local Chromium build first (the exact BrowserFetcher call varies by version)
await new BrowserFetcher().DownloadAsync();

using (var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = true }))
using (var page = await browser.NewPageAsync())
{
    await page.GoToAsync("http://www.spapage.com");
    var result = await page.GetContentAsync();
}
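
GetContentAsync serializes the DOM after the page's scripts have run, so, unlike WebClient.DownloadString, the returned HTML should include the script-generated content.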

https://github.com/kblok/puppeteer-sharp

brechtvhb