I am trying to create a basic web crawler that specifically looks for links from adverts.

I have managed to find a script that uses cURL to get the contents of the target webpage (shown below).

I also found one that uses DOM.

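<?php
    // Fetch the page and write the raw response body to source_code.txt.
    $ch = curl_init("http://www.nbcnews.com");
    $fp = fopen("source_code.txt", "w");
    if ($ch === false || $fp === false) {
        exit("Could not initialise cURL or open the output file\n");
    }

    curl_setopt($ch, CURLOPT_FILE, $fp);     // stream the body straight into the file
    curl_setopt($ch, CURLOPT_HEADER, false); // leave the response headers out of the output

    // cURL only downloads the markup; any JavaScript in it is saved as text, not executed.
    if (curl_exec($ch) === false) {
        echo "cURL error: " . curl_error($ch) . "\n";
    }
    curl_close($ch);
    fclose($fp);
?>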
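
For reference, the DOM-based approach mentioned above might look roughly like the following. This is only a minimal sketch: `extractLinks` is a hypothetical helper name, the sample markup is made up, and it assumes PHP's bundled DOM extension is enabled. It only sees links that are already present in the HTML it is given, so it has the same limitation as the cURL script when ads are injected by JavaScript.

```php
<?php
// Hypothetical helper: returns the href of every <a> element found in an HTML string.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href; // skip anchors that have no href attribute
        }
    }
    return $links;
}

// Made-up sample markup standing in for the contents of source_code.txt.
$sample = '<p><a href="http://example.com/ad?id=1">Ad</a> <a name="top">no href</a></p>';
print_r(extractLinks($sample));
```

In practice the input would be the saved page, e.g. `extractLinks(file_get_contents("source_code.txt"))`.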

These are great, and I certainly feel like I'm heading in the right direction. However, quite a few adverts are displayed using JavaScript, and because that runs client-side, it is never executed by cURL: I only see the JS code and not the ads.

Basically, is there any way of getting the JS to execute before I start trying to extract the links?

Thanks

Ed Bangga
  • I don't know whether there are JavaScript engines written in PHP, but you can achieve what you want using PhantomJS, which is a headless, programmable browser. – Kemal Dağ Sep 04 '13 at 20:21
  • @KemalDağ You should add that as an answer. Short but correct. – Gavin Sep 04 '13 at 21:21
  • Is there any way of rendering the JavaScript server-side, PHP or not? – Jonathan Rosenfeld Sep 04 '13 at 21:24
  • Crawling ad links sounds like using an automated service to increase your revenue, which would be against the terms of service of all ad services. In other words, you should not be doing it. They put it in JavaScript for a reason: to prevent you from doing it. – developerwjk Sep 04 '13 at 21:31
  • @developerwjk, I'm unsure what led you to that assumption, but you are very wrong. You'll also find that ad networks filter out link hits from crawlers. I'm more interested in finding out who is advertising on certain sites and for how long. – Jonathan Rosenfeld Sep 04 '13 at 21:45
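
The PhantomJS suggestion in the comments above could be driven from PHP along these lines. This is a rough sketch under several assumptions, not a tested recipe: it assumes a `phantomjs` binary is installed and on the PATH, the helper names `buildRenderCommand` and `renderPage` are hypothetical, and the two-second wait for client-side scripts is an arbitrary guess that may be too short for some ad networks.

```php
<?php
// Small PhantomJS script (JavaScript): load the URL passed as the first argument,
// wait briefly so client-side scripts can run, then print the rendered HTML.
const RENDER_SCRIPT = <<<'JS'
var system = require('system');
var page = require('webpage').create();
page.open(system.args[1], function (status) {
    if (status !== 'success') {
        console.error('Failed to load ' + system.args[1]);
        phantom.exit(1);
        return;
    }
    window.setTimeout(function () {
        console.log(page.content);
        phantom.exit(0);
    }, 2000); // assumed delay; tune as needed
});
JS;

// Hypothetical helper: builds the shell command, escaping every argument.
function buildRenderCommand(string $url, string $scriptPath, string $binary = 'phantomjs'): string
{
    return escapeshellcmd($binary) . ' ' . escapeshellarg($scriptPath) . ' ' . escapeshellarg($url);
}

// Hypothetical helper: writes the script to a temp file, runs PhantomJS, returns the HTML.
function renderPage(string $url): string
{
    $scriptPath = sys_get_temp_dir() . DIRECTORY_SEPARATOR . uniqid('render_', true) . '.js';
    file_put_contents($scriptPath, RENDER_SCRIPT);
    $html = shell_exec(buildRenderCommand($url, $scriptPath));
    unlink($scriptPath);
    return (string) $html; // empty string if the command produced no output
}

// Example call (requires PhantomJS to be installed):
// $html = renderPage('http://www.nbcnews.com');
```

The returned string is the page markup after scripts have run, so it can be passed to the same link-extraction step that would otherwise be applied to the raw cURL output.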
