
I want to scrape a website that requires three levels of scraping. First I get all the pagination pages. Then, on each page, I get the image, the title, and a URL that leads to a unique detail page containing more information such as the description, date, and so on. If I use `foreach`, I get wrong results, and if I use `for` instead of `foreach`, it returns just one object. How can I handle this (using `for` instead of `foreach`)?

<?php 
   $stackHref=array();
   $eventDetail=array();
   $error_log='';   // initialise before appending with .=

$sitecontent =  file_get_contents('https://www.everfest.com/music/edm-festivals');


    if($sitecontent === FALSE) {

        $error_log .= 'Error on  $sitecontent =  file_get_contents(https://www.everfest.com/music/edm-festivals) ';
        //insert_error($error_log);
    }
   //  echo $sitecontent;
   $dom = new DOMDocument();
   libxml_use_internal_errors(true);
   $dom->loadHTML($sitecontent);
   libxml_use_internal_errors(false);
   $xpath = new DOMXPath($dom);     
   $nodes = $xpath->query("(//ul[@class='pagination'])[1]/li/a/@href");
   // $all_area_set= ' ';
   //echo $sitecontent;
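  if($nodes === false)   // query() returns false on an invalid expression; isset() is always true here
  {
       // single quotes keep $nodes and $xpath from being interpolated into the message
       $error_log .= 'Error on $nodes = $xpath->query("(//ul[@class=\'pagination\'])[1]/li/a/@href")';
       //insert_error($error_log);
       echo $error_log;
  }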

  // get total pages

  foreach ($nodes as $link) {
      $stackHref[]='https://www.everfest.com'.$link->nodeValue;      
  }

  //loop through each pages in order to scrape 
  $j=0;
  for($i=0;$i<count($stackHref);$i++){

      $sitecontent=file_get_contents($stackHref[$i]);

      if($sitecontent === FALSE) {

        // report the URL that actually failed, not the start page
        $error_log .= 'Error on  $sitecontent =  file_get_contents(' . $stackHref[$i] . ') ';
        //insert_error($error_log);
        continue;   // skip this page; there is nothing to parse
      }

      $dom= new DOMDocument();
      libxml_use_internal_errors(TRUE);
      $dom->loadHTML($sitecontent);
      libxml_use_internal_errors(FALSE);
      $innerXpath= new DOMXPath($dom);



      //get page link
      $pageLinks= $innerXpath->query('//div[@class="festival-card grow"]/a[1]/@href');

      for ($a=0;$a <$pageLinks->length;$a++ ){
          $eventDetail[$j]['pagelink']='https://www.everfest.com'.$pageLinks[$a]->nodeValue;

          //get img src
          $images= $innerXpath->query("//div[contains(@class,'columns medium-6 large-4')]/div[contains(@class,'grow')]/a/img/@src");
          $eventDetail[$j]['img']=$images[$a]->nodeValue;

          //get title
          $titles= $innerXpath->query("//div[contains(@class,'clearfix')]/a[1]/text()");
          $eventDetail[$j]['title']=$titles[$a]->nodeValue;

          // go inside each page in order to get description, date, venue
          $sitecontent=file_get_contents($eventDetail[$j]['pagelink']);
          $dom= new DOMDocument();
          libxml_use_internal_errors(TRUE);
          $dom->loadHTML($sitecontent);
          libxml_use_internal_errors(FALSE);
          $deepxpath= new DOMXPath($dom);

          //get description
          $descriptions= $deepxpath->query('//div[@class="columns"]/div[contains(@class,"card-white")]/p[contains(@class,"")]/span[1]/following-sibling::text()[1]');
          $eventDetail[$j]['description']=$descriptions[$a]->nodeValue;

          //get date
          $dates= $deepxpath->query('//div[@id="signup"]/div[@class="row"]/div[contains(@class,"columns")][1]/p/text()[1]');
          $eventDetail[$j]['Date']=$dates[$a]->nodeValue;

          //get venue
          $venues= $deepxpath->query('//div[@id="signup"]/div[@class="row"]/div[contains(@class,"columns")][1]/p/text()[2]');
          $eventDetail[$j++]['venue']=$venues[$a]->nodeValue;
      }
  }
?>
john john
  • Are you receiving any errors? Check log files. I see an immediate one as you are attempting to index DOMXPath query results: `$pageLinks[$a]` which should error out as: `PHP Fatal error: Cannot use object of type DOMNodeList as array`. – Parfait Aug 14 '16 at 15:14
  • @Parfait, no, I don't get any errors. For the first element, i.e. when `$a=0`, it returns the correct values. When `$a` is increased, title, pagelink, and img still have their proper values, but description, date, and venue are not set. – john john Aug 15 '16 at 08:25
  • That is very interesting! I literally copied your entire code and ran it on my end, and it erred on line 61, pointing to `$pageLinks[$a]` with the same message as above. Curious, which version of PHP are you running? Maybe PHP 7 resolved this item. I use PHP 5.4. – Parfait Aug 15 '16 at 14:30
  • @Parfait, I use PHP 5.6.11. – john john Aug 16 '16 at 06:47
  • Do you have error reporting turned on? Check [.ini file](http://stackoverflow.com/questions/1053424/how-do-i-get-php-errors-to-display). – Parfait Aug 17 '16 at 17:16
  • @Parfait yes it's turned on – john john Aug 18 '16 at 07:33
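
Going by the symptom described in the comments (description, date, and venue are only filled in when `$a` is 0), one likely cause is that the detail-page results are indexed with the listing counter `$a`. Each detail page is loaded into its own document, so its node lists probably hold a single match at index 0, and `$descriptions[$a]` finds nothing once `$a` is greater than 0. The sketch below shows that idea offline. The helper names `xpathFromHtml` and `firstValue`, the inline HTML, and the `/e/alpha` style paths are all hypothetical stand-ins, not the real everfest.com markup. It also uses `item()` rather than array access, which should sidestep the `Cannot use object of type DOMNodeList as array` error reported above on PHP 5.4.

```php
<?php
// Parse an HTML string and return a DOMXPath for it (hypothetical helper).
function xpathFromHtml($html)
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);  // silence warnings from sloppy markup
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return new DOMXPath($dom);
}

// Return the trimmed value of the FIRST node matching $query, or null if none
// (hypothetical helper). $context limits the query to one node's subtree.
function firstValue(DOMXPath $xpath, $query, $context = null)
{
    $nodes = $xpath->query($query, $context);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->nodeValue);
}

// Assumed, simplified markup standing in for the remote pages.
$listingHtml = '<div class="festival-card grow"><a href="/e/alpha">Alpha</a></div>'
             . '<div class="festival-card grow"><a href="/e/beta">Beta</a></div>';
$detailPages = array(
    '/e/alpha' => '<div id="signup"><p>Aug 1</p></div>',
    '/e/beta'  => '<div id="signup"><p>Sep 9</p></div>',
);

$events  = array();
$listing = xpathFromHtml($listingHtml);

// Level 2: one iteration per card, with queries relative to that card.
foreach ($listing->query('//div[@class="festival-card grow"]') as $card) {
    $href = firstValue($listing, 'a[1]/@href', $card);

    // Level 3: a fresh document per detail page, always read from index 0.
    // In the real script this string would come from file_get_contents().
    $detail = xpathFromHtml($detailPages[$href]);

    $events[] = array(
        'pagelink' => $href,
        'title'    => firstValue($listing, 'a[1]', $card),
        'Date'     => firstValue($detail, '//div[@id="signup"]/p'),
    );
}

print_r($events);
```

In the original loop, the equivalent minimal change would be to read `$descriptions`, `$dates`, and `$venues` at index 0 instead of `$a`, while still indexing the listing-page lists (`$pageLinks`, `$images`, `$titles`) with `$a`.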
