
I'm trying to get the string "hinson lou ann" out of:

 <div class='owner-name'>hinson lou ann</div>

When I run the following:

$html = "http://gisapps.co.union.nc.us/ws/rest/v2/cm_iw.ashx?gid=12339";
$doc  = new DOMDocument();
$doc->loadHTMLFile($html);
$xpath    = new DOMXpath($doc);
$elements = $xpath->query("*/div[@class='owner-name']");
if (!is_null($elements)) {
    foreach ($elements as $element) {
        echo "<br/>[" . $element->nodeName . "]";
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {

            echo $node->nodeValue . "\n";
        }
    }
}

I get an error of:

Warning: DOMDocument::loadHTMLFile() [domdocument.loadhtmlfile]: htmlParseEntityRef: no name in http://gisapps.co.union.nc.us/ws/rest/v2/cm_iw.ashx?gid=12339, line: 1 in /home... on line ...

The warning points at the loadHTMLFile() line.

Note: The file is not valid HTML; it only contains div tags! What if I loaded the file and then wrapped it in html and body tags?

kenorb
tyler
  • First of all, that output is not valid html. – Rob W Jun 27 '13 at 20:35
  • Try `$html = file_get_contents('http://gisapps.co.union.nc.us/ws/rest/v2/cm_iw.ashx?gid=12339');` and then `$doc->loadHTML($html);` ... this is how I scrape my webpages at least – brendosthoughts Jun 27 '13 at 20:36
  • Yeah, but @RobW is right, it's not valid HTML... nothing but div tags! Any ideas? – tyler Jun 27 '13 at 20:37
  • There's your problem: `HINSON J MARK & WF LOU ANN G`... `&` starts an entity, a bare `&` should be `&amp;`. Ah well, `$doc->recover=true;` and all is _'wellish'_ (provided you use `//div[@class='owner-name']` rather than `*/div[@class='owner-name']`, as it magically creates elements to make it actual HTML). – Wrikken Jun 27 '13 at 20:45
  • Next time please ask a question as well. You have not said what you're concerned about, so users can only guess which question your issue raised for you. From the error message alone it's hard to say. – hakre Jun 29 '13 at 10:04

4 Answers


If you really must try to parse it, try this:

<?php
$html = file_get_contents("http://gisapps.co.union.nc.us/ws/rest/v2/cm_iw.ashx?gid=12339");
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->recover=true;
@$doc->loadHTML("<html><body>".$html."</body></html>");

$xpath = new DOMXpath($doc);
$elements = $xpath->query("//*/div[@class='owner-name']");

if (!is_null($elements)) {
    foreach ($elements as $element) {
        echo "<br/>[" . $element->nodeName . "]";
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
            echo $node->nodeValue . "\n";
        }
    }
}
?>

PS: Your XPath was wrong, so I fixed it. Also note that the DIV element (.owner-name) has no child elements, only a text node, so the inner loop over $nodes is unnecessary: reading $element->nodeValue directly gives you the name.
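
To check what childNodes actually holds for a div like this, a minimal sketch along these lines may help. It assumes an inline fragment ($fragment, a made-up stand-in for the remote response) rather than fetching the URL:

```php
<?php
// Hypothetical stand-in for the remote response; the real page is not fetched here.
$fragment = "<div class='owner-name'>hinson lou ann</div>";

$doc = new DOMDocument();
// LIBXML_NOERROR / LIBXML_NOWARNING keep libxml quiet without the @ operator.
$doc->loadHTML($fragment, LIBXML_NOERROR | LIBXML_NOWARNING);

$xpath = new DOMXPath($doc);
$div   = $xpath->query("//div[@class='owner-name']")->item(0);

// The text inside the div is itself a child node (a DOMText node).
$childCount = $div->childNodes->length;
$firstChild = $div->childNodes->item(0);
$isText     = $firstChild instanceof DOMText;

// nodeValue on the element returns the concatenated text of its descendants.
$ownerName = $div->nodeValue;

echo $childCount, ' child node(s), text: ', $ownerName, PHP_EOL;
```

If the div ever contained nested markup, nodeValue would still return the combined text, which is usually what you want when scraping a label.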

Rob W

Just build an HTML document from the source; wrapping it in the missing elements should do the trick.

For example:

<?php
$html = file_get_contents('http://gisapps.co.union.nc.us/ws/rest/v2/cm_iw.ashx?gid=12339');
$html = sprintf('<html><head><title></title></head><body>%s</body></html>', $html);

$doc = new DOMDocument;
$doc->loadHTML($html);

$xpa    = new DOMXPath($doc);
$divs   = $xpa->query('//div[@class="owner-name"]');

foreach($divs as $div) {
    echo $div->nodeValue, PHP_EOL;
}

/*
    hinson lou ann
*/
Anthony Sterling

You are getting the error because the HTML you load contains an & character that is not part of a valid HTML entity. The name of the entity is missing:

... <td>HINSON J MARK & WF LOU ANN G</td> ...
                      ^

When loading such documents, you will see this warning (as you wrote):

Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: no name

The "name" is the name of an HTML entity reference, which follows this pattern:

&name;
 ^^^^

However, this error does not prevent the HTML from loading. DOMDocument copes fine with this (common) error, although you might see the text cut off at the problematic position.

So your assumption that you need to wrap that file into a <body> tag is wrong. In HTML the <body> tag is optional.

Your concrete problem was that you did not know how to debug the HTML after loading it. Just use the saveHTML method to output what was successfully loaded. Doing so would have shown you that the URL had in fact loaded.
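
As a rough illustration of that debugging step, here is a minimal sketch. The nested fragment in $fragment is an assumption made up for the example, not the actual response of the URL:

```php
<?php
// Hypothetical fragment with no <html> or <body>, similar in shape to a bare-div response.
$fragment = "<div><div><div class='owner-name'>hinson lou ann</div></div></div>";

$doc = new DOMDocument();
$doc->loadHTML($fragment, LIBXML_NOERROR | LIBXML_NOWARNING);

// Serialize what the parser actually built, wrappers included.
$dump = $doc->saveHTML();
echo $dump;

// The parser adds the missing <html> and <body> elements on its own.
$hasBody = $doc->getElementsByTagName('body')->length === 1;
```

Reading the dump makes it obvious how deep the target div sits, which is the information needed to write a working XPath expression.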

That in turn would have led you to the next point, that the XPath expression was wrong:

*/div[@class='owner-name']

Your hunch about the <body> tag was not that far off, though: even if the HTML fragment does not contain a <body> tag, the DOM will have one. The div just sits two levels further inside it:

body/*/*/div[@class='owner-name']

Most often the short form // is used, which means you do not have to spell out the depth at which the tag is located:

//div[@class='owner-name']
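
To compare the three expressions side by side, here is a minimal sketch. It assumes a made-up nested fragment ($fragment) in which the target div sits two levels below <body>, and it relies on DOMXPath::query() evaluating relative paths against the root element when no context node is passed:

```php
<?php
// Hypothetical fragment: the owner-name div is nested inside two other divs.
$fragment = "<div><div><div class='owner-name'>hinson lou ann</div></div></div>";

$doc = new DOMDocument();
$doc->loadHTML($fragment, LIBXML_NOERROR | LIBXML_NOWARNING);
$xpath = new DOMXPath($doc);

// Relative to <html>: only div elements that are direct children of <body> (or <head>).
$relative = $xpath->query("*/div[@class='owner-name']");

// Explicit path: body, two arbitrary levels, then the div.
$explicit = $xpath->query("body/*/*/div[@class='owner-name']");

// Descendant search: any depth.
$anywhere = $xpath->query("//div[@class='owner-name']");

printf(
    "relative: %d, explicit: %d, descendant: %d\n",
    $relative->length,
    $explicit->length,
    $anywhere->length
);
```

The explicit path breaks as soon as the nesting changes, so the // form tends to be the more robust choice for scraped markup.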

hakre

The remote site may return invalid HTML, which causes this warning. DOMDocument and DOMXPath are very forgiving of HTML errors. If there is just a warning after calling DOMDocument::loadHTML() and the rest of the code produces valid results, I would advise you to suppress the warnings using the error-suppression operator @:

$doc = new DOMDocument();

// suppress warnings
$ret = @$doc->loadHTML($html);

// but check errors ...
if($ret === FALSE) {
    die('Parse error');
}
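
An alternative to @, sketched under some assumptions: libxml_use_internal_errors() lets you collect the parser messages instead of discarding them. The $html string below is a made-up fragment with a bare & (not the real response), and the exact messages reported may vary with the libxml2 version:

```php
<?php
// Hypothetical fragment with a bare "&", the kind of markup that triggers the warning.
$html = "<table><tr><td>HINSON J MARK & WF LOU ANN G</td></tr></table>"
      . "<div class='owner-name'>hinson lou ann</div>";

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc    = new DOMDocument();
$loaded = $doc->loadHTML($html);

// Inspect whatever the parser reported, then reset the error buffer.
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
}
libxml_clear_errors();
libxml_use_internal_errors($previous);

if ($loaded === false) {
    die('Parse error');
}

$xpath = new DOMXPath($doc);
$owner = $xpath->query("//div[@class='owner-name']")->item(0)->nodeValue;

echo $owner, PHP_EOL;
print_r($messages);
```

Compared with @, this keeps the warnings available for logging while still letting the forgiving parser do its job.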
hek2mgl
  • I tried this with no luck. The file is not valid HTML it only contains div tags – tyler Jun 27 '13 at 20:39
  • @Wrikken Can you explain? I expect `recover` being `true` by default – hek2mgl Jun 27 '13 at 20:45
  • @hek2mgl: well I'll be.. Note: it's not `true` by default (at least not here), but setting (or not setting) it here doesn't do that much good, I was overly hasty, my apologies :) (the actual problem here is that DOMDocument magically creates the html & body tags, which makes alterations to the xpath necessary). – Wrikken Jun 27 '13 at 20:53
  • @Wrikken thx for explanation. :) I have to admit that I was too lazy to have a closer look at the xpath expression.. – hek2mgl Jun 27 '13 at 21:14