I am trying to capture "Coffret bougies P'tits biscuits" in a parsed HTML page. I have loaded the page into a $page2 variable and I want to retrieve each of the product names it contains. Here is the regex I have come up with:

/<div\sclass=\"details\">\n\s*<h3>(.*)<a\shref=\"(.*)\">(.*)<\/a>/

and the code

preg_match_all('/<div\sclass=\"details\">\n\s*<h3>(.*)<a\shref=\"(.*)\">(.*)<\/a>/', $page2, $matcher);
print_r($matcher);

This is supposed to capture every piece of HTML that looks like this:

<div class="details">
    <h3><a href="/FR/fr/produits/fiche/coffret-bougies-ptits-biscuits-138156.htm">Coffret bougies P'tits biscuits</a>

https://www.debuggex.com/r/ddZCM3K_GQ4PkIPV/0

But for some reason I don't understand, it keeps returning an empty array:

Array ( [0] => Array ( ) [1] => Array ( ) [2] => Array ( ) [3] => Array ( ) )

justberare
  • 3
    First of all, if you're not smooth with regexes then you might just try to use [an html parser](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php). Second, you need to use ungreedy patterns. So `.*?` instead of `.*`. [Read the difference](http://stackoverflow.com/questions/3075130/difference-between-and-for-regex). Finally, remove `\n` since it's already included in `\s`, also who knows maybe there is `\r\n` ? *bon appétit* – HamZa Dec 26 '13 at 19:59
  • Thanks a lot for your help. I am actually using an html parser for some other part of the code, but for this the html code is not regular enough for it to work. That's why I use regex. So my new line is / – justberare Dec 26 '13 at 20:05
  • 2
    @justberare: this should be very easily fetched via DOM parser. I don't know why you are trying this with regex, since you already know how to do that with dom? – Glavić Dec 26 '13 at 20:06
  • Then I have to admit that I can't manage to fetch the product name specifically. I managed to do so for other elements, but I can't get to the textContent of the child's child of the div with class=details. It always comes back empty – justberare Dec 26 '13 at 20:09
  • 1
    @justberare Something [reaaaal quick](http://regex101.com/r/vX0wR8). Use different delimiters than `/` and use `xs` modifiers. PS: don't tell anyone that you got it from me :P – HamZa Dec 26 '13 at 20:09
  • 1
    @HamZa thanks a lot. It improves the other regex stuff I had in my code. Thanks! But now I have gone back to the parser solution ;) – justberare Dec 26 '13 at 20:15
  • 1
    Just for the record, there is a simple explanation why your debuggex attempt failed: you forgot to remove the trailing slash from the pattern. – Ruud Helderman Dec 26 '13 at 20:23
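
To make the suggestions from the comments concrete, here is a minimal sketch of the pattern with a `~` delimiter, lazy quantifiers and `\s*` in place of `\n\s*`. The `\r\n` line ending in the sample string is an assumption used to show one way a literal `\n` can fail; the first capture group of the original pattern (the text between `<h3>` and `<a`) is dropped here, and the names `$pattern` and `$matches` are illustrative only.

```php
<?php
// Sample markup copied from the question. The "\r\n" line ending is an
// assumption: it is one case where a literal \n in the pattern would not match.
$page2 = "<div class=\"details\">\r\n    <h3><a href=\"/FR/fr/produits/fiche/coffret-bougies-ptits-biscuits-138156.htm\">Coffret bougies P'tits biscuits</a>";

// "~" as delimiter means "/" needs no escaping.
// \s* covers \r, \n and indentation in one go.
// Lazy quantifiers (.*?) stop at the first closing quote or tag.
// The "s" modifier lets "." match newlines inside a link if there are any.
$pattern = '~<div\s+class="details">\s*<h3>\s*<a\s+href="(.*?)">(.*?)</a>~s';

// PREG_SET_ORDER groups the captures per match instead of per group.
preg_match_all($pattern, $page2, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] is the href, $m[2] is the product name.
    echo $m[2], ' => ', $m[1], PHP_EOL;
}
```

This still assumes the `class` attribute is written exactly as `class="details"` with double quotes and no other classes, which is part of why the commenters recommend a DOM parser.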

1 Answer

A DOM way:

<pre><?php

$html = <<<EOD
<div class="details">
    <h3><a href="/FR/fr/produits/fiche/coffret-bougies-ptits-biscuits-138156.htm">Coffret bougies P'tits biscuits</a>
EOD;

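// loadHTML() is an instance method; calling it statically is an error on PHP 8.
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ silences the warnings caused by the incomplete markup
$xpath = new DOMXPath($dom);
// Select every <a> that is a child of an <h3> inside a div whose class contains "details".
$links = $xpath->query('//div[contains(@class,"details")]/h3/a');

foreach ($links as $link) {
    // nodeValue is the link text (the product name).
    printf("<br>%s<br>%s", $link->nodeValue, $link->getAttribute('href'));
}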

?></pre>
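
If the goal is to reuse the names rather than print them, the same XPath query can fill an array. The following is only a sketch under a few assumptions: `$products` is a hypothetical variable name, each href is assumed to be unique so it can serve as the array key, and `libxml_use_internal_errors()` is used in place of `@` to keep parse warnings out of the output.

```php
<?php
// Sample markup from the question, left unclosed on purpose as in the original.
$html = <<<EOD
<div class="details">
    <h3><a href="/FR/fr/produits/fiche/coffret-bougies-ptits-biscuits-138156.htm">Coffret bougies P'tits biscuits</a>
EOD;

// Collect libxml parse warnings internally instead of silencing them with @.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous); // restore the previous setting

$xpath = new DOMXPath($dom);

// $products is a hypothetical name: it maps each href to its product name.
$products = [];
foreach ($xpath->query('//div[contains(@class,"details")]/h3/a') as $link) {
    // textContent holds the text inside the <a> element.
    $products[$link->getAttribute('href')] = trim($link->textContent);
}

print_r($products);
```

Reading `textContent` on the `<a>` node itself, rather than on the enclosing `div`, avoids having to walk down through the `h3` by hand.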
Casimir et Hippolyte