Setting the regex right (with Debuggex)

Question

I am trying to capture "Coffret bougies P'tits biscuits" in a parsed HTML page. I have the page in a $page2 variable and I want to retrieve each of the product names in it. Here is the regex I have come up with:

/<div\sclass=\"details\">\n\s*<h3>(.*)<a\shref=\"(.*)\">(.*)<\/a>/

and the code

preg_match_all('/<div\sclass=\"details\">\n\s*<h3>(.*)<a\shref=\"(.*)\">(.*)<\/a>/', $page2, $matcher);
    print_r($matcher);

This is supposed to capture all of the HTML that looks like this:

<div class="details">
    <h3><a href="/FR/fr/produits/fiche/coffret-bougies-ptits-biscuits-138156.htm">Coffret bougies P'tits biscuits</a>

https://www.debuggex.com/r/ddZCM3K_GQ4PkIPV/0

But for some reason I don't understand, it keeps returning an empty array:

Array ( [0] => Array ( ) [1] => Array ( ) [2] => Array ( ) [3] => Array ( ) )

First of all, if you're not smooth with regexes then you might just try to use [an html parser](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php). Second, you need to use ungreedy patterns. So `.*?` instead of `.*`. [Read the difference](http://stackoverflow.com/questions/3075130/difference-between-and-for-regex). Finally, remove `\n` since it's already included in `\s`, also who knows maybe there is `\r\n` ? *bon appétit* — HamZa, Dec 26 '13 at 19:59
Thanks a lot for your help. I am actually using an HTML parser for some other part of the code, but for this the HTML code is not regular enough for it to work. That's why I use regex. So my new line is `/<div\sclass=\"details\">\r\n\s*<h3>(.*?)<a\shref=\"(.*?)\">(.*?)<\/a>/` but it still doesn't work .. :( – justberare, Dec 26 '13 at 20:05
@justberare: this should be very easily fetched via DOM parser. I don't know why you are trying this with regex, since you already know how to do that with dom? — Glavić, Dec 26 '13 at 20:06
Then I have to admit that I can't manage to fetch the product name specifically. I managed to do so on other elements, but I can't get to the textContent of the child's child of the div with class=details. It always comes back empty – justberare, Dec 26 '13 at 20:09
@justberare Something [reaaaal quick](http://regex101.com/r/vX0wR8). Use different delimiters than `/` and use `xs` modifiers. PS: don't tell anyone that you got it from me :P — HamZa, Dec 26 '13 at 20:09
@HamZa thanks a lot. it improves the other regex stuff I had in my code. Thanks ! but now I fell again for the parse solution ;) — justberare, Dec 26 '13 at 20:15
Just for the record, there is a simple explanation why your debuggex attempt failed: you forgot to remove the trailing slash from the pattern. — Ruud Helderman, Dec 26 '13 at 20:23
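
For reference, here is a minimal sketch of what the suggestions in the comments above could look like when put together: a non-`/` delimiter, lazy or negated-class groups instead of greedy `.*`, and `\s*` in place of a literal `\n`. The sample markup is the snippet from the question; the exact pattern is only one possible reading of that advice, not the one behind the regex101 link, and the real page may need adjustments.

```php
<?php
// Sample markup copied from the question; $page2 stands in for the fetched page.
$page2 = <<<'HTML'
<div class="details">
    <h3><a href="/FR/fr/produits/fiche/coffret-bougies-ptits-biscuits-138156.htm">Coffret bougies P'tits biscuits</a>
HTML;

// '~' as delimiter, so '/' needs no escaping.
// \s* already covers "\n" as well as "\r\n", so no literal newline is required.
// [^"]* and .*? stop at the first closing quote / closing tag instead of the last one.
$pattern = '~<div\s+class="details">\s*<h3>\s*<a\s+href="([^"]*)">(.*?)</a>~s';

// PREG_SET_ORDER groups the captures per match: $m[1] is the href, $m[2] the name.
preg_match_all($pattern, $page2, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo $m[2], ' => ', $m[1], PHP_EOL;
}
```

With greedy `.*` and the `s` modifier, a single match could swallow everything up to the last `</a>` on the page, which is why the lazy and negated-class forms are used here.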

score 2 · Accepted Answer · answered Dec 26 '13 at 20:08

A DOM way:

<pre><?php

$html = <<<EOD
<div class="details">
    <h3><a href="/FR/fr/produits/fiche/coffret-bougies-ptits-biscuits-138156.htm">Coffret bougies P'tits biscuits</a>
EOD;

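// loadHTML() is an instance method; calling it statically fails on PHP 8.
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ silences warnings about the incomplete markup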
$xpath = new DOMXPath($dom);
$links = $xpath->query('//div[contains(@class,"details")]/h3/a');

foreach ($links as $link) {
    printf("<br>%s<br>%s", $link->nodeValue, $link->getAttribute('href'));  
}

?></pre>

answered Dec 26 '13 at 20:08 by Casimir et Hippolyte

Eh, so you quit the regex road :?) – HamZa Dec 26 '13 at 20:13
Just understood the power of this: div[@class="details"]/h3/a, no more children of child of... Thanks! – justberare Dec 26 '13 at 20:16
@HamZa: No, I am reinstalling a laptop, and I am setting up a virtualhost for each kind of subject. – Casimir et Hippolyte Dec 26 '13 at 20:18
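
To retrieve every product name rather than a single one, the same XPath query from the accepted answer can be wrapped in a small function. This is only a sketch: `extractProducts` is a hypothetical helper name, the second sample entry is invented for illustration, and `libxml_use_internal_errors()` is swapped in for the `@` operator as one way to keep parser warnings quiet.

```php
<?php
// Hypothetical helper, not part of the original answer.
function extractProducts(string $html): array
{
    // Collect libxml parse warnings internally instead of suppressing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $products = [];
    // Same query as in the answer: every <a> directly under an <h3> in a "details" div.
    foreach ($xpath->query('//div[contains(@class,"details")]/h3/a') as $link) {
        $products[] = [
            'name' => trim($link->textContent),
            'href' => $link->getAttribute('href'),
        ];
    }
    return $products;
}

// First entry is from the question; the second is made up for this example.
$sample = <<<'HTML'
<div class="details">
    <h3><a href="/FR/fr/produits/fiche/coffret-bougies-ptits-biscuits-138156.htm">Coffret bougies P'tits biscuits</a></h3>
</div>
<div class="details">
    <h3><a href="/example-product.htm">Example product</a></h3>
</div>
HTML;

$products = extractProducts($sample);
print_r($products);
```

Reading `textContent` on the `<a>` node itself avoids walking from the div down through its children by hand.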
