-1

I have used http://www.regexr.com/ to learn the regex syntax that I am applying in PHP. However, I'm sure there is a better way to write this expression:

(?:\"price|price\")+(?:[^\>])*(?:\>)+((?:[^\>](?!\/))+)+(?:[^\>])*(?:\>)*([^\<]*(?!\/\>))

I am trying to retrieve the price values for the following text:

A     <span class="price-sales">$80.00</span>

B <div class="ProdMargin"><font class="items_price" >€19,75</font></div> 
C <div class="price" id="text-price"> foo
<span >EUR 149 €</span>

        </div>
D <div class="price" id="text-foo"> <span >149 €</span></div>
E <div id="text-price" id="foo"> <span >149 EUR</span></div>
F <div class="foo">bar</div>

Desired matches are:

  • A $80.00
  • B €19,75
  • C EUR 149 €
  • D 149 €
  • E 149 EUR

The main issue is that I have had to create two matching groups: one for ordinary matches (A, B) and one for values nested in a second-level child (C, D, E).

Questions:

  • 1) Am I doing anything wrong, or can the expression be improved?
  • 2) Can I get just one resulting match group?

Much appreciated!

James

2 Answers

1

Would something like this work?

/(\$|€|EUR)? *([0-9,]+(\.[0-9]{1,2})?) *(\$|€|EUR)?/

[EDIT]

In that case, I don't think a regular expression is the best tool. Try a DOM parser instead; PHP has one built in. Here's a starting point: "Getting DOM elements by classname"

Nosrac
  • Unfortunately not, the format can be any currency (either a symbol or letters) in front of or behind the number (which might also be in different formats). So what I was going for was getting the content of elements whose class or id begins or ends with "price" – James May 10 '14 at 16:50
  • I guess I could do something like this `((\$|€|EUR)+ *([0-9,]+(\.[0-9]{1,2})?))|(([0-9,]+(\.[0-9]{1,2})?) *(\$|€|EUR)+)` to make sure it captures the sign before and after... I would have to put all the currency symbols in the `preg_match` pattern though, is that viable? – James May 10 '14 at 17:11
  • Sure. Store it in a variable so you only maintain one list and you should be good to go – Nosrac May 10 '14 at 17:16
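
The suggestion in the comments above (keep the currency markers in one variable and build the pattern from it) could look roughly like the sketch below. The `$currencies` list, the number format, and the `extract_price()` helper are all assumptions made for illustration, not something from the original posts; a real list would need many more symbols and codes.

```php
<?php
// Hypothetical list of currency markers; extend it as needed.
$currencies = ['$', '€', 'EUR'];

// Escape each marker so that characters like "$" are matched literally.
$cur = implode('|', array_map(function ($c) {
    return preg_quote($c, '/');
}, $currencies));

// Assumed number format: digits with optional "." or "," groups (80.00, 19,75, 149).
$num = '[0-9]+(?:[.,][0-9]+)*';

// Either "currency number [currency]" or "number currency".
$pattern = "/(?:{$cur})\\s*{$num}(?:\\s*(?:{$cur}))?|{$num}\\s*(?:{$cur})/u";

// Hypothetical helper: return the first price-like substring, or null if none is found.
function extract_price(string $text, string $pattern): ?string
{
    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

echo extract_price('<span >EUR 149 €</span>', $pattern), "\n"; // prints "EUR 149 €"
```

Because the whole price is returned as `$m[0]`, there is only one value to read regardless of which side the currency marker is on. Note that this matches price-like text anywhere in the input; it does not check the surrounding class or id.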
1

HTML is not a regular language and cannot be reliably parsed using regular expressions. Use a DOM parser instead. Here's a solution using PHP's built-in DOMDocument class:

$html = <<<HTML
<span class="price-sales">$80.00</span>
<div class="ProdMargin"><font class="items_price" >€19,75</font></div> 
<div class="price" id="text-price"> foo<span >EUR 149 €</span></div>
<div class="price" id="text-foo"> <span >149 €</span></div>
<div id="text-price" id="foo"> <span >149 EUR</span></div>
HTML;

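// Escape entities correctly so that loadHTML() handles non-ASCII characters
// (note: the 'HTML-ENTITIES' target is deprecated as of PHP 8.2)
$html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');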

$dom = new DOMDocument;

// Disable errors about the markup
libxml_use_internal_errors(true);

$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

// Find innermost nodes
$nodes = $xpath->query('//*[not(descendant::*)]');

// Loop through the nodes and add their text to the results array
$results = [];
foreach ($nodes as $node) {
    $results[] = $node->nodeValue;
}

var_dump($results);

Output:

array(5) {
  [0]=>
  string(6) "$80.00"
  [1]=>
  string(8) "€19,75"
  [2]=>
  string(11) "EUR 149 €"
  [3]=>
  string(7) "149 €"
  [4]=>
  string(7) "149 EUR"
}


Amal Murali
  • Thanks for the suggestion. May I ask how you are filtering the prices in that code? I don't want every single node of an HTML document, just the ones containing a "price". See my updated code. – James May 10 '14 at 17:14
  • @James: What do you mean "containing"? Anywhere in the line? Or anywhere in any of the parent nodes? – Amal Murali May 10 '14 at 17:16
  • @James: If you're trying to find the nodes that have `price` anywhere in their `id` attribute, you can easily achieve it with an XPath expression. Is that what you're looking for? – Amal Murali May 10 '14 at 17:59
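
As a rough sketch of the XPath filtering discussed in the comments above: the query below keeps only elements whose `class` or `id` contains "price", then descends to their innermost elements. The `find_prices()` helper and the exact query are assumptions made for illustration, not code from the answer, and the entity step uses `mb_encode_numericentity()` in place of the answer's `mb_convert_encoding()` call.

```php
<?php
// Hypothetical helper: text of the innermost elements at or below any element
// whose class or id attribute contains "price".
function find_prices(string $html): array
{
    // Turn non-ASCII characters (e.g. "€") into numeric entities so that
    // loadHTML() does not have to guess the input encoding.
    $html = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence markup warnings
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Step 1: elements with "price" in class or id.
    // Step 2: from each of those, the elements that have no element children.
    $query = '//*[contains(@class, "price") or contains(@id, "price")]'
           . '/descendant-or-self::*[not(*)]';

    $results = [];
    foreach ($xpath->query($query) as $node) {
        $text = trim($node->textContent);
        if ($text !== '') {
            $results[] = $text;
        }
    }
    return $results;
}

$sample = '<span class="price-sales">$80.00</span>'
        . '<div class="price" id="text-foo"> <span >149 €</span></div>'
        . '<div class="foo">bar</div>';
print_r(find_prices($sample));
```

Since `contains()` matches "price" anywhere in the attribute value, this is looser than "begins or ends with"; elements such as F, which have no "price" in either attribute, are skipped. Each result is a single string, so the question of one versus two match groups does not arise.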