
I am trying to create a PHP script that will retrieve the WOW factor numbers (right-hand side) from this webpage http://forums.moneysavingexpert.com/forumdisplay.php?f=36 and store them in variables or an array.

I have looked in the source code for the page and the values (ints) appear after this code "<div style="padding: 12px 0px 0px 0px;"><strong>"

I am currently trying to use preg_match to retrieve just one value (before I move on to retrieving multiple values), but I am having no luck. When I perform a var_dump, nothing is stored in my array. Also, I am not sure whether to escape the double quotes in the string above. If I do, var_dump prints out

array(0) { }

If I don't then var_dump prints out

NULL

The code I am using is below:

<html>
<head>
<title>
MSE Value Extractor
</title>
</head>
<body>
<?php

echo "Welcome to MSE deal finder!\n";

$content = file_get_contents('http://forums.moneysavingexpert.com/forumdisplay.php?f=36');

preg_match('/<div style=\"padding: 12px 0px 0px 0px;\"><strong>(.)</', $content, $match);
var_dump($match);
$value = $match[1];

echo "Value obtained is $value \n";

?>

</body>
</html>

If anyone could comment on where I am going wrong, it would be hugely appreciated. I'm not that familiar with PHP.

Thanks in advance

PeeHaa
cud_programmer
  • There is [whitespace](http://www.php.net/manual/en/regexp.reference.escape.php) between the `<div>` and the `<strong>`.
    – mario Dec 14 '11 at 21:02
  • easydomparser would let you get any data on the page; they have done the hard part for you, so you just have to pull the sweet data into your app. – dm03514 Dec 14 '11 at 21:02
  • As an alternative to [regex] matching (which you seem too inexperienced for) you could utilize [QueryPath](http://stackoverflow.com/questions/tagged/querypath) and a simple `htmlqp($url)->find("div > strong")->text();`. Though matching by `style=` attributes is just as ambiguous, so you have to hope no other div/strong pairs exist. – mario Dec 14 '11 at 21:07
  • See why [you should not parse HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). – casperOne Dec 14 '11 at 21:16
  • I think you should read [this](http://stackoverflow.com/a/1732454/212940) first. – vascowhite Dec 14 '11 at 21:17
  • @vascowhite Indeed, this is just a bad, bad move. Just see the [accepted answer](http://stackoverflow.com/a/1732454/50776) =) – casperOne Dec 14 '11 at 21:24
  • @casperOne: That joke link isn't much of an explanation. (Not to speak of relevancy [*self-contained tags?*] or [technically correct](http://stackoverflow.com/a/4234491/345031)). – mario Dec 14 '11 at 21:25
  • @mario: Agreed it's not much of an explanation, but I'd argue the answer you refer to is not correct either, the reason being that [HTML is a Chomsky Type 2 grammar, while regex is a Chomsky Type 3 grammar](http://stackoverflow.com/a/1758162/50776). What you referenced was a perl program that uses regexes, not a single regex that can parse HTML (as clever as it is, you will *always* have to break it down into smaller chunks and use something *other than regex* to have a complete solution). – casperOne Dec 14 '11 at 21:35
  • Thanks all :) Very grateful to everyone here! – cud_programmer Dec 14 '11 at 21:41
  • @vascowhite: another [joke link about regex you might not know about](http://stackoverflow.com/a/4234491/353612) – greg0ire Dec 14 '11 at 21:49
  • possible duplicate of [XPATH not working on the HTML](http://stackoverflow.com/questions/6221979/xpath-not-working-on-the-html) – mario Dec 15 '11 at 02:19

3 Answers

I'm not sure regex is the best way of doing this, although it could certainly fit the bill.

What about using a domparser, like http://simplehtmldom.sourceforge.net/, to traverse the HTML like you can in jQuery (if you are familiar with jQuery)?

awoods
  • I'd recommend explaining why using regex is not a good way of doing this; as it stands now, this answer is more of a comment than an answer. – casperOne Dec 14 '11 at 21:18

I don't think using the style attribute is very semantic... here is a solution using DOMDocument and an XPath query:

<?php
$doc = new DOMDocument();
/* This page gives a loooot of warnings (probably because it's 
 * Money Saving Expert, not html expert)
 * Just ignore them with an @ 
 */
@$doc
  ->loadHTMLFile('http://forums.moneysavingexpert.com/forumdisplay.php?f=36');

$xpath = new DOMXPath($doc);
/* look for strong elements in td elements with a class attribute containing
 'popularity_threadbit_column' */
$list = $xpath
  ->evaluate("//td[contains(@class, 'popularity_threadbit_column')]//strong");
echo sprintf("found %d elements :" . PHP_EOL, $list->length);
foreach ($list as $element)
{
  echo $element->nodeValue . PHP_EOL;
}

Output:

$ php wow.php
found 27 elements :
5
0
0
0
0
1
0
922
112
0
290
661
390
18
2
51
0
31
163
163
46
33
103
50
90
0
109

Now you can try to write a regular expression that does the same, but I think it will be much uglier than the xpath expression we have here!
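
For comparison, here is a rough sketch of what the regex route might look like with `preg_match_all`. The `$html` sample is a made-up fragment modelled on the markup quoted in the question (the real page may well differ), and `$pattern` and `$values` are illustrative names. The `\s*` parts allow for whitespace between the tags.

```php
<?php
// Hypothetical fragment imitating the forum markup; not copied from the real page.
$html = <<<HTML
<td class="popularity_threadbit_column">
  <div style="padding: 12px 0px 0px 0px;">
    <strong>922</strong>
  </div>
</td>
<td class="popularity_threadbit_column">
  <div style="padding: 12px 0px 0px 0px;">
    <strong>0</strong>
  </div>
</td>
HTML;

// '~' is used as the delimiter so '/' needs no escaping.
// \s* tolerates whitespace and newlines between tags; (\d+) captures the whole integer.
$pattern = '~<div style="padding: 12px 0px 0px 0px;">\s*<strong>\s*(\d+)\s*</strong>~';

preg_match_all($pattern, $html, $matches);

// $matches[1] holds the captured groups as strings; cast them to integers.
$values = array_map('intval', $matches[1]);

print_r($values);
```

This only works as long as the `style` attribute is written exactly as in the pattern, which is the fragility the XPath version avoids.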

greg0ire
  • @cud_programmer: Please have the courtesy to retag your question, when picking a dom answer to your regex problem. – mario Dec 14 '11 at 21:53
  • @mario : Why don't you do it yourself ? I don't think cud_programmer would mind it. – greg0ire Dec 14 '11 at 22:01
  • Nah. Not my duty. I merely wanted to complain about tag rot. (Though it's not that SO at large is actually fixable at this point). But I shall find a duplicate for this at least... – mario Dec 14 '11 at 22:13

It seems like you need a `*` after the `.` in the regex, i.e. `(.*)` instead of `(.)`, since `(.)` captures exactly one character.

You can test your regular expressions here: http://www.pagecolumn.com/tool/pregtest.htm

Hope this helps.
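
A minimal sketch of a corrected single-value match, run against a made-up `$content` string rather than the live page (so the markup here is an assumption). It uses `(\d+)` rather than `(.*)` to capture only digits, and `\s*` to allow for the whitespace between the tags mentioned in the comments on the question. Inside a single-quoted PHP string the double quotes do not need escaping.

```php
<?php
// Made-up stand-in for the downloaded page; the real markup may differ.
$content = "<div style=\"padding: 12px 0px 0px 0px;\">\n    <strong>112</strong></div>";

// '#' is the delimiter so '/' needs no escaping.
// \s* allows whitespace between the tags; (\d+) captures one or more digits
// instead of the single character matched by (.).
$pattern = '#<div style="padding: 12px 0px 0px 0px;">\s*<strong>(\d+)</strong>#';

if (preg_match($pattern, $content, $match) === 1) {
    $value = (int) $match[1];
    echo "Value obtained is $value\n";
} else {
    // Check the return value before reading $match[1] to avoid an undefined-index notice.
    $value = null;
    echo "No match found\n";
}
```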

hakre
Jarry