Screen Scraping PHP using preg_match

Question

I am trying to create a PHP script that will retrieve the WOW factor numbers (right-hand side) from this webpage http://forums.moneysavingexpert.com/forumdisplay.php?f=36 and store them in variables or an array.

I have looked at the source code for the page, and the values (ints) appear after this code: `<div style="padding: 12px 0px 0px 0px;"><strong>`

I am currently trying to use preg_match to retrieve just 1 value (before I move on to retrieving multiple values), but I am having no luck. When I perform a var_dump there is nothing stored in my array. Also, I am not sure whether or not to escape the double quotes in the string above. If I do, then var_dump prints out

array(0) { }

If I don't then var_dump prints out

NULL

The code I am using is below:

<html>
<head>
<title>
MSE Value Extractor
</title>
</head>
<body>
<?php

echo "Welcome to MSE deal finder!\n";

$content = file_get_contents('http://forums.moneysavingexpert.com/forumdisplay.php?f=36');

preg_match('/<div style=\"padding: 12px 0px 0px 0px;\"><strong>(.)</', $content, $match);
var_dump($match);
$value = $match[1];

echo "Value obtained is $value \n";

?>

</body>
</html>

If anyone could comment on where I am going wrong, it would be hugely appreciated. I'm not that familiar with PHP.

Thanks in advance

There is [whitespace](http://www.php.net/manual/en/regexp.reference.escape.php) between the `<div>` and the `<strong>`. — mario, Dec 14 '11 at 21:02
easydomparser would let you get any data on the page. They have done the hard part for you; you just have to pull the sweet data into your app. — dm03514, Dec 14 '11 at 21:02
As an alternative to [regex] matching (which you seem too inexperienced for), you could use [QueryPath](http://stackoverflow.com/questions/tagged/querypath) and a simple `htmlqp($url)->find("div > strong")->text();`. Matching by `style=` attributes is just as ambiguous, though, so you have to hope no other div/strong pairs exist. — mario, Dec 14 '11 at 21:07
See why [you should not parse HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). — casperOne, Dec 14 '11 at 21:16
I think you should read [this](http://stackoverflow.com/a/1732454/212940) first. — vascowhite, Dec 14 '11 at 21:17
@vascowhite Indeed, this is just a bad, bad move. Just see the [accepted answer](http://stackoverflow.com/a/1732454/50776) =) — casperOne, Dec 14 '11 at 21:24
@casperOne: That joke link isn't much of an explanation. (Not to speak of relevancy [*self-contained tags?*] or [technically correct](http://stackoverflow.com/a/4234491/345031)). — mario, Dec 14 '11 at 21:25
@mario: Agreed it's not much of an explanation, but I'd argue the answer you refer to is not correct either, the reason being that [HTML is a Chomsky Type 2 grammar, while regex is a Chomsky Type 3 grammar](http://stackoverflow.com/a/1758162/50776). What you referenced was a perl program that uses regexes, not a single regex that can parse HTML (as clever as it is, you will *always* have to break it down into smaller chunks and use something *other than regex* to have a complete solution). — casperOne, Dec 14 '11 at 21:35
@vascowhite: another [joke link about regex you might not know about](http://stackoverflow.com/a/4234491/353612) — greg0ire, Dec 14 '11 at 21:49
possible duplicate of [XPATH not working on the HTML](http://stackoverflow.com/questions/6221979/xpath-not-working-on-the-html) — mario, Dec 15 '11 at 02:19

score 1 · Answer 1 · answered Dec 14 '11 at 21:02

I'm not sure regex is the best way of doing this, although it could certainly fit the bill.

What about using a DOM parser, like http://simplehtmldom.sourceforge.net/, to traverse the HTML the way you can in jQuery (if you are familiar with jQuery)?

answered Dec 14 '11 at 21:02

awoods

I'd recommend going into why using regex is not a good way of doing this; as it stands now, this is less an answer and more a comment. – casperOne Dec 14 '11 at 21:18

greg0ire · Accepted Answer · 2011-12-14T21:35:18.040

I don't think using the style attribute is very semantic... Here is a solution using DOMDocument and an XPath query:

<?php
$doc = new DOMDocument();
/* This page gives a loooot of warnings (probably because it's 
 * Money Saving Expert, not html expert)
 * Just ignore them with an @ 
 */
@$doc
  ->loadHTMLFile('http://forums.moneysavingexpert.com/forumdisplay.php?f=36');

$xpath = new DOMXPath($doc);
/* look for strong elements in td elements with a class attribute containing
 'popularity_threadbit_column' */
$list = $xpath
  ->evaluate("//td[contains(@class, 'popularity_threadbit_column')]//strong");
echo sprintf("found %d elements :" . PHP_EOL, $list->length);
foreach ($list as $element)
{
  echo $element->nodeValue . PHP_EOL;
}

Output:

$ php wow.php
found 27 elements :
5
0
0
0
0
1
0
922
112
0
290
661
390
18
2
51
0
31
163
163
46
33
103
50
90
0
109

Now you can try to write a regular expression that does the same, but I think it will be much uglier than the xpath expression we have here!
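Since the question asks for the values to be stored in an array rather than printed, here is a minimal sketch of the same XPath approach that collects them as integers. To keep it runnable offline it parses a small inline HTML sample; that sample and its `alt1`/`alt2` class names are assumptions, not a copy of the real page. It also swaps the `@` operator for `libxml_use_internal_errors(true)`, which keeps parser warnings out of the output without hiding other errors.

```php
<?php
// Hypothetical stand-in for the forum page, so the sketch runs offline.
$html = <<<'HTML'
<html><body><table><tr>
  <td class="alt1 popularity_threadbit_column"><div><strong>922</strong></div></td>
  <td class="alt2 popularity_threadbit_column"><div><strong>5</strong></div></td>
  <td class="alt1"><strong>ignored</strong></td>
</tr></table></body></html>
HTML;

// Collect parser warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//td[contains(@class, 'popularity_threadbit_column')]//strong");

// Store each value as an integer rather than echoing it.
$values = [];
foreach ($nodes as $node) {
    $values[] = (int) trim($node->nodeValue);
}

print_r($values);
```

For the live page, `$doc->loadHTML($html)` would be replaced by `$doc->loadHTMLFile($url)`, as in the code above.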

edited Dec 14 '11 at 21:35

answered Dec 14 '11 at 21:26

greg0ire

@cud_programmer: Please have the courtesy to retag your question, when picking a dom answer to your regex problem. – mario Dec 14 '11 at 21:53
@mario : Why don't you do it yourself ? I don't think cud_programmer would mind it. – greg0ire Dec 14 '11 at 22:01
Nah. Not my duty. I merely wanted to complain about tag rot. (Though it's not that SO at large is actually fixable at this point). But I shall find a duplicate for this at least... – mario Dec 14 '11 at 22:13

score 0 · Answer 3 · edited Dec 15 '11 at 01:52

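It seems like you need a `*` after the `.` inside the group in the regex: `(.)` matches exactly one character, so a value with more than one digit will not match.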

you can test your regular expressions here: http://www.pagecolumn.com/tool/pregtest.htm

hope this helps.
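For completeness, here is a rough sketch of how the regex route could look once the quantifier and the whitespace mentioned in the comments are both handled. The sample markup is hypothetical, modelled on the snippet quoted in the question, so the pattern may need adjusting for the real page. It uses `(\d+)` instead of `(.*)` because the values are integers, and `preg_match_all` to pick up every value at once.

```php
<?php
// Hypothetical sample markup, modelled on the snippet quoted in the question;
// the real page may differ.
$content = <<<'HTML'
<td><div style="padding: 12px 0px 0px 0px;">
    <strong>922</strong></div></td>
<td><div style="padding: 12px 0px 0px 0px;"><strong>5</strong></div></td>
HTML;

// \s* tolerates whitespace between the tags; (\d+) captures the whole integer.
// '~' is used as the delimiter so the '/' in a closing tag needs no escaping.
$pattern = '~<div style="padding: 12px 0px 0px 0px;">\s*<strong>(\d+)</strong>~';

// preg_match_all collects every match; $matches[1] holds the captured groups.
preg_match_all($pattern, $content, $matches);
$values = array_map('intval', $matches[1]);

var_dump($values);
```

The double quotes need no escaping here because the pattern is a single-quoted PHP string and `"` has no special meaning in a regular expression.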

edited Dec 15 '11 at 01:52

hakre

answered Dec 14 '11 at 21:07

Jarry
