I'm scraping a website's HTML with Perl and trying to match the download URL and version number with a regex, as in the code below. No matter what I try, the pattern does not match the string.

String to be matched: <a itemprop='downloadUrl' href='http://downloads.wordpress.org/plugin/wordfence.5.0.9.zip'>Download Version 5.0.9</a> </p>

I need to capture the value of the href attribute and the version number.

I tried:

if($page =~ /.*<a itemprop='downloadUrl' href='(.*)' Download Version (.*)<\/a>/) 
        {
            $url = $1;
            $version = $2;

$page contains a block like:

<div id="plugin-description">
    <p itemprop="description" class="shortdesc">
        Wordfence Security is a free enterprise class security and performance plugin that makes your site up to 50 times faster and more secure.   </p>
    <div class="description-right">
                <p class="button">
            <a itemprop='downloadUrl' href='http://downloads.wordpress.org/plugin/wordfence.5.0.9.zip'>Download Version 5.0.9</a>       </p>
<meta itemprop="softwareVersion" content="5.0.9" />
<meta itemprop="fileFormat" content="application/zip" />

                    </div>
</div>
RobEarl
Joel G Mathew
  • Why not use an [`HTML::Parser`](https://metacpan.org/pod/HTML::Parser)? – chrsblck May 28 '14 at 16:43
  • I'd like to keep external modules to a minimum. Is that bad? – Joel G Mathew May 28 '14 at 16:43
  • Well it depends on how much data you are trying to extract. If you are only grabbing href tags, regular expressions should be sufficient, but the syntax of HTML is not a regular grammar and cannot be fully parsed by regular expressions (even in Perl). – Hunter McMillen May 28 '14 at 16:45
  • It is not bad to "keep to a minimum". But HTML parsers are much, much more robust for this kind of thing than regex. So - be flexible and recognize that rules are there to be used in the majority of cases. Here, the rule "don't parse HTML with regex" should take precedence over the rule "don't use external modules". – Floris May 28 '14 at 16:45
  • If you insist, you can improve your expression with things like `href=(\S+)` or at least `(.*?)` - the former will stop at the first white space, the latter will match "as little as possible" so you don't run into trouble if there are two links on the same page. – Floris May 28 '14 at 16:47
  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Oesor May 28 '14 at 18:33
  • @Oesor [Obligatory link](http://stackoverflow.com/q/4231382/471272): please link to answer, not to non-answers. – tchrist Jun 08 '14 at 20:06

2 Answers

Your regular expression is not matching because it is missing the closing > of the opening <a ...> tag, and it has a stray space before Download that does not appear in the HTML. Add the > and remove the space:

if ($page =~ /.*<a itemprop='downloadUrl' href='(.*)'>Download Version (.*)<\/a>/)
                                                     ^^

Note: You should follow all .* with ? for a non-greedy match.
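
A minimal sketch of the corrected match with non-greedy quantifiers. It assumes $page holds markup like the snippet in the question; the heredoc below only stands in for the scraped page:

```perl
use strict;
use warnings;

# Stand-in for the scraped page: the relevant lines from the question's HTML.
my $page = <<'HTML';
<p class="button">
    <a itemprop='downloadUrl' href='http://downloads.wordpress.org/plugin/wordfence.5.0.9.zip'>Download Version 5.0.9</a>       </p>
HTML

my ($url, $version);

# Closing > added after the href value, stray space removed,
# and both captures made non-greedy with .*?
if ($page =~ /<a itemprop='downloadUrl' href='(.*?)'>Download Version (.*?)<\/a>/) {
    ($url, $version) = ($1, $2);
    print "$url -> $version\n";
}
```

The leading .* from the original pattern is dropped here because an unanchored match already searches the whole string.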

hwnd

Use an actual HTML parser such as Mojo::DOM, together with Mojo::UserAgent to fetch the page. There's a nice 8-minute video on what this framework can do at Mojocast Episode 5.

use strict;
use warnings;

use Mojo::UserAgent;

my $url = "https://wordpress.org/plugins/wordfence/";

my $ua = Mojo::UserAgent->new;
my $dom = $ua->get($url)->res->dom;

# Process every link marked as the download URL
for my $link ($dom->find('a[itemprop=downloadUrl]')->each) {
    if ($link->text =~ /Download Version (.*)/) {
        print "$link->{href} -> $1\n";
    }
}

Outputs:

http://downloads.wordpress.org/plugin/wordfence.5.0.9.zip -> 5.0.9

Note: One flaw in your regex is that you're using greedy matching everywhere. You should change all your .* to .*?.
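
To illustrate the difference, here is a small sketch using a hypothetical one-line input with two links (the file names a.zip and b.zip are made up for the example). Because . does not match a newline by default, greedy matching only overshoots when both links sit on the same line, as they do here:

```perl
use strict;
use warnings;

# Hypothetical input: two download links on a single line.
my $html = "<a href='a.zip'>Download Version 1.0</a> <a href='b.zip'>Download Version 2.0</a>";

# Greedy: .* grabs as much as possible, so the capture runs to the LAST '> on the line.
my ($greedy) = $html =~ /href='(.*)'>/;

# Non-greedy: .*? stops at the FIRST '> it can reach.
my ($non_greedy) = $html =~ /href='(.*?)'>/;

# Alternative: a negated character class cannot cross a quote at all.
my ($negated) = $html =~ /href='([^']*)'/;

print "greedy:     $greedy\n";      # a.zip'>Download Version 1.0</a> <a href='b.zip
print "non-greedy: $non_greedy\n";  # a.zip
print "negated:    $negated\n";     # a.zip
```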

Miller