0

I'm going crazy over here trying to extract HTML links from an HTML source file. What am I missing? I've tested the regular expression and it works fine, but here it returns nothing at all (I've tried different websites as well).

#!/usr/bin/perl

use LWP::Simple;
my $url = "http://www.svenskaspel.se";
my $content = get($url);
my @links = ();

$content =~ /<a[^>]* href=([^"]*)/;
push (@links, $_);

foreach $_ (@links){
    print "$_\n";

    }
kaktusräv

4 Answers

2

Seriously, don't. Parsing HTML with a regex is a dirty hack and creates brittle code. See "RegEx match open tags except XHTML self-contained tags".

Here's an example of some alternative approaches: "How do I extract links from HTML with a Perl regex?"

Sobrique
0

The lines:

$content =~ /<a[^>]* href=([^"]*)/;
push (@links, $_);

must be:

$content =~ /<a[^>]* href="([^"]*)/;
push (@links, $1);
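
Note that a single match only captures the first link. As a minimal sketch (not necessarily what the original poster needs), the same corrected pattern can be applied with the /g modifier in list context to collect every capture at once. The sample markup in $content below is a made-up stand-in for the fetched page, and the pattern still assumes double-quoted href values:

```perl
use strict;
use warnings;

# Hypothetical sample markup standing in for the result of get($url).
my $content = '<a href="/spela">Spela</a> <a class="x" href="/triss">Triss</a>';

# In list context with /g, the match returns every captured group,
# so @links receives one entry per matching <a ... href="..."> tag.
my @links = $content =~ /<a[^>]* href="([^"]*)"/g;

print "$_\n" for @links;
```

This remains a regex-based approach, so it shares the brittleness the other answers warn about (single-quoted or unquoted attributes, for instance, are not matched).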
Toto
0

Your regex seems incorrect.

See http://regexr.com/3ajeh for a working regular expression. Paste the source of any HTML page into the text area to test it.

Please note that regular expressions are not a recommended way of parsing HTML, because HTML is not a regular language.

See this famous answer.

Heisenberg
0

Unless the HTML file is extraordinarily simple (a list of links), you should probably avoid parsing it yourself, as has been mentioned. In this answer I'll suggest that you can "cheat" and install something from CPAN to help :-)

For example, you could use Mojolicious, specifically the mojo tool that is included with that distribution:

mojo get https://www.svenskaspel.se a attr href

which in "long form" is something like:

perl -Mojo -E ' my $ua = Mojo::UserAgent->new;
      say $ua->get("https://www.svenskaspel.se")
     ->res->dom->find("a[href]")->map(attr => "href")->join("\n");'

The longer one-liner outputs:

/
/
/spela
/mina-spel


/bomben
#
/stryktipset/tipssm
/triss
/grasroten
/spelkoll
/kundservice
/om-cookies

which includes blank lines because some of the href attributes have no content (href="").

You can control what gets matched using the selector syntax from Mojo::DOM SELECTORS. That way, much as with CSS selectors in the DOM, something like ...->dom->find("a[href^=/]") would select only the a elements whose href attribute value begins with "/".
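
As a rough sketch of the same idea in script form, the following parses a small literal HTML string with Mojo::DOM instead of fetching a page, so it runs without network access. It assumes Mojolicious is installed from CPAN; the markup in $html and the variable names @all and @rooted are invented for illustration:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample markup standing in for a fetched page.
my $html = <<'HTML';
<a href="/">Home</a>
<a href="">Empty</a>
<a href="#">Top</a>
<a href="/spela">Spela</a>
<a name="anchor">No href</a>
HTML

my $dom = Mojo::DOM->new($html);

# Every href value, with empty ones (href="") filtered out.
my @all = $dom->find('a[href]')->map(attr => 'href')->grep(sub { length })->each;

# Only href values that begin with "/".
my @rooted = $dom->find('a[href^="/"]')->map(attr => 'href')->each;

print join("\n", @all), "\n";
print "--\n";
print join("\n", @rooted), "\n";
```

Filtering with grep is one way to drop the blank lines seen in the output above; tightening the selector, as in the second query, is another.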

G. Cito