Regular expression in HTML

Question

I'm going crazy over here trying to fetch HTML links from an HTML source file. What am I missing? I've tested the regular expression and it works fine, but here it returns nothing at all (I've tried different websites as well).

#!/usr/bin/perl

use LWP::Simple;
my $url = "http://www.svenskaspel.se";
my $content = get($url);
my @links = ();

$content =~ /<a[^>]* href=([^"]*)/;
push (@links, $_);

foreach $_ (@links){
    print "$_\n";

    }

[Don't use regex for this](http://blog.codinghorror.com/parsing-html-the-cthulhu-way/), get a proper parser like [HTML::TreeBuilder::LibXML](https://metacpan.org/pod/HTML::TreeBuilder::LibXML) — Quentin, Mar 12 '15 at 12:42
Please consider bookmarking the [Stack Overflow Regular Expressions FAQ](http://stackoverflow.com/a/22944075/2736496) for future reference. And yes, [do not use regex to parse html](http://stackoverflow.com/a/1732454) — aliteralmind, Mar 12 '15 at 13:03
Use a parser, but keep in mind that XML and HTML are different things. — Sinan Ünür, Mar 12 '15 at 14:58

score 2 · Answer 1 · edited May 23 '17 at 11:43


Seriously, don't. Parsing HTML with a regex is a dirty hack that produces brittle code. See the Stack Overflow question "RegEx match open tags except XHTML self-contained tags".

For some alternative approaches, see "How do I extract links from HTML with a Perl regex?".

answered Mar 12 '15 at 12:58

Sobrique


What happened to closing duplicates? – Sinan Ünür Mar 12 '15 at 14:59

score 0 · Answer 2 · answered Mar 12 '15 at 12:45


The lines:

$content =~ /<a[^>]* href=([^"]*)/;
push (@links, $_);

should be (note the opening double quote after href= and the capture variable $1 instead of $_):

$content =~ /<a[^>]* href="([^"]*)/;
push (@links, $1);
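
Note that a single match like this captures only the first link on the page. What follows is a minimal sketch of collecting every double-quoted href by looping with the /g modifier. It assumes a small inline HTML sample (hypothetical, standing in for the fetched $content) and, like any regex approach, it will miss unquoted or single-quoted attributes:

```perl
use strict;
use warnings;

# Hypothetical sample markup standing in for the page returned by get($url).
my $content = '<p><a class="x" href="/spela">Spela</a> <a href="/triss">Triss</a></p>';

my @links;

# With /g in scalar context, each iteration resumes where the previous
# match ended, so $1 holds the next href value every time round the loop.
while ($content =~ /<a\b[^>]*\shref="([^"]*)"/g) {
    push @links, $1;
}

print "$_\n" for @links;
```

With the sample above, @links ends up holding /spela and /triss, in that order.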

answered Mar 12 '15 at 12:45

Toto


Still not working, sorry. – kaktusräv Mar 12 '15 at 12:47
@kaktusräv: see my edit. – Toto Mar 12 '15 at 12:48

score 0 · Answer 3 · edited May 23 '17 at 11:56


Your regex seems incorrect.

Check this link, http://regexr.com/3ajeh, to see a working regular expression. Paste the source of any HTML page into the text area to test it.

Please note that regular expressions are not a recommended way of parsing HTML, because HTML is not a regular language.

See this famous answer.

answered Mar 12 '15 at 12:54

Heisenberg


G. Cito · Answer 4 · 2015-03-13T13:51:49.307

Unless the HTML file is extraordinarily simple (a list of links) you should probably avoid parsing it yourself as has been mentioned. In this answer I'll suggest that you can "cheat" and install something from CPAN to help :-)

For example, you could use Mojolicious, specifically the mojo command-line tool that is installed with it:

mojo get https://www.svenskaspel.se a attr href

which in "long form" is something like:

perl -Mojo -E 'my $ua = Mojo::UserAgent->new;
      say $ua->get("https://www.svenskaspel.se")
        ->res->dom->find("a[href]")->map(attr => "href")->join("\n");'

The longer one-liner outputs:

/
/
/spela
/mina-spel


/bomben
#
/stryktipset/tipssm
/triss
/grasroten
/spelkoll
/kundservice
/om-cookies

which includes blank lines because some of the href attributes have no content (href="").

You can refine the selector using the matching syntax documented under SELECTORS for Mojo::DOM. Much like CSS selectors in the browser DOM, something like ...->dom->find("a[href^=/]") matches only href attributes whose values begin with "/".
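
The same selectors can be tried without any network access by handing Mojo::DOM a string. This is a rough sketch, assuming Mojolicious is installed; the sample markup and variable names are made up for illustration:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample markup; a real page would come from Mojo::UserAgent.
my $html = <<'HTML';
<ul>
  <li><a href="/spela">Spela</a></li>
  <li><a href="">Empty</a></li>
  <li><a href="#">Top</a></li>
  <li><a href="https://example.com/">External</a></li>
  <li><a name="anchor">No href</a></li>
</ul>
HTML

my $dom = Mojo::DOM->new($html);

# Every <a> element that has an href attribute, including the empty one.
my @all = $dom->find('a[href]')->map(attr => 'href')->each;

# Drop the empty values that would otherwise print as blank lines.
my @non_empty = grep { length } @all;

# Only href values that begin with "/".
my @site_relative = $dom->find('a[href^="/"]')->map(attr => 'href')->each;

print "all: ", scalar(@all), ", non-empty: ", scalar(@non_empty), "\n";
print "$_\n" for @site_relative;
```

Filtering with grep { length } is one way to get rid of the blank lines caused by href="" in the output shown earlier.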
