Perl Regex for Not HTML

Question

I am looking to remove anything that is not an HTML tag from an HTML document. In other words, I want to get rid of all the text within the document and keep only the tags.

I have the below regex to remove all HTML from a string, but need help with the opposite scenario.

$string =~ s/<[^>]+>//g;

Thanks.

NooooooooOOOOooOOOOoooOOoooo!!!! http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — meda, Oct 22 '13 at 23:00
Please don't do this. This is the way to [madness](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454) — , Oct 22 '13 at 23:02
What is not an HTML tag in an HTML document? If it's well-formed, everything except comments goes inside a tag of some sort. Are you looking for text inside the body not inside another tag? — Ethan Brown, Oct 22 '13 at 23:12
@Ethan Brown: Yes, looking to eliminate the text that is not within an HTML tag. — user333746, Oct 22 '13 at 23:19
You didn't really answer my question. For example, if this is your document: `Here's some bold text!`, are you looking for the strings "Here's some " and " text!"? Because neither of those strings are outside of an HTML tag (they're both inside the `` tag). — Ethan Brown, Oct 22 '13 at 23:23
Well, everyone will say the same: don't use regex to parse HTML! It can be done, but you don't have many guarantees in programming, and regex can't be guaranteed to work properly with HTML. That said, check out (one of) Perl's HTML [parsers](http://search.cpan.org/dist/HTML-Parser/Parser.pm) — gwillie, Oct 23 '13 at 00:08
@Ethan Brown: I am looking for . So anything inside <> and not outside it. Sorry for the poorly worded question, but I figured it out, anyways. Thanks. — user333746, Oct 23 '13 at 05:45
If you *must* use regexp, Regexp::Common (or others) would be a good starter. Sadly it doesn't support HTML but this is "forthcoming". — ashley, Oct 24 '13 at 07:04
@meda [Obligatory link](http://stackoverflow.com/q/4231382/471272): please link to actual answers, not to non-answers. — tchrist, Jun 08 '14 at 20:11

score 1 · Answer 1 · edited Apr 05 '14 at 13:46

Ethan Brown namechecks HTML::DOM as if it were the only CPAN solution.

HTML::Parser is more ubiquitous, but it's not hard to Google for more.

http://metacpan.org/pod/HTML::Parser

A solution using HTML::Parser is (tested once):

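use strict;
use warnings;
use HTML::Parser ();

my $p = HTML::Parser->new(api_version => 3);
# Text events get an empty handler with no arguments, so text is dropped.
$p->handler( text => sub { }, "");
# Every other event (tags, comments, declarations) prints its original source text.
$p->handler( default => sub { print shift }, "text");
$p->parse_file('content.html') || die $!;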

optional · Answer 2 · 2013-10-24T07:52:29.683

1

If this s///ubstitution removes all HTML from the document:

$string =~ s/<[^>]+>//g;

then you can use the same regex in a m//atch operator to keep only the HTML:

$string = join '', $string =~ m/<[^>]+>/g;

If the above regex satisfies your requirements, then you're done :) But you may want to consider this older, slightly longer regex pattern: http://perlmonks.org/?node_id=161281 Mind the caveats that Ethan Brown mentions :)
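A minimal sketch of the two operations side by side. The sample string in `$html` is made up for illustration, and the pattern is the question's own, unchanged, so the same caveats apply.

```perl
use strict;
use warnings;

# Hypothetical sample input, made up for illustration.
my $html = '<p>Hello <b>bold</b> world</p>';

# s///: delete every tag, keeping only the text.
(my $text_only = $html) =~ s/<[^>]+>//g;

# m//g in list context returns every match; joining them keeps only the tags.
my $tags_only = join '', $html =~ m/<[^>]+>/g;

print "$text_only\n";   # Hello bold world
print "$tags_only\n";   # <p><b></b></p>
```

Copying into a new variable with `(my $copy = $html) =~ ...` leaves the original string untouched, which makes it easier to compare both results.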

edited Oct 24 '13 at 07:52

answered Oct 24 '13 at 07:41

optional


This idea (extracting all tags) is better than deleting anything between tags. However, your regex fails for `` → ` – amon Oct 24 '13 at 07:56
:) you already said that amon, its the OPs regex unchanged :) – optional Oct 24 '13 at 08:21

score 0 · Accepted Answer · answered Oct 22 '13 at 23:45


Are you looking for this?

$string =~ s/>[^<]*</></mg;

Or this?

$string =~ s/(?<=>)[^<]*(?=<)//mg;
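To see how the two substitutions behave, here is a small sketch on a made-up sample string (the `$html` content is an assumption, not from the question). The `/m` flag is left out because neither pattern uses `^` or `$`.

```perl
use strict;
use warnings;

# Hypothetical sample with text before the first tag and after the last one.
my $html = 'before<p>Hello <b>bold</b> world</p>after';

# Variant 1: match ">text<" and put the two delimiters back.
(my $v1 = $html) =~ s/>[^<]*</></g;

# Variant 2: lookbehind and lookahead assert the delimiters without consuming them.
(my $v2 = $html) =~ s/(?<=>)[^<]*(?=<)//g;

print "$v1\n";   # before<p><b></b></p>after
print "$v2\n";   # before<p><b></b></p>after
```

Both variants give the same result on this input. Text that is not enclosed between a `>` and a `<`, such as the leading `before` and trailing `after`, is left in place.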

answered Oct 22 '13 at 23:45

traybold


Your solution fails on comments like `
` → `-->
` and on `script` tags like `` → ``. Also, text at the end of a document without explicit head or body isn't removed: `
Headline
Text until EOF` → `
Text until EOF`
– amon Oct 23 '13 at 07:47
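The examples in the comment above did not survive, so the following is a sketch with a made-up input, not the commenter's original. It illustrates one way the first substitution can damage a comment whose body contains `>` and `<`.

```perl
use strict;
use warnings;

# Hypothetical input: a comment whose body contains both ">" and "<".
my $html = '<p>x</p><!-- if a > b then c < d --><p>y</p>';

# The regex cannot tell a ">" inside a comment from the end of a tag.
(my $out = $html) =~ s/>[^<]*</></g;

print "$out\n";   # <p></p><!-- if a >< d --><p></p>
```

The paragraph text is removed as intended, but part of the comment body is deleted too, because the pattern treats the `>` and `<` inside the comment as tag delimiters.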

score 0 · Answer 4 · answered Oct 24 '13 at 08:28

XML::LibXML makes it easy to select everything that isn't a tag, comment, or processing instruction (that is, the text nodes) and remove it:

#!/usr/bin/perl --
use strict;
use warnings;
use XML::LibXML 1.70; ## for load_html/load_xml/location
use XML::LibXML::PrettyPrint;

Main( @ARGV );
exit( 0 );
sub Main {
    binmode STDOUT;
    my $loc = shift or die "
Usage:
    $0  ko00010.html
    $0  http://example.com/ko00010.html\n\n";

    my $dom = XML::LibXML->new(
        qw/
          recover 2
          no_blanks 1
          /
    )->load_html( location => $loc, );

## http://www.w3.org/TR/xpath/#node-tests
## http://www.w3.org/TR/xpath/#NT-NodeType
## http://www.w3.org/TR/xpath/#section-Text-Nodes
    for my $text ( $dom->findnodes(q{ //text() }) ){
        node_detach( $text );
    }


    local $XML::LibXML::skipXMLDeclaration = 1; ## <?xml ?>
    local $XML::LibXML::setTagCompression = 0;  ## <p />

#~     print "$dom";

    my $pp  = XML::LibXML::PrettyPrint->new_for_html;
    $pp->{indent_string}=' ';
    print $pp->pretty_print( $dom );
}
sub node_detach {
    my( $self ) = @_;
    $self->parentNode->removeChild( $self );
}

It's worth noting that any compliant DOM-based solution will wrap the HTML fragment inside a minimal `<html><body>...</body></html>` wrapper. This parser also sticks to HTML4 semantics (in contrast to HTML5), and will introduce closing tags where there weren't any in our input. — amon, Oct 24 '13 at 08:43
