0

I want to remove everything that is not an HTML tag from an HTML document. In other words, I am trying to get rid of all the text within the document and keep only the tags.

I have the regex below, which removes all HTML tags from a string, but I need help with the opposite scenario.

$string =~ s/<[^>]+>//g;

Thanks.

zero323
user333746
  • 10
    NooooooooOOOOooOOOOoooOOoooo!!!! http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – meda Oct 22 '13 at 23:00
  • 5
    Please don't do this. This is the way to [madness](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454) –  Oct 22 '13 at 23:02
  • 1
    What is not an HTML tag in an HTML document? If it's well-formed, everything except comments goes inside a tag of some sort. Are you looking for text inside the body not inside another tag? – Ethan Brown Oct 22 '13 at 23:12
  • @Ethan Brown: Yes, looking to eliminate the text that is not within an HTML tag. – user333746 Oct 22 '13 at 23:19
  • 2
    You didn't really answer my question. For example, if this is your document: `Here's some bold text!`, are you looking for the strings "Here's some " and " text!"? Because neither of those strings are outside of an HTML tag (they're both inside the `` tag). – Ethan Brown Oct 22 '13 at 23:23
  • Well, everyone will say the same: don't use regex to parse HTML! It can be done, but you don't have many guarantees in programming, and regex can't be guaranteed to work properly with HTML. That said, check out (one of) Perl's HTML [parsers](http://search.cpan.org/dist/HTML-Parser/Parser.pm) – gwillie Oct 23 '13 at 00:08
  • @Ethan Brown: I am looking for . So anything inside <> and not outside it. Sorry for the poorly worded question, but I figured it out, anyways. Thanks. – user333746 Oct 23 '13 at 05:45
  • If you *must* use regexp, Regexp::Common (or others) would be a good starter. Sadly it doesn't support HTML but this is "forthcoming". – ashley Oct 24 '13 at 07:04
  • @meda [Obligatory link](http://stackoverflow.com/q/4231382/471272): please link to actual answers, not to non-answers. – tchrist Jun 08 '14 at 20:11

4 Answers

1

Ethan Brown namechecks HTML::DOM as if it were the only CPAN solution.

HTML::Parser is more ubiquitous, but it's not hard to Google for more.

http://metacpan.org/pod/HTML::Parser

A solution using HTML::Parser is (tested once):

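use strict;
use warnings;
use HTML::Parser ();

my $p = HTML::Parser->new(api_version => 3);
# Drop text: the empty argspec passes nothing to a no-op handler.
$p->handler( text => sub { }, "");
# Print every other event (tags, comments, declarations) as it appeared in the source.
$p->handler( default => sub { print shift }, "text");
$p->parse_file('content.html') || die $!;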
szabgab
ashley
1

If this s/// substitution removes all HTML tags from the document

$string =~ s/<[^>]+>//g;

then you can use the same regex with the m// match operator in list context to keep only the HTML tags

$string = join '', $string =~ m/<[^>]+>/g;

If the above regex satisfies your requirements, then you're done :) Otherwise, you may want to consider this older, slightly longer regex pattern: http://perlmonks.org/?node_id=161281 Mind the caveats that Ethan Brown mentions :)
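
As a minimal sketch of the match-and-join idea above, the following keeps only the substrings that look like tags. The `keep_tags` helper name and the sample string are illustrative assumptions, not part of the answer. It inherits the usual limits of `<[^>]+>`, for example a `>` inside an attribute value or a comment.

```perl
use strict;
use warnings;

# Hypothetical helper: return only the substrings matching <[^>]+>.
sub keep_tags {
    my ($html) = @_;
    # In list context, m//g returns every match; join glues them together.
    return join '', $html =~ m/<[^>]+>/g;
}

print keep_tags('<p>Hello <b>bold</b> world</p>'), "\n";   # prints <p><b></b></p>
```

A string without any tags yields an empty string, because the match returns an empty list.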

optional
0

Are you looking for this?

$string =~ s/>[^<]*</></mg;

Or this?

$string =~ s/(?<=>)[^<]*(?=<)//mg;
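
A small sketch comparing the two substitutions on made-up input may help. The `strip_between` and `strip_lookaround` names and the sample strings are assumptions for illustration. The `/m` flag is left out here because it only changes how `^` and `$` match, which these patterns do not use. Both variants only remove text that sits between a `>` and a `<`, so text before the first tag or after the last tag survives.

```perl
use strict;
use warnings;

# Hypothetical helpers wrapping the two substitutions above.
sub strip_between {
    my ($s) = @_;
    $s =~ s/>[^<]*</></g;            # consume ">text<" and put "><" back
    return $s;
}

sub strip_lookaround {
    my ($s) = @_;
    $s =~ s/(?<=>)[^<]*(?=<)//g;     # delete only the text; the delimiters stay
    return $s;
}

print strip_between('<p>Hello <b>bold</b> world</p>'), "\n";     # prints <p><b></b></p>
print strip_lookaround('<p>Hello <b>bold</b> world</p>'), "\n";  # prints <p><b></b></p>
print strip_between('Intro <p>text</p> outro'), "\n";            # prints Intro <p></p> outro
```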
traybold
  • 3
    Your solution fails on comments like `` → `-->` and on `script` tags like `` → ``. Also, text at the end of a document without explicit head or body isn't removed: `Headline Text until EOF` → `Text until EOF` – amon Oct 23 '13 at 07:47
0

XML::LibXML makes it easy to select everything that isn't a tag, comment, or processing instruction (that is, the text nodes) and remove it:

#!/usr/bin/perl --
use strict;
use warnings;
use XML::LibXML 1.70; ## for load_html/load_xml/location
use XML::LibXML::PrettyPrint;

Main( @ARGV );
exit( 0 );
sub Main {
    binmode STDOUT;
    my $loc = shift or die "
Usage:
    $0  ko00010.html
    $0  http://example.com/ko00010.html\n\n";

    my $dom = XML::LibXML->new(
        qw/
          recover 2
          no_blanks 1
          /
    )->load_html( location => $loc, );

## http://www.w3.org/TR/xpath/#node-tests
## http://www.w3.org/TR/xpath/#NT-NodeType
## http://www.w3.org/TR/xpath/#section-Text-Nodes
    for my $text ( $dom->findnodes(q{ //text() }) ){
        node_detach( $text );
    }


    local $XML::LibXML::skipXMLDeclaration = 1; ## <?xml ?>
    local $XML::LibXML::setTagCompression = 0;  ## <p />

#~     print "$dom";

    my $pp  = XML::LibXML::PrettyPrint->new_for_html;
    $pp->{indent_string}=' ';
    print $pp->pretty_print( $dom );
}
sub node_detach {
    my( $self ) = @_;
    $self->parentNode->removeChild( $self );
}
optional
  • It's worth noting that any compliant DOM-based solution will wrap the HTML fragment inside a minimal `...` fragment. This parser also sticks to HTML4 semantics (in contrast to HTML5), and will introduce closing tags where there weren't any in our input. – amon Oct 24 '13 at 08:43