How can I extract the date from this HTML table?

Question

I am trying to get the "date" from the second cell in the table using regex, but it is not matching, and I really can't find out why.

my $str = '"    
    <td class="fieldLabel" height="18">Activation Date:</td>
    <td class="dataEntry" height="18">
        10/27/2011      
    </td>';

if ( $str =~ /Activation Date.*<td.*>(.*)</gm ) {
    print "matched: ".$1;
}else{
    print "mismatched!";
}

[The pony he comes...](http://stackoverflow.com/a/1732454/554546) — , Mar 08 '12 at 18:39
In balance, see [Tchrist's response here](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string) — JRFerguson, Mar 08 '12 at 19:03
@JRFerguson: I think I make a cameo appearance there too :-) — Platinum Azure, Mar 08 '12 at 20:45
@PlatinumAzure: Indeed you did and I liked your answer. Prohibition and admonishment without the reason teaches nothing. Regards! — JRFerguson, Mar 08 '12 at 21:05

brian d foy · Accepted Answer · 2020-12-30T10:44:28.003

Others have already pointed out that you want the /s option to make . match a newline so you can cross logical line boundaries with .*. You might also want the non-greedy .*?:

use v5.10;

my $html = <<'HTML';    
    <td class="fieldLabel" height="18">Activation Date:</td>
    <td class="dataEntry" height="18">
        10/27/2011      
    </td>
HTML

my $regex = qr|
    <td.*?>Activation \s+ Date:</td>
        \s*
    <td.*?class="dataEntry".*?>\s*
        (\S+)
    \s*</td>
    |xs;
    
if ( $html =~ $regex ) {
    say "matched: $1";
    }
else {
    say "mismatched!";
    }
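
To see why the non-greedy quantifier matters, here is a minimal sketch on a made-up one-line string (the string and variable names are illustrative, not from the question). The greedy pattern runs on to the last cell, while the non-greedy pattern stops at the first closing bracket:

```perl
use v5.10;
use strict;
use warnings;

# Hypothetical sample: two cells on one line
my $row = '<td class="a">one</td><td class="b">two</td>';

# Greedy: .* grabs as much as it can, so the capture lands in the last cell
my ($greedy) = $row =~ m{<td.*>(.*)</td>}s;

# Non-greedy: .*? grabs as little as it can, so the capture is the first cell
my ($lazy) = $row =~ m{<td.*?>(.*?)</td>}s;

say "greedy: $greedy";
say "lazy:   $lazy";
```

In the question's HTML the greedy form happens to land on the right cell only because the date is in the last td of the string.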

(2020 Update) But I'd use Mojo::DOM and CSS Selectors to get the date. The particular selector may depend on the complete HTML source, but the idea is the same:

use v5.10;

use Mojo::DOM;
use Mojo::Util qw(trim);

my $html = <<'HTML';
    <td class="fieldLabel" height="18">Activation Date:</td>
    <td class="dataEntry" height="18">
        10/27/2011
    </td>
HTML

my $dom = Mojo::DOM->new( $html );
my $date = trim( $dom->at( 'td.dataEntry' )->all_text );

say "Date is $date";

If you have the complete table, it's easier to use something that knows how to parse tables. Let a module such as HTML::TableParser handle all of the details:

use v5.10;

my $html = <<'HTML';
    <table>
    <tr>
    <td class="fieldLabel" height="18">Activation Date:</td>
    <td class="dataEntry" height="18">
        10/27/2011      
    </td>
    </tr>
    </table>
HTML

use HTML::TableParser;
  
sub row {
    my( $tbl_id, $line_no, $data, $udata ) = @_;
    # the trimmed cell text includes the trailing colon
    return unless $data->[0] eq 'Activation Date:';
    say "Date is $data->[1]";
    }
 
# create parser object
my $p = HTML::TableParser->new( 
    [ { id => 1, row => \&row, } ],   # table requests go in an array reference
    { Decode => 1, Trim => 1, Chomp => 1, } 
    );
$p->parse( $html );

There's also HTML::TableExtract:

use v5.10;

my $html = <<'HTML';
    <table>
    <tr>
    <td class="fieldLabel" height="18">Activation Date:</td>
    <td class="dataEntry" height="18">
        10/27/2011      
    </td>
    </tr>
    </table>
HTML

use HTML::TableExtract;
  
my $p = HTML::TableExtract->new;
$p->parse( $html );
my $table_tree = $p->first_table_found;
my $date = $table_tree->cell( 0, 1 );
$date =~ s/\A\s+|\s+\z//g;
say "Date is $date";

score 3 · Answer 2 · edited May 23 '17 at 10:24

You might be misunderstanding the regex flags.

/m changes only the anchors: ^ can match at the beginning of any line and $ at the end of any line, not just at the start and end of the string. It has no effect on what . matches.
/s treats the whole string as a single line by allowing . to match any character, including newline. Normally, . matches any character except newline.

If you add the /s flag, your regex should work, although you really shouldn't parse HTML with regex anyway.
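
As a rough check of the flag behaviour, this sketch runs the question's pattern against the same HTML, once with /m and once with /s. The variable names are mine, and the whitespace trim at the end is an addition, since the capture includes the newlines and indentation around the date:

```perl
use v5.10;
use strict;
use warnings;

my $str = <<'HTML';
    <td class="fieldLabel" height="18">Activation Date:</td>
    <td class="dataEntry" height="18">
        10/27/2011
    </td>
HTML

# /m only changes ^ and $; . still stops at each newline, so this fails
my $with_m = $str =~ /Activation Date.*<td.*>(.*)</m ? 1 : 0;

# /s lets . match newlines, so .* can reach into the second cell
my ($raw) = $str =~ /Activation Date.*<td.*>(.*)</s;

# strip the surrounding whitespace from the captured text
( my $date = $raw ) =~ s/\A\s+|\s+\z//g;

say "with /m: ", ( $with_m ? "matched" : "mismatched" );
say "with /s: $date";
```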
