Here is a basic HTML table:
<table>
<thead>
<td class="foo">bar</td>
</thead>
<tbody>
<td>rows</td>
…
</tbody>
</table>
Suppose there are several such tables in the source file. Is there an option of hxextract, a CSS3 selector I could use with hxselect, or some other tool that would let me extract one particular table, based either on the content of thead or on its class, if it has one? Or am I stuck with not-so-simple awk (or maybe perl, as I found before submitting) scripting?
Update:
For content-based extraction, Perl's HTML::TableExtract does the trick:
#!/usr/bin/env perl
use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';
use HTML::TableExtract;

# Extract tables based on header content; slice_columns => 0 helps if there are colspan issues
my $te = HTML::TableExtract->new( headers => ['Multi'], slice_columns => 0 );
$te->parse_file('mywebpage.html');

# Loop over all matching tables
foreach my $ts ($te->tables) {
    # Print table identification (depth, count)
    print "Table (", join(',', $ts->coords), "):\n";
    # Print table content, one row per line
    foreach my $row ($ts->rows) {
        # Empty cells may be undef; print them as empty strings
        print join(':', map { $_ // '' } @$row), "\n";
    }
}
However, in some cases a simple lynx -dump mywebpage.html coupled with awk or a similar tool can be just as efficient.
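
For the class-based case, which the Perl script above does not cover, the awk route can be sketched as follows. This is a minimal sketch, not a robust HTML parser: it assumes every tag sits on its own line, as in the sample table at the top, and the names sample.html and extract_table are made up for the illustration.

```shell
#!/bin/sh
# Hypothetical sample input: two tables, only the second one has
# class="foo" on a header cell.
cat > sample.html <<'EOF'
<table>
<thead>
<td class="other">first</td>
</thead>
<tbody>
<td>a</td>
</tbody>
</table>
<table>
<thead>
<td class="foo">bar</td>
</thead>
<tbody>
<td>rows</td>
</tbody>
</table>
EOF

# extract_table REGEX FILE
# Print every table whose <thead> contains a line matching REGEX.
# Assumption: one tag per line; nested tables are not handled.
extract_table() {
  awk -v pat="$1" '
    /<table/    { buf = ""; intable = 1; hit = 0 }   # start buffering a new table
    /<thead/    { inhead = 1 }
    intable     { buf = buf $0 "\n" }                # keep every line of the table
    intable && inhead && $0 ~ pat { hit = 1 }        # header line matches REGEX
    /<\/thead>/ { inhead = 0 }
    /<\/table>/ { if (intable && hit) printf "%s", buf; intable = 0 }
  ' "$2"
}

# Select by class ...
extract_table 'class="foo"' sample.html
# ... or by header content:
# extract_table '>bar<' sample.html
```

Because the regex is only tested while inside thead, a matching string in the body of another table does not select that table. If the markup is not laid out one tag per line, it would have to be normalized first, or handled with a real parser instead.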