Extract HTML table content based on "thead"

Question

Here is a basic HTML table:

<table>
  <thead>
    <td class="foo">bar</td>
  </thead>
  <tbody>
    <td>rows</td>
    …
  </tbody>
</table>

Suppose there are several such tables in the source file. Is there an option of hxextract, or a CSS3 selector I could use with hxselect, or some other tool, that would let me extract one particular table, based either on the content of its thead or on the class of a thead cell, if it has one? Or am I stuck with not-so-simple awk (or maybe perl, as I found before submitting) scripting?

Update: For content-based extraction, Perl's HTML::TableExtract does the trick:

#!/usr/bin/env perl

use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';
use HTML::TableExtract;

# Extract tables based on header content; slice_columns => 0 keeps every
# column of a matching table, which helps when colspan causes issues
my $te = HTML::TableExtract->new( headers => ['Multi'], slice_columns => 0 );
$te->parse_file('mywebpage.html');

# Loop over all matching tables
foreach my $ts ($te->tables)
{
  # Print table identification
  print "Table (", join(',', $ts->coords), "):\n";

  # Print table content, one row per line (empty cells may be undef)
  foreach my $row ($ts->rows)
  {
    print join(':', map { $_ // '' } @$row), "\n";
  }
}

However, in some cases a simple lynx -dump mywebpage.html coupled with awk or whatever can be just as efficient.

Did you try the parent selector? $('foo').parent().parent() // this will give you the table that has the foo class in the td — Idan Magled, Sep 22 '14 at 08:45
I'm afraid it doesn't work with `hxselect` or `hxextract`. But anyway the syntax you suggest wouldn't work, so are you thinking about another (command line) tool? — Skippy le Grand Gourou, Sep 22 '14 at 08:56
He's thinking about jQuery, a JavaScript library. You'll have to forgive folks around here for mistakenly assuming any question involving HTML must somehow involve a Web browser and therefore JavaScript, and that jQuery must be in use - it seems to happen all the time for some reason... — BoltClock, Sep 22 '14 at 11:18
Well, I guess technically I *could* [use JS from CLI](http://stackoverflow.com/questions/2941411/executing-javascript-without-a-browser)… — Skippy le Grand Gourou, Sep 22 '14 at 15:54

score 2 · Accepted Answer · edited May 23 '17 at 12:05

This would require a parent selector or a relational selector, which does not as yet exist (and by the time it does exist, hxselect may not implement it because it does not even fully implement the current standard as of this writing). hxextract appears to only retrieve an element by its type and/or class name, so the best it'd do is td.foo, which would return the td only, not its thead or table.

If you are processing this HTML from the command line, you will need a script.
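As an illustration of what such a script might look like, here is a minimal sketch in Python using only the standard library's html.parser. It is not part of hxselect or hxextract. The names `TheadTableFinder` and `extract_tables` are made up for this example, and it assumes a match means either "some element inside thead carries the given class" or "some text inside thead contains the given string". The table source it returns is rebuilt from parser events, so comments are dropped and character references may be written differently than in the input.

```python
import html
from html.parser import HTMLParser


class TheadTableFinder(HTMLParser):
    """Collect the source of each top-level <table> whose <thead> matches."""

    def __init__(self, cls=None, text=None):
        super().__init__(convert_charrefs=True)
        self.cls = cls        # class name to look for inside thead (or None)
        self.text = text      # substring to look for inside thead (or None)
        self.matches = []     # rebuilt source of every matching table
        self._depth = 0       # table nesting depth, 0 means outside any table
        self._buf = None      # pieces of the table currently being read
        self._in_thead = False
        self._hit = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._depth += 1
            if self._depth == 1:
                self._buf, self._hit = [], False
        if self._depth == 0:
            return
        self._buf.append(self.get_starttag_text())
        if tag == "thead" and self._depth == 1:
            self._in_thead = True
        elif self._in_thead:
            classes = (dict(attrs).get("class") or "").split()
            if self.cls in classes:
                self._hit = True

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if self._depth:
            self._buf.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self._depth == 0:
            return
        self._buf.append(f"</{tag}>")
        if tag == "thead" and self._depth == 1:
            self._in_thead = False
        elif tag == "table":
            self._depth -= 1
            if self._depth == 0:
                if self._hit:
                    self.matches.append("".join(self._buf))
                self._buf = None

    def handle_data(self, data):
        if self._depth == 0:
            return
        self._buf.append(html.escape(data, quote=False))
        if self._in_thead and self.text is not None and self.text in data:
            self._hit = True


def extract_tables(source, cls=None, text=None):
    """Return the tables in `source` whose thead matches `cls` or `text`."""
    finder = TheadTableFinder(cls=cls, text=text)
    finder.feed(source)
    finder.close()
    return finder.matches
```

With the table from the question saved in mywebpage.html, `extract_tables(open('mywebpage.html', encoding='utf-8').read(), cls='foo')` should return a one-element list holding that table, and `text='bar'` should select the same table by header content. Nested tables are returned as part of their outermost table, and only the outermost table's thead is inspected.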
