I have a document with many lines like this:

<tr><td width="10%">doc_no_320F0321</td><td width="5%">116</td><td> bla bla bla 1976, bla bla point (2) bla bla bla. </td><td> bla bla bla 1976, bla bla point (1) bla bla bla. </td></tr>

(Beautified it would look like this:

<tr>
    <td width="10%">doc_no_320F0321</td>
    <td width="5%">116</td>
    <td> bla bla bla 1976, bla bla point (2) bla bla bla. </td>
    <td> bla bla bla 1976, bla bla point (1) bla bla bla. </td>
</tr>

)

What I need to do is check whether the digits in the third and fourth <td> are the same, ignoring the other characters.

For this I'm trying to highlight them with <mark> so that they are easier to see. I'm running this sed replacement:

sed -i -r 's|(<td>.*?)([[:digit:]]+)(.*?<\/td>)|\1<mark>\2<\/mark>\3|g'

But it only surrounds the last digit in each row.

Can someone help me surround ALL runs of digits in the 3rd and 4th <td>?

Thanks.

  • If you only need to match the digits, why are you trying to highlight them in advance? If your regex detects them well enough for you to add a `mark` to them, why do you need to tag them? – Sorix May 05 '20 at 16:27
  • Oh, that was not clear at all, I'm sorry. When I say "match" I mean "make sure they exist in both <td>s", and not "be able to find them with the regex". Wrong terminology there... thanks, I will edit to make it clearer. – Tiago Moreira May 05 '20 at 16:50
  • [Parsing HTML with regex is a hard job](https://stackoverflow.com/a/4234491/372239) HTML and regex are not good friends. Use a parser, it is simpler, faster and much more maintainable. See: http://php.net/manual/en/class.domdocument.php – Toto May 05 '20 at 16:55

3 Answers


It's a bad idea to use regular expressions on arbitrary HTML, because HTML (like other SGML-derived markup) is not a regular language. You need an HTML parser to do this right:

Parse the document, find the third and fourth TD children of each TR, and change their text children.

You might be able to get away with a regex on known, predictable HTML, if you're lucky, by keeping each row on a single line (un-beautified) before transforming it and counting the TDs across in your regexp:

(<tr>(?:<td[^>]*>[^<]*</td>){2}<td[^>]*>[^<\d]*)(\d+)([^<]*</td>)

\1<mark>\2</mark>\3

And the same for the 4th.

But you run into problems when a cell has more than one block of digits you want to "mark": a single substitution like this only wraps one of them.
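One way around the single-match limitation, still assuming one row per line and plain <td> tags (no attributes or nested markup) in cells 3 and 4, is to split the row on </td> and substitute only inside the third and fourth pieces. The sketch below uses awk rather than sed; mark_digits is a hypothetical helper name, and the sample row is the one from the question.

```shell
# Sample row from the question, kept on a single line.
row='<tr><td width="10%">doc_no_320F0321</td><td width="5%">116</td><td> bla bla bla 1976, bla bla point (2) bla bla bla. </td><td> bla bla bla 1976, bla bla point (1) bla bla bla. </td></tr>'

# Wrap every run of digits in cells 3 and 4 with <mark>...</mark>.
# Assumption: those cells open with a bare <td> (digits in tag attributes
# would be wrapped too, which would break the markup).
mark_digits() {
  awk 'BEGIN { FS = OFS = "</td>" }
       {
         # Field i is everything up to the i-th closing </td>.
         for (i = 3; i <= 4 && i <= NF; i++)
           gsub(/[0-9]+/, "<mark>&</mark>", $i)
         print
       }'
}

printf '%s\n' "$row" | mark_digits
```

Because the input and output field separators are both </td>, the row is reassembled unchanged apart from the added <mark> tags.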

Chad Miller

If all you want to do is generate an HTML version of your page with highlighted numbers in the specific columns, you could do something like:

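$d = new DOMDocument();
$d->loadHTMLFile('your_file_path.html');

$x = new DOMXPath($d);
// Select the 3rd and 4th cell of every row in one query.
$cells = $x->query('//tr/td[position() = 3 or position() = 4]');

foreach ($cells as $cell) {
    // Split the cell text on runs of digits, keeping the digit runs.
    // With PREG_SPLIT_DELIM_CAPTURE the digit runs land at the odd indexes.
    $parts = preg_split('/(\d+)/', $cell->textContent, -1, PREG_SPLIT_DELIM_CAPTURE);

    // Empty the cell, then rebuild it from text nodes and <span> elements.
    // Assigning markup to nodeValue would be escaped and shown as literal text.
    $cell->nodeValue = '';
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            $span = $d->createElement('span', $part);
            $span->setAttribute('style', 'color: red;');
            $cell->appendChild($span);
        } else {
            $cell->appendChild($d->createTextNode($part));
        }
    }
}

echo $d->saveHTML();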

The result of $d->saveHTML() is an HTML version where all the numbers in the 3rd and 4th columns are colored in red. If it's what you need, styling can be changed accordingly.

I haven't taken into account handling any missing columns or other incompatibilities that could cause errors.

This code is written in PHP and based on what @Toto suggested.

Hope this helps

Sorix

With sed, and each row on a single line, you might be lucky with

sed -r ':a;s#(.*</td>)(.*<td>)(.*[^\r[:digit:]])([[:digit:]]+)#\1\2\3<mark>\r\4</mark>#;ta;s/\r//g'

You shouldn't parse HTML with sed, so this solution is not worth explaining.
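If the real goal is the comparison rather than the highlighting, the same one-row-per-line assumption allows a direct check without touching the markup. This is only a sketch under that assumption, using awk instead of sed: compare_digits is a hypothetical helper name, and it treats every run of digits in a cell as one number, compared in order.

```shell
# Sample row from the question, kept on a single line.
row='<tr><td width="10%">doc_no_320F0321</td><td width="5%">116</td><td> bla bla bla 1976, bla bla point (2) bla bla bla. </td><td> bla bla bla 1976, bla bla point (1) bla bla bla. </td></tr>'

# Report, per input line, whether cells 3 and 4 contain the same digit runs.
compare_digits() {
  awk -F'</td>' '
    # Reduce a cell to its digit runs separated by single spaces.
    function digits(cell) {
      sub(/^.*<td[^>]*>/, "", cell)   # drop the opening <td ...> tag
      gsub(/[^0-9]+/, " ", cell)      # everything that is not a digit becomes a space
      gsub(/^ | $/, "", cell)         # trim the ends
      return cell
    }
    # Four cells give five fields: the last one holds the closing </tr>.
    NF >= 5 {
      a = digits($3); b = digits($4)
      printf "line %d: %s (%s | %s)\n", NR, (a == b ? "same" : "DIFFERENT"), a, b
    }'
}

printf '%s\n' "$row" | compare_digits
```

For the sample row this reports a difference, since the third cell yields 1976 and 2 while the fourth yields 1976 and 1.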

Walter A