Regex for parsing HTML tables?

Question

I have already tried a lot of different regexes but couldn't find a solution.

I need a regex to find:

<tr>
    <td>XYZ</td>
    <td>XYZ</td>
</tr>
<tr>
    <td>XYZ</td>
</tr>
<tr>
    <td>XYZ</td>
</tr>

This is what I have so far:

<tr>\s*<td.*?>(.*?)</td>\s*<td.*?>(.*?)</td>\s*</tr>(\s*<tr>\s*<td>.*?</td>\s*</tr>)*

So the first <tr> block must contain two <td> tags and all following (0 or many) <tr> tags must only contain 1 <td> tag.

Thanks a lot in advance.

@Andreas Yeah I know but it's a task for a uni and we must use regex. Usually I do it with a HTML parser :( — Peter Software, Nov 18 '19 at 20:34
Please show whoever gave you this task these questions/answers: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454, [Can you provide some examples of why it is hard to parse XML and HTML with a regex?](https://stackoverflow.com/q/701166), [Using regular expressions to parse HTML: why not?](https://stackoverflow.com/q/590747). — Pshemo, Nov 18 '19 at 20:35
But just for "fun", let's work on this regex. What problems are you facing? What do you have so far? — Pshemo, Nov 18 '19 at 20:41
@Pshemo My regex looks like this so far: `(.*?(.*?)){2}.*?(.*?.*?(.*?).*?)*?` — Peter Software, Nov 18 '19 at 20:45
By default, the dot `.` can't match line separators like `\r` or `\n`. To let `.*` match them, compile your regex with the `DOTALL` flag, like `Pattern p = Pattern.compile(yourRegex, Pattern.DOTALL);`. Also, the last `?` causes the last rows not to be matched, since `?` makes `*` reluctant. Consider removing it. — Pshemo, Nov 18 '19 at 20:52
Or don't use the dot in `.*` at all. If you want to match a line separator, just use `\R` instead, or `\s`, which will also match other whitespace such as tabs. Using `.*` will also match extra `...`. — Pshemo, Nov 18 '19 at 20:54
@Pshemo Yep, I used DOTALL and MULTILINE as flags. I updated the regex to: `\s*(.*?)\s*(.*?)\s*(\s*\s*.*?\s*)*` but it's still not working. — Peter Software, Nov 18 '19 at 21:03
How are you using this regex? Can you provide some code example? Can you include link to it like on https://ideone.com/? — Pshemo, Nov 18 '19 at 21:30
@Pshemo Here 2 files combined: https://hastebin.com/noqovupame.java The input file is this page locally: https://de.wikipedia.org/wiki/Liste_von_Comicverfilmungen — Peter Software, Nov 18 '19 at 21:43
What is the **actual** content you are getting via `String allMoviesEnumeration = scanner.next();`? Can you post it, or preferably its fragment which will let us see actual data you are parsing along with expected results? — Pshemo, Nov 18 '19 at 22:15
OK so what is the expected output you want to get, and what you see instead? — Pshemo, Nov 18 '19 at 22:58
[Parsing HTML with regex is a hard job](https://stackoverflow.com/a/4234491/372239) HTML and regex are not good friends. Use a parser, it is simpler, faster and much more maintainable. — Toto, Nov 19 '19 at 12:38
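
To illustrate Pshemo's point about `DOTALL` and `\s` from the comments above, here is a minimal sketch. The class name `DotallDemo`, its method names, and the one-row sample input are made up for illustration; they are not taken from the asker's code.

```java
import java.util.regex.Pattern;

public class DotallDemo {
    // Hypothetical sample input: one table row spread over several lines.
    static final String ROW = "<tr>\n    <td>XYZ</td>\n</tr>";

    // Tries "<tr>.*?</tr>" against ROW with the given flags (0 = no flags).
    public static boolean dotFindsRow(int flags) {
        return Pattern.compile("<tr>.*?</tr>", flags).matcher(ROW).find();
    }

    // Uses \s* between the tags instead of the dot, so no flag is required.
    public static boolean whitespaceFindsRow() {
        return Pattern.compile("<tr>\\s*<td>.*?</td>\\s*</tr>").matcher(ROW).find();
    }

    public static void main(String[] args) {
        System.out.println(dotFindsRow(0));              // false: '.' stops at '\n'
        System.out.println(dotFindsRow(Pattern.DOTALL)); // true
        System.out.println(whitespaceFindsRow());        // true
    }
}
```

Without the flag the dot cannot cross the line breaks inside the row, so the first call finds nothing; either `DOTALL` or explicit `\s*` between the tags gets past them.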

Hassan Ibraheem · Answer 1 · 2019-11-19T04:41:03.357

-1

This is the regex format for extracting HTML tables from web-page source code:

(?is)<tr.*?>.*?(?:<td.*?>(.*?)<\/td>\s*)(?=(?:<td.*?>(.*?)<\/td>)?).*?<\/tr>

You can apply the above pattern in any programming language; the details depend on how that language handles regular expressions.

edited Nov 19 '19 at 04:41

answered Nov 18 '19 at 20:07


``? --- Why are you escaping `/`? Java regex doesn't need that. --- What is the purpose of the `(?: )` non-capturing group? – Andreas Nov 18 '19 at 20:10
This doesn't work as expected. I'm not an expert but I need to "count" the occurrences of the TD-Tags. How to implement that? – Peter Software Nov 18 '19 at 20:15

Booboo · Answer 2 · 2019-11-19T12:15:59.567

Processing HTML tags with regular expressions is problematic; one should use an HTML parser if at all possible. Let's take a simple case of recognizing (fictional) tags <a> and <b>. To keep it simple we will assume that we do not have to worry about attributes on these tags or white space. We are interested in recognizing a single <b> tag nested within an <a> tag, such as:

<a><b>1</b></a>

The "obvious" but incorrect regular expression is:

<a><b>.*?</b></a>

It will match the above example, but it will also match:

<a><b>1</b><b>2</b></a>

Even though .*? is reluctant rather than greedy, it still consumes as many characters as it needs for the rest of the regular expression to match the rest of the input.
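
A small sketch to check this behaviour in Java; the class name `ReluctantDemo` and the helper `naiveMatches` are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

public class ReluctantDemo {
    // Returns true if the naive pattern matches the entire input.
    public static boolean naiveMatches(String input) {
        return Pattern.matches("<a><b>.*?</b></a>", input);
    }

    public static void main(String[] args) {
        // The intended case: one <b> inside <a>.
        System.out.println(naiveMatches("<a><b>1</b></a>"));         // true
        // The unintended case: .*? expands over "1</b><b>2" so that
        // the trailing </b></a> can still match.
        System.out.println(naiveMatches("<a><b>1</b><b>2</b></a>")); // true
    }
}
```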

You need to replace .*? with something that will not scan past the closing </b> tag:

((?!</b>).)*

This says that as long as the next characters are not the closing </b> tag, scan one more character. For good measure you might also want to ensure you do not skip over the start of another <a> tag:

((?!(<a>|</b>)).)*

So the final regex becomes:

<a><b>((?!(<a>|</b>)).)*</b></a>
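
As a quick check of this final regex, the sketch below runs it against both sample inputs from above. The class name `TemperedDemo` and the helper `matchesWhole` are hypothetical names introduced only for this example.

```java
import java.util.regex.Pattern;

public class TemperedDemo {
    // Each character inside <b>...</b> is consumed only if neither "<a>"
    // nor "</b>" starts at that position (negative lookahead before the dot).
    static final Pattern TEMPERED =
            Pattern.compile("<a><b>((?!(<a>|</b>)).)*</b></a>");

    // Returns true if the pattern matches the entire input.
    public static boolean matchesWhole(String input) {
        return TEMPERED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("<a><b>1</b></a>"));         // true
        // The scan stops at the first </b>, and "<b>2..." is not "</a>",
        // so the two-<b> input is rejected.
        System.out.println(matchesWhole("<a><b>1</b><b>2</b></a>")); // false
    }
}
```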

Anyway, that's the approach I have taken. Consequently, the regex for the problem at hand becomes rather complicated.

My understanding is that you are looking for a <tr> tag with two nested <td> tags followed by 0 or more <tr> tags with one nested <td> tag. If I have that straight, then the regex is:

"(?s)<tr[^>]*>(\\s*<td[^>]*>((?!(<tr|</td)).)*</td>\\s*){2}\\s*</tr>(\\s*<tr[^>]*>\\s*<td[^>]*>((?!(<tr|</td)).)*</td>\\s*</tr>)*"

The code:

import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.util.regex.MatchResult;

public class Test
{
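    public static void doMatch (String s) {
        // First <tr>: exactly two <td> cells. Then zero or more <tr> rows with one <td> each.
        // ((?!(<tr|</td)).)* consumes cell content without running past </td> or into a new <tr>.
        Pattern pattern = Pattern.compile("(?s)<tr[^>]*>(\\s*<td[^>]*>((?!(<tr|</td)).)*</td>\\s*){2}\\s*</tr>(\\s*<tr[^>]*>\\s*<td[^>]*>((?!(<tr|</td)).)*</td>\\s*</tr>)*");
        Matcher matcher = pattern.matcher(s);
        // Print every non-overlapping match found in the input.
        while (matcher.find()) {
            MatchResult m = matcher.toMatchResult();
            System.out.println("Match: " + m.group(0));
        }
    }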

    public static void main(String[] args) {
        String s = "<tr>\n    <td>XYZ</td>\n    <td><tag1>abc\ndef</tag2></td>\n</tr>\n<tr>\n    <td>XYZ</td>\n</tr>\n<tr>\n    <td>XYZ</td>\n</tr>";
        Test.doMatch(s);
        s = "<tr><td>1></td><td>2</td></tr><tr><td>3></td><td>4</td></tr><tr><td>5></td><td>6</td></tr><tr><td>7</td></tr>";
        Test.doMatch(s);
    }
}

Prints:

Match: <tr>
    <td>XYZ</td>
    <td><tag1>abc
def</tag2></td>
</tr>
<tr>
    <td>XYZ</td>
</tr>
<tr>
    <td>XYZ</td>
</tr>
Match: <tr><td>1></td><td>2</td></tr>
Match: <tr><td>3></td><td>4</td></tr>
Match: <tr><td>5></td><td>6</td></tr><tr><td>7</td></tr>
