Processing HTML tags with regular expressions is problematic; one should use an HTML parser if at all possible. Let's take a simple case of recognizing the (fictional) tags <a> and <b>. To keep it simple, we will assume that we do not have to worry about attributes on these tags or about whitespace. We are interested in recognizing a single <b> tag nested within an <a> tag, such as:
<a><b>1</b></a>
The "obvious" but incorrect regular expression is:
<a><b>.*?</b></a>
It will match the above example, but it will also match:
<a><b>1</b><b>2</b></a>
Even though .*? is not greedy, it still consumes as many characters as it needs to for the rest of the regular expression to match the rest of the input. Here it scans past the first </b> because only the second </b> is followed by </a>.
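The behaviour described above can be checked with a minimal sketch like the following. The class and method names (LazyQuantifierDemo, matchesWhole) are made up for illustration and are not part of the original code.

```java
import java.util.regex.Pattern;

class LazyQuantifierDemo {
    // The "obvious" pattern: a reluctant .*? between the <b> tags.
    static final Pattern LAZY = Pattern.compile("<a><b>.*?</b></a>");

    // True if the whole input is matched by the pattern.
    static boolean matchesWhole(String input) {
        return LAZY.matcher(input).matches();
    }

    public static void main(String[] args) {
        // The intended case: a single <b> inside <a>.
        System.out.println(matchesWhole("<a><b>1</b></a>"));
        // The unintended case: .*? expands across the first </b>.
        System.out.println(matchesWhole("<a><b>1</b><b>2</b></a>"));
    }
}
```

Both calls return true, which is exactly the problem: the reluctant quantifier only changes the order in which lengths are tried, not which strings can match.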
You need to replace .*? with something that will not scan past the closing </b> tag:
((?!</b>).)*
This says that as long as the next characters are not the closing </b> tag, scan one more character. For good measure, you might also want to ensure you do not skip over the start of another <a> tag:
((?!(<a>|</b>)).)*
So the final regex becomes:
<a><b>((?!(<a>|</b>)).)*</b></a>
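As a quick check of the final regex, here is a small sketch that runs it against both sample inputs. The names TemperedDotDemo and found are hypothetical helpers introduced only for this example.

```java
import java.util.regex.Pattern;

class TemperedDotDemo {
    // Each '.' is guarded by a negative lookahead, so the scan stops
    // before a closing </b> or the start of another <a>.
    static final Pattern TEMPERED =
        Pattern.compile("<a><b>((?!(<a>|</b>)).)*</b></a>");

    // True if the pattern matches anywhere in the input.
    static boolean found(String input) {
        return TEMPERED.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("<a><b>1</b></a>"));         // single <b>
        System.out.println(found("<a><b>1</b><b>2</b></a>")); // two <b> tags
    }
}
```

The first call returns true and the second returns false, because the guarded scan cannot step over the first </b> to reach the </a>.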
Anyway, that's the approach I have taken. Consequently, the regex for the problem at hand becomes rather complicated.
My understanding is that you are looking for a <tr> tag with two nested <td> tags, followed by zero or more <tr> tags with one nested <td> tag each. If I have that straight, then the regex, written as a Java string literal, is:
"(?s)<tr[^>]*>(\\s*<td[^>]*>((?!(<tr|</td)).)*</td>\\s*){2}\\s*</tr>(\\s*<tr[^>]*>\\s*<td[^>]*>((?!(<tr|</td)).)*</td>\\s*</tr>)*"
The code:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.util.regex.MatchResult;

public class Test
{
    public static void doMatch (String s) {
        // (?s) turns on DOTALL so that '.' also matches the newlines inside a cell.
        Pattern pattern = Pattern.compile("(?s)<tr[^>]*>(\\s*<td[^>]*>((?!(<tr|</td)).)*</td>\\s*){2}\\s*</tr>(\\s*<tr[^>]*>\\s*<td[^>]*>((?!(<tr|</td)).)*</td>\\s*</tr>)*");
        Matcher matcher = pattern.matcher(s);
        // Print every non-overlapping match in the input.
        while (matcher.find()) {
            MatchResult m = matcher.toMatchResult();
            System.out.println("Match: " + m.group(0));
        }
    }

    public static void main(String[] args) {
        // A two-cell row followed by two one-cell rows, spread over several lines.
        String s = "<tr>\n <td>XYZ</td>\n <td><tag1>abc\ndef</tag2></td>\n</tr>\n<tr>\n <td>XYZ</td>\n</tr>\n<tr>\n <td>XYZ</td>\n</tr>";
        Test.doMatch(s);
        // Three two-cell rows, with a one-cell row after the last of them.
        s = "<tr><td>1></td><td>2</td></tr><tr><td>3></td><td>4</td></tr><tr><td>5></td><td>6</td></tr><tr><td>7</td></tr>";
        Test.doMatch(s);
    }
}
Prints:
Match: <tr>
<td>XYZ</td>
<td><tag1>abc
def</tag2></td>
</tr>
<tr>
<td>XYZ</td>
</tr>
<tr>
<td>XYZ</td>
</tr>
Match: <tr><td>1></td><td>2</td></tr>
Match: <tr><td>3></td><td>4</td></tr>
Match: <tr><td>5></td><td>6</td></tr><tr><td>7</td></tr>
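Since the full pattern is hard to read in one piece, one option is to assemble it from named parts. The sketch below does that; the constant names (CELL, TWO_CELL_ROW, ONE_CELL_ROWS, FULL) and the countMatches helper are my own labels for illustration, and the concatenation is intended to produce the same pattern string as the one used above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RowPatternParts {
    // One cell: an opening <td ...>, then characters consumed one at a time
    // as long as neither "<tr" nor "</td" starts there, then the closing </td>.
    static final String CELL = "<td[^>]*>((?!(<tr|</td)).)*</td>";

    // The leading row: exactly two cells, with optional whitespace around them.
    static final String TWO_CELL_ROW =
        "<tr[^>]*>(\\s*" + CELL + "\\s*){2}\\s*</tr>";

    // Zero or more following rows that each hold exactly one cell.
    static final String ONE_CELL_ROWS =
        "(\\s*<tr[^>]*>\\s*" + CELL + "\\s*</tr>)*";

    // (?s) is DOTALL: '.' also matches line terminators.
    static final String FULL = "(?s)" + TWO_CELL_ROW + ONE_CELL_ROWS;

    // Counts the non-overlapping matches of FULL in the input.
    static int countMatches(String input) {
        Matcher m = Pattern.compile(FULL).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String s = "<tr><td>1></td><td>2</td></tr><tr><td>3></td><td>4</td></tr>"
                 + "<tr><td>5></td><td>6</td></tr><tr><td>7</td></tr>";
        System.out.println(countMatches(s));
    }
}
```

For the second sample string this counts three matches, in line with the three "Match:" lines printed above.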