I am trying to get the contents of TRs on a web page that have no TRs nested inside them. The HTML contains many nested TRs.

I am limited to RegEx only for this problem.

This is good:

<TR>
    Contents
</TR>

This is not:

<TR>
   other HTML
     <TR>
        Contents
Ian Vink
  • [Noooooeeeeeesssss!!!!111](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – GolezTrol Nov 17 '11 at 21:40
  • I'm afraid this is an impossible task to do with regexes. You should use some HTML parser instead. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Łukasz Wiatrak Nov 17 '11 at 21:40
  • Tony the Pony! He comes! http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Jonathan M Nov 17 '11 at 21:45
  • I'm just curious, why are you limited to a regex-only solution? – Łukasz Wiatrak Nov 17 '11 at 21:46
  • "Three times is maritime law" (Dutch proverb) – GolezTrol Nov 17 '11 at 21:48
  • @Lucasus I know of an A/B split testing tool that post-processes rendered pages using regex matches. My first choice would be not to use such a tool, but if your manager tells you to, it may not be worth losing your job over. In the end, it's just code, even though the `` doesn't hold.... – GolezTrol Nov 17 '11 at 21:50
  • @Lucasus and "at" GolezTrol Before you post junk comments, you may wanna take a look at this: http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491. Be sure to read through it. If you are unable to fathom regexes to solve complex tasks, this doesn't mean that regexes themselves cannot be used to solve those tasks. "At" OP: Are you sure you want to solve this with regex only? – FailedDev Nov 17 '11 at 21:51
  • @FailedDev I think what people are "unable to fathom" is using a tool that would be needlessly complex to do a task that would be incredibly simple with another tool. Regexes are fantastic. But not for this. In my opinion. – Andrew Barber Nov 17 '11 at 21:53
  • @AndrewBarber Saying that something is **IMPOSSIBLE** when it is indeed possible is actually wrong, is it not? – FailedDev Nov 17 '11 at 21:54
  • @FailedDev GolezTrol most certainly did not say it was impossible. Not that I can see, anyway. I think Lucasus was applying a liberal dose of hyperbole. Yes; it is not impossible. It's quite difficult, though, especially compared to an HTML parser. – Andrew Barber Nov 17 '11 at 21:58
  • It's not possible. HTML is not a regular language that can be parsed by a regular expression. You can get *certain strings* in HTML to parse via regex, but accounting for all the possibilities of an HTML tag? No. – Jonathan M Nov 17 '11 at 23:01
  • @JonathanM: you know, a regex is _not_ a regular expression, and hasn't been since back-references were added in the early '70s. Actually, I think the kind of regexes that tchrist uses in the link FailedDev provided are at least more powerful than LL(1) grammars. – ninjalj Nov 18 '11 at 00:15
  • @ninjalj, good point. Still, there's no way regex can cover all the legal ways to structure an HTML tag. See my comments in Tim's answer. – Jonathan M Nov 18 '11 at 04:46

1 Answer

This is actually not that much of a problem with regex (assuming you can guarantee that <tr> will not show up in comments, strings etc.; otherwise the regex will mis-match):

<tr\b(?:(?!</?tr\b).)*</tr>

will only match innermost tr tags. Use the dot-matches-newlines option of your regex engine, or it won't work correctly. If you don't have one (JavaScript, I'm talking to you!), then use [\s\S] instead of the dot (.).
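
To illustrate the dot-matches-newlines point, here is a minimal Python sketch (Python is just a choice for the demo; the sample markup and variable names are made up). It runs the pattern with the DOTALL flag, with the [\s\S] substitution, and with neither, on a small nested table:

```python
import re

# Hypothetical sample markup: one outer row wrapping a table with two inner rows.
SAMPLE = """<table>
<tr><td>
  <table>
    <tr><td>inner one</td></tr>
    <tr>
      <td>inner two</td>
    </tr>
  </table>
</td></tr>
</table>"""

# The pattern from the answer, and the variant for engines without a
# dot-matches-newlines option.
PATTERN = r"<tr\b(?:(?!</?tr\b).)*</tr>"
PATTERN_NO_DOTALL = r"<tr\b(?:(?!</?tr\b)[\s\S])*</tr>"

# With DOTALL, "." also matches newlines, so multi-line rows are found.
with_flag = re.findall(PATTERN, SAMPLE, flags=re.DOTALL | re.IGNORECASE)

# [\s\S] matches any character regardless of flags.
without_flag = re.findall(PATTERN_NO_DOTALL, SAMPLE, flags=re.IGNORECASE)

# Plain "." without DOTALL stops at newlines and misses the multi-line row.
single_line_only = re.findall(PATTERN, SAMPLE, flags=re.IGNORECASE)

print(len(with_flag), len(without_flag), len(single_line_only))
```

The outer row is skipped in all three cases because the lookahead refuses to step over the nested <tr. Only the newline handling differs.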

Explanation:

<tr\b      # Match a tag that starts with tr
(?:        # Match...
 (?!       # (unless it's possible to match
  </?tr\b  #  <tr or </tr at the current position)
 )
 .         # any character 
)*         # any number of times.
</tr>      # Match </tr>
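
Since the question asks for the contents of the rows, the commented pattern above can be used almost as-is in an engine with a free-spacing mode. The following is a sketch in Python under a few assumptions of mine, not part of the original regex: a capture group around the contents, `[^>]*>` to skip the attributes of the opening tag (which assumes no > inside attribute values), and a hypothetical helper name `innermost_row_contents`.

```python
import re

INNERMOST_TR = re.compile(
    r"""
    <tr\b[^>]*>          # opening tag; assumes no '>' inside attribute values
    (                    # group 1: the row contents
      (?:
        (?!</?tr\b)      # stop if <tr or </tr starts at this position
        .                # otherwise consume one character
      )*
    )
    </tr>                # closing tag of the same row
    """,
    re.VERBOSE | re.DOTALL | re.IGNORECASE,
)


def innermost_row_contents(html: str) -> list[str]:
    """Return the stripped contents of every row that contains no other row."""
    return [m.group(1).strip() for m in INNERMOST_TR.finditer(html)]


rows = innermost_row_contents(
    "<TR class='outer'><TD><TABLE><TR><TD> Contents </TD></TR></TABLE></TD></TR>"
)
print(rows)
```

The same caveat applies as for the original pattern: a <tr inside a comment, script, or attribute value will throw it off.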
Tim Pietzcker
  • Unfortunately I cannot upvote for the next two hours. +1 For proving all these commentators wrong. @OP this is as close as you can get with a simple regex. – FailedDev Nov 17 '11 at 22:01
  • BTW, this doesn't prove the commentators wrong. This only works for HTML that fits it. It will disqualify the following tag: `` You can rarely predict what folks building HTML are going to do. That's the reason for not parsing it with regex. – Jonathan M Nov 17 '11 at 22:57
  • @JonathanM: no, what proves commentators wrong is the fact that they obviously didn't read even the first sentence in the question: _get contents of TRs ... that have no TRs nested inside_. They just saw the _html_ and _regex_ tags and jumped to link to the Zalgo text. Very mature. – ninjalj Nov 18 '11 at 00:19
  • @ninjalj, yeah I follow what you're saying. However, my example in the above comment has no nested ``, but it would be excluded by this regex instead of matched as the OP requested. A good rule of thumb with regexes and HTML is to only use them on tags/snippets that you control as the programmer. If others are writing the HTML, but I'm writing the parser, I can't trust that someone won't do something like in the example above. Regex simply can't cover all the legal ways to structure tag content. – Jonathan M Nov 18 '11 at 04:41
  • @JonathanM: This is why I wrote the caveat about ` – Tim Pietzcker Nov 18 '11 at 06:49