3

I'm not very experienced with regex and I'm looking for the syntax to exclude something. I'm escaping <, >, " and & in HTML code (replacing them with &lt;, etc.) and I need to exclude <br/> from the escaping. For example:

<html><br/>
   <head><title></title></head><br/>
   <body><br/>
   </body><br/>
</html>

I tried something like r'<\b?![br]' and others, but they don't work completely. I use re.sub() to do the replacement.

stdio
  • I can't and don't want to install external libraries. – stdio Sep 04 '11 at 19:01
  • 2
    @stdio you don't need external libraries; Python comes with the excellent ElementTree (an API which lxml provides an even better implementation of) out of the box. – Charles Duffy Sep 04 '11 at 19:11
  • 1
    XML (like SGML, which it extends) is not a regular language (in the computer science meaning of the term -- if you've taken a compiler design class, they should go into it). Regular expressions are not powerful enough to parse it. – Charles Duffy Sep 04 '11 at 19:13
  • 2
    @Charles Most modern regular expression implementations (including Python's) aren't truly regular. Also, closing this question as a duplicate of that joke post helps the OP in no way. – NullUserException Sep 04 '11 at 19:17
  • 4
    This was erroneously closed as an exact duplicate **OF A JOKE ANSWER!!!** How much more stupid and lame, and wrong, can you possibly get? Voting to reopen. The guy deserves to have his question answered. This **BURN THE WITCH** attitude around here is absolutely too damned much! – tchrist Sep 04 '11 at 19:24
  • All you want to do is HTML-escape everything in a string except for that particular tag? Do you already have the escaping going? Let’s see the code you currently have. There are several easy solutions to this. If the question doesn’t get reopened, I’ll post the answer in comments. – tchrist Sep 04 '11 at 19:30
  • @tchrist: thanks man! I need simply to do what I wrote. Parsing/escape all html code, except 'br' tag. – stdio Sep 04 '11 at 19:43
  • Alternatively, post the answer so the OP is helped, and if/when it gets re-opened transfer it to an answer so it can be marked as solved? :/ – Peter Boughton Sep 04 '11 at 20:03
  • 2
    Unless I'm missing something, and once it's just `<br/>` (not any variants), then can just replace `<(?!br/>)` with `&lt;` and `(?<!<br/)>` with `&gt;` and that's it?
    – Peter Boughton Sep 04 '11 at 20:04
  • @Peter Boughton: perfect! I tried something like this, but with small errors :D – stdio Sep 04 '11 at 20:16
  • @Peter: Go ahead and post your solution since he likes it. I’m heading out. – tchrist Sep 04 '11 at 20:21
  • @tchrist: thanks again for the support :) – stdio Sep 04 '11 at 20:26
  • @NullUserException - Even modern RE variants don't support recursive descent parsing. They're not suited to the task. – Charles Duffy Sep 04 '11 at 23:54
  • 1
    I was once in a similar argument with @tchrist, to which he responded: "Patterns haven’t been ʀᴇɢᴜʟᴀʀ for a really long time now. And don’t tell people what they “can’t” do; you’ll just embarrass yourself when they — or I — show they can. You apparently haven’t read the references I’ve cited. If you had, you would realize that I am perfectly capable and willing to write regexes that are **dynamically self-modifying recursive-descent parsers** in and of themselves. There are more things in heaven and earth than are dreamt of in your automata-theory schoolwork assignments." – NullUserException Sep 05 '11 at 00:08
  • @Charles And by references, he meant: [this](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491), [this](http://stackoverflow.com/questions/4840988/the-recognizing-power-of-modern-regexes/4843579#4843579) and [this](http://stackoverflow.com/questions/4284176/doubt-in-parsing-data-in-perl-where-am-i-going-wrong/4286326#4286326) – NullUserException Sep 05 '11 at 00:09
  • 1
    @NullUserException - Thank you. I learned something here. – Charles Duffy Sep 05 '11 at 05:20

3 Answers

2

Ok, now the question is open again, I can do it as an answer, so...

Unless I'm missing something, and provided it's just <br/> (not any variants), you can simply replace <(?!br/>) with &lt; and (?<!<br/)> with &gt; and that's it?


In Python, it looks like that means this:

import re

text = re.sub( r'<(?!br/>)' , '&lt;' , text )
text = re.sub( r'(?<!<br/)>' , '&gt;' , text )


To explain what's going on, (?!...) is a negative lookahead - it only successfully matches at a position if the following text does not match the sub-expression it contains.
(Note lookaheads do not consume the text matched by their sub-expression, they only verify if it exists, or not.)

Similarly, (?<!...) is a negative lookbehind, and does the same thing but using the preceding text.
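As a rough illustration of the two lookarounds in action, here is a minimal sketch that also handles the & and " characters mentioned in the question. The function name `escape_except_br` and the sample string are made up for this example, and it assumes the only tag to protect is the exact text <br/>.

```python
import re

def escape_except_br(text):
    """Escape &, ", < and > but leave the literal tag <br/> untouched."""
    # '&' goes first, otherwise the '&' of entities added later would be re-escaped.
    text = text.replace('&', '&amp;')
    text = text.replace('"', '&quot;')
    # Negative lookahead: match '<' only when it is NOT followed by 'br/>'.
    text = re.sub(r'<(?!br/>)', '&lt;', text)
    # Negative lookbehind: match '>' only when it is NOT preceded by '<br/'.
    text = re.sub(r'(?<!<br/)>', '&gt;', text)
    return text

sample = '<b>"a & b"</b><br/>'
print(escape_except_br(sample))
```

Because the lookarounds consume nothing, only the single < or > character is replaced each time, and the <br/> text is still intact when the second substitution runs.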

However, lookbehinds do differ slightly from lookaheads (in some regex implementations): the sub-expressions inside lookbehinds must represent fixed-width or limited-width matches.

Python is one of the ones that requires a fixed width - so whilst the above expression works (because it's always four characters), if it was (?<!<br\s*/?)> then it would not be a valid regex for Python because it represents a variable length match. (However, you can stack multiple lookbehinds, so you could potentially manually iterate the assorted options, if that was necessary.)
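To make the stacking idea concrete, here is a small sketch. It assumes the variants to protect are exactly <br>, <br/> and <br /> (that list, and the names `GT_NOT_IN_BR`, `LT_NOT_IN_BR` and `escape_angles`, are choices made for this example, not something from the question).

```python
import re

# A variable-width lookbehind is rejected when the pattern is compiled.
try:
    re.compile(r'(?<!<br\s*/?)>')
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# Stacked fixed-width lookbehinds: one per spelling that should be protected.
GT_NOT_IN_BR = re.compile(r'(?<!<br)(?<!<br/)(?<!<br /)>')
# Lookaheads have no width restriction, so one of them covers all three spellings.
LT_NOT_IN_BR = re.compile(r'<(?!br(?: ?/)?>)')

def escape_angles(text):
    """Escape < and > except in <br>, <br/> and <br />."""
    text = GT_NOT_IN_BR.sub('&gt;', text)
    return LT_NOT_IN_BR.sub('&lt;', text)

print(variable_width_ok)
print(escape_angles('<p>a<br>b<br/>c<br />d</p>'))
```

Each stacked lookbehind is checked independently at the same position, so the > is replaced only if none of the protected prefixes precedes it.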

Peter Boughton
  • I already said: perfect ;) Now, is there a way to do it all in one step? For the regex it's no problem, I can use the 'or' operator (|), but is there a way to pass multiple values to re.sub() as the second parameter? – stdio Sep 04 '11 at 20:21
  • 1
    You're replacing with different things, so you can't really do it in one step. Well, I think PHP lets you pass in an array (for both regex and replacement), but this isn't mentioned in the Python docs, so would need to be a user-defined function if it's that important. Of course, you can probably also do `re.sub( '<(?!br/>)' , '&lt;' , re.sub( '(?<!<br/)>' , '&gt;' , text ) )` if it's just a case of not wanting a temporary variable. – Peter Boughton Sep 04 '11 at 20:32
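For what it's worth, re.sub() does accept a function as its replacement argument, so a single pass is possible along these lines. This is only a sketch: the names `ENTITIES`, `PATTERN` and `escape_one_pass` are invented here, and it again assumes only the exact tag <br/> is exempt.

```python
import re

# Replacement for each character that may be matched.
ENTITIES = {'&': '&amp;', '"': '&quot;', '<': '&lt;', '>': '&gt;'}

# One alternation: '&' or '"' always, '<' and '>' only outside of <br/>.
PATTERN = re.compile(r'[&"]|<(?!br/>)|(?<!<br/)>')

def escape_one_pass(text):
    """Escape everything in one re.sub() call, looking up each match in ENTITIES."""
    return PATTERN.sub(lambda m: ENTITIES[m.group()], text)

print(escape_one_pass('<b>"a & b"</b><br/>'))
```

Since every match is exactly one character, the matched text can be used directly as the dictionary key.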
0

Replace everything, then in a second pass replace "&lt;br/&gt;" with "<br/>".

Or, to generalize, have a list of tags you want to 'revert' and replace "&lt;tag&gt;" with "<tag>", "&lt;/tag&gt;" with "</tag>" and "&lt;tag/&gt;" with "<tag/>".
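A minimal sketch of this two-pass approach, assuming Python 3's standard html.escape() is acceptable for the first pass (note that it also escapes the single quote as &#x27;, which the question did not ask for). The function name `escape_keeping_tags` and its `keep` parameter are made up for this example.

```python
import html

def escape_keeping_tags(text, keep=('br',)):
    """Escape all HTML-special characters, then restore the listed tags."""
    escaped = html.escape(text, quote=True)
    for tag in keep:
        # The three forms named above: <tag>, </tag> and <tag/>.
        for form in ('<%s>' % tag, '</%s>' % tag, '<%s/>' % tag):
            escaped = escaped.replace(html.escape(form), form)
    return escaped

print(escape_keeping_tags('<p>"x" & y<br/></p>'))
```

Text that was already escaped in the input, such as a literal &lt;br/&gt;, is not turned back into a tag, because its & becomes &amp; in the first pass.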

Joaquim Rendeiro
  • 1
    Something better and more elegant? however I prefer to use regex. – stdio Sep 04 '11 at 19:03
  • @stdio: But this answer *does* use regex. Once you’ve converted everything, just undo the tag you didn’t want to really change. – tchrist Sep 04 '11 at 19:37
  • @tchrist: yes, but it's not so elegant and I prefer to use re.sub() making all in a step (excluding 'br' tag parsing through regex). – stdio Sep 04 '11 at 19:45
  • @stdio: Please edit your answer and show what you’re currently doing so I can see where to modify it. Why are you doing this anyway? Some BB posting you need to launder of all HTML or something? – tchrist Sep 04 '11 at 19:47
  • @stdio: Well, if you want something more elegant.... Something tells me that your html wasn't born like this. Go to the source of the problem and remove the premature insertion of the `<br/>`s, then insert them at the end of each line after escaping the tags.
    – Joaquim Rendeiro Sep 04 '11 at 19:48
  • @Joaquim Rendeiro: my html code was born similar to it, there aren't premature insertions of `<br/>`.
    – stdio Sep 04 '11 at 19:53
  • @tchrist: it's not important my exact html code. There is a part in my python application that works similar to something like nopaste/pastebin (so, there is code sent in BBcode form, "translated" in html code, but I don't have to parse 'br' tag in the parsing step). – stdio Sep 04 '11 at 19:58
  • @stdio: Ok then. I think that would be cleaner, for example if the HTML is being produced by an online editor like TinyMCE or similar, you can usually configure what it produces when the user presses "enter". – Joaquim Rendeiro Sep 04 '11 at 20:00
  • @Joaquim Rendeiro: yea, I thought to send the code to an online service, take back the result and pass it to my application, but it will be one of the latest solutions :D – stdio Sep 04 '11 at 20:07
0

Does this correspond to what you need?

import re
from html.entities import entitydefs

ss = '''
<html>
    <br>
        <title>"War & Peace"</title>
        <body>Leon Tolstoy</body>
    <br/>
</html>'''

print(ss)
print('\n\n')


# Characters that are always escaped
uniquechars_repl = '"&'
# Characters that are escaped only when they are not part of <br/>
conditional_repl = {'<': '<(?!br/>)',
                    '>': '(?<!<br/)>'}

all_repl = list(uniquechars_repl) + list(conditional_repl.keys())

# Map each character to its named entity, e.g. '<' -> '&lt;'
di = dict((b, '&%s;' % a) for a, b in entitydefs.items()
          if b in all_repl)

pat = '|'.join(list(uniquechars_repl) + list(conditional_repl.values()))

text = re.sub(pat, lambda mat: di[mat.group()], ss)

print(text)

result

<html>
    <br>
        <title>"War & Peace"</title>
        <body>Leon Tolstoy</body>
    <br/>
</html>




&lt;html&gt;
    &lt;br&gt;
        &lt;title&gt;&quot;War &amp; Peace&quot;&lt;/title&gt;
        &lt;body&gt;Leon Tolstoy&lt;/body&gt;
    <br/>
&lt;/html&gt;
eyquem