
I have a large data set of HTML text, and I frequently find unnecessary, and sometimes multiple, <br> line breaks within <li> tags.

For example:

<li>Some string here<br></li><br><li>Another string here<br><br></li><br>

I would like to remove these <br> that appear between <li> and </li> but preserve everything else, including <br> outside of <li> tags. The text above would become:

<li>Some string here</li><br><li>Another string here</li><br>

What is the regular expression for doing this with preg_replace() in PHP (or re.sub() in Python)?
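
For reference, here is a minimal sketch of what the `re.sub()` route could look like in Python. It assumes `<li>` elements are never nested and always have an explicit `</li>`; the names `LI_PATTERN`, `BR_PATTERN`, and `strip_br_in_li` are made up for illustration. It finds each `<li>...</li>` span and deletes the `<br>` tags inside that span only.

```python
import re

# One <li>...</li> element. Non-greedy, so this assumes list items are
# not nested and every <li> has a matching </li>.
LI_PATTERN = re.compile(r"<li\b[^>]*>.*?</li>", re.IGNORECASE | re.DOTALL)

# <br>, <br/>, or <br /> in any letter case.
BR_PATTERN = re.compile(r"<br\s*/?>", re.IGNORECASE)


def strip_br_in_li(html: str) -> str:
    """Remove <br> tags that fall inside <li>...</li>; leave the rest untouched."""
    return LI_PATTERN.sub(lambda m: BR_PATTERN.sub("", m.group(0)), html)


sample = "<li>Some string here<br></li><br><li>Another string here<br><br></li><br>"
print(strip_br_in_li(sample))
# <li>Some string here</li><br><li>Another string here</li><br>
```

On input that breaks those assumptions (nested lists, omitted `</li>`, `<br>` inside comments or attributes) this will likely give wrong results.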

Andy Lester
  • *"What is the regex?"* How much you gonna pay me? – Kermit Jan 04 '13 at 21:23
  • Are you using PHP to put the content in the li tags? – Aaron Miler Jan 04 '13 at 21:25
  • @AaronMiller No, just trying to remove from raw text. I'm not inserting anything. – user1521440 Jan 04 '13 at 21:27
  • [lxml](http://lxml.de/lxmlhtml.html#cleaning-up-html) module may not do exactly what you want, but it is generally helpful for cleaning up html. – Yavar Jan 04 '13 at 21:27
  • You do **NOT** use regexes to mangle HTML. You use DOM. – Marc B Jan 04 '13 at 21:32
  • See [this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) accepted answer. – John V. Jan 04 '13 at 21:32
  • I suggest you use an `HTML` parser to find leading and trailing `<br>`s within `<li>` tags. The regexes suggested here may work with your example, but be aware that `HTML` is not a Regular Language and can't be parsed with Regular Expressions generally! – fardjad Jan 04 '13 at 21:33
  • I'm not trying to render with html. I'm just treating the html as flat text to do a batch analysis that clusters on certain tags. This isn't for creating valid html code. – user1521440 Jan 04 '13 at 21:40
  • @user1521440: "I'm just treating the html as flat text" and that's the problem. HTML is not flat text. Use an HTML parser. http://htmlparsing.com/ has examples for both PHP and Python. BeautifulSoup in Python is especially powerful. – Andy Lester Jan 04 '13 at 21:47
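
The parser-based approach recommended in the comments can be sketched roughly as follows. To stay dependency-free, this example swaps in Python's standard-library `html.parser` instead of BeautifulSoup or lxml. The class `BrInLiStripper` and the function `strip_br_with_parser` are hypothetical names. It assumes every `<li>` is closed explicitly. It re-emits the input token by token, skipping `<br>` only while inside an `<li>`.

```python
from html.parser import HTMLParser


class BrInLiStripper(HTMLParser):
    """Re-emit HTML unchanged, except that <br> inside <li> is dropped."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.li_depth = 0  # > 0 while inside one or more <li> elements

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_depth += 1
        if tag == "br" and self.li_depth > 0:
            return  # skip line breaks inside list items
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form, e.g. <br/>.
        if tag == "br" and self.li_depth > 0:
            return
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Assumption: every <li> has an explicit </li>.
        if tag == "li" and self.li_depth > 0:
            self.li_depth -= 1
        self.parts.append(f"</{tag}>")  # tag name arrives lowercased

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")


def strip_br_with_parser(html: str) -> str:
    parser = BrInLiStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = "<li>Some string here<br></li><br><li>Another string here<br><br></li><br>"
print(strip_br_with_parser(sample))
# <li>Some string here</li><br><li>Another string here</li><br>
```

Unlike a single non-greedy regex, the depth counter handles nested lists, and a `<br>` that appears inside a comment is never treated as a tag.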