1

I have a lot of HTML files that contain unwanted line feeds. These break things like inline JavaScript and formatting within the pages. I want a way to strip every line feed from the pages except those that appear directly after an HTML tag, e.g. </div>. Does anyone know of a regex and/or program that can achieve this?

soulmerge

2 Answers

1

Notepad++'s search/replace function, in regular-expression mode, should catch most of these.

Search for something like:

([^>])\n(.+)

Replaced with:

\1 \2
DisgruntledGoat
    Depending on the line endings used in the HTML file, you may need to use ([^>])\r\n(.+) or ([^>])\r(.+) instead. – Brian Sep 16 '09 at 13:07
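For use outside an editor, the search/replace above and the line-ending variants from the comment can be folded into one pattern. The following is a minimal Python sketch and not part of the original answer; the names `UNWANTED_BREAK` and `join_broken_lines` are made up for illustration. It is a variant rather than an exact port: it uses a lookahead in place of `(.+)` so that several broken lines in a row are joined in a single pass, and it leaves blank lines alone.

```python
import re

# Group 1: one character that is neither ">" nor a line-break character.
# Then a single line break in any common form (CRLF, LF, or lone CR).
# The lookahead requires at least one more character on the next line,
# without consuming it, so the next break can still be matched.
UNWANTED_BREAK = re.compile(r"([^>\r\n])(?:\r\n|\n|\r)(?=[^\r\n])")


def join_broken_lines(html: str) -> str:
    """Replace line breaks that do not directly follow '>' with a space."""
    return UNWANTED_BREAK.sub(r"\1 ", html)
```

For example, `join_broken_lines("var x =\n1;")` gives `"var x = 1;"`, while a break right after `</div>` is kept.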
0

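You can use a negative lookbehind to match only the line breaks that do not directly follow the closing ">" of a tag:

<?php

$buffer = file_get_contents('test.html');

// Remove every line break (LF or CRLF) that does not directly follow ">".
// "\r" is also listed in the lookbehind so that the "\n" of a CRLF pair
// sitting right after a tag is not matched on its own.
$buffer = preg_replace('~(?<![>\r])\r?\n~', '', $buffer);

file_put_contents('test.new.html', $buffer);
?>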

see: http://www.regular-expressions.info/lookaround.html
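Since the question mentions many files, the lookbehind idea can also be applied to a whole directory tree. This is a hedged Python sketch, not the answer's own code: `strip_breaks` and `rewrite_tree` are hypothetical names, the pattern targets any tag rather than only `</div>`, and it assumes the destination directory lies outside the source directory. It works on raw bytes so that no character encoding has to be guessed (this assumes an ASCII-compatible encoding such as UTF-8 or Latin-1).

```python
import re
from pathlib import Path

# A line break (LF or CRLF) that does not directly follow ">".
# "\r" is also in the lookbehind set so that the "\n" of a CRLF pair
# right after a tag is not matched on its own.
BREAK_NOT_AFTER_TAG = re.compile(rb"(?<![>\r])\r?\n")


def strip_breaks(data: bytes) -> bytes:
    """Delete line breaks that are not directly preceded by '>'."""
    return BREAK_NOT_AFTER_TAG.sub(b"", data)


def rewrite_tree(src: Path, dst: Path) -> int:
    """Write a cleaned copy of every *.html file under src into dst.

    Returns the number of files written. The originals are not modified.
    """
    count = 0
    for page in src.rglob("*.html"):
        target = dst / page.relative_to(src)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(strip_breaks(page.read_bytes()))
        count += 1
    return count
```

A call such as `rewrite_tree(Path("pages"), Path("pages_clean"))` would process a folder named `pages` (a placeholder). Passing `b" "` instead of `b""` to `sub` joins the lines with a space, which avoids gluing together words that were separated only by the line break.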

Lance Rushing