I have text made up of paragraphs (articles), and a date always appears on the line above each article. The problem is that each article is followed by an unknown number of line breaks, which may be different kinds of Unicode line breaks. I need to remove every run of line breaks between articles and replace it with exactly two newlines (\n\n).

So from this

05/12
The 1959 Mexico hurricane was a devastating tropical cyclone
that was one of the worst ever Pacific hurricanes. It 
impacted the Pacific coast of Mexico in October 1959. The
hurricane killed at least 1,000 people.




11/01
The 1959 Mexico hurricane was a devastating tropical cyclone
that was one of the worst ever Pacific hurricanes. It 
impacted the Pacific coast of Mexico in October 1959. The
hurricane killed at least 1,000 people.

To this

05/12
The 1959 Mexico hurricane was a devastating tropical cyclone
that was one of the worst ever Pacific hurricanes. It 
impacted the Pacific coast of Mexico in October 1959. The
hurricane killed at least 1,000 people.

11/01
The 1959 Mexico hurricane was a devastating tropical cyclone
that was one of the worst ever Pacific hurricanes. It 
impacted the Pacific coast of Mexico in October 1959. The
hurricane killed at least 1,000 people.

I tried using preg_replace(), but it's not matching every instance:

$text = preg_replace('/\r?\n+(?=\d{2}\/\d{2})/', "\n\n", $text);
paulie.jvenuez
Perhaps you need to match all Unicode characters that represent line breaks? I know of another one that was breaking a text tokenizer of mine a week ago: carriage return `\r`. That's just a hint though... **Scratch that**, it looks like you are already matching `\r`. – Chris Cirefice Oct 31 '13 at 00:30

1 Answer

I posted an answer to a similar question about a month ago.

To match anything considered a line break sequence, you can use \R.

\R matches a generic newline; that is, anything considered a linebreak sequence by Unicode. This includes all characters matched by \v (vertical whitespace) and the multi character sequence \x0D\x0A.

Try this instead.

$text = preg_replace('~\R+(?=\d{2}/\d{2})~u', "\n\n", $text);

See the PCRE documentation on different ways to implement this.
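
As a rough illustration of the difference, here is a minimal sketch that runs both patterns on a made-up sample. The helper name `collapseBreaksBeforeDates()` and the sample string are assumptions for demonstration only, not part of the original question. The sample separates the two articles with CRLF, U+2028 and U+0085, so the pattern from the question finds nothing to replace, while `\R+` consumes the whole run.

```php
<?php
// Hypothetical helper (not from the original post): collapse every run of
// line breaks that precedes a dd/dd date line into exactly one blank line.
function collapseBreaksBeforeDates(string $text): string
{
    // \R+ consumes the whole run, treating \r\n as a single break; the
    // lookahead keeps the date out of the match. The u modifier enables
    // UTF-8 mode so U+0085, U+2028 and U+2029 are recognised.
    $result = preg_replace('~\R+(?=\d{2}/\d{2})~u', "\n\n", $text);

    // preg_replace() returns null on failure (for example invalid UTF-8
    // input with the u modifier), so fall back to the unchanged text.
    return $result ?? $text;
}

// Made-up sample: CRLF, U+2028 and U+0085 between the two articles.
$sample = "05/12\nFirst article.\r\n\r\n\u{2028}\u{0085}11/01\nSecond article.";

// The pattern from the question finds nothing here, because the character
// directly before the date is U+0085 rather than \n.
$old = preg_replace('/\r?\n+(?=\d{2}\/\d{2})/', "\n\n", $sample);
var_dump($old === $sample); // bool(true)

// With \R+ the mixed run collapses to a single "\n\n".
var_dump(collapseBreaksBeforeDates($sample) === "05/12\nFirst article.\n\n11/01\nSecond article."); // bool(true)
```

Note that `\R` only matches line breaks. If the "empty" lines in your data also contain spaces or tabs, the pattern would need to allow for horizontal whitespace as well.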

hwnd