
How can I split a text into an array of sentences?

Example text:

Fry me a Beaver. Fry me a Beaver! Fry me a Beaver? Fry me Beaver no. 4?! Fry me many Beavers... End

Should output:

0 => Fry me a Beaver.
1 => Fry me a Beaver!
2 => Fry me a Beaver?
3 => Fry me Beaver no. 4?!
4 => Fry me many Beavers...
5 => End

I tried some solutions that I've found on SO through search, but they all fail, especially at the 4th sentence.

/(?<=[!?.])./

/\.|\?|!/

/((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])/

/(?<=[.!?]|[.!?][\'"])\s+/    // <- closest one
thelolcat
  • The sentence #4 doesn't follow standard syntax. You need a class of `Terminators` - tokens that mark the end of a sentence. If you use one of the terminators as a regular symbol, then it's either not a terminator or you're misforming the sentences. You can't have your cake and eat it too, to put it simply. – Shark May 04 '13 at 18:14
  • I make cakes and eat them all the time :P Can a regex look ahead like 2 characters and if 2nd character is not uppercase A-Z it means that the punctuation before is not valid – thelolcat May 04 '13 at 18:16
  • Sounds like you already know what needs to be done. – Shark May 04 '13 at 18:20
  • But how do I get that into the regex? – thelolcat May 04 '13 at 18:21
  • @thelolcat you are better off with your own parser; a single regex won't do! You have to consider sentences which contain `Mr.thelolcat`, `no.1` – Anirudha May 04 '13 at 18:32
  • What computer in this world should know that this: `no. 4?!` is the end of a sentence? What if it's `no. 4 (the number after 3)?!` You are currently entering spheres which are reserved for Chuck Norris – hek2mgl May 04 '13 at 18:40
  • @lolcat what you're asking can be done with regexes; what you need is a zero-width assertion. Also, the last regex you gave seems to work. What do you think is wrong with it? – aaronman May 04 '13 at 18:41

1 Answer


Since you want to "split" sentences, why are you trying to match them?

For this case let's use preg_split().

Code:

$str = 'Fry me a Beaver. Fry me a Beaver! Fry me a Beaver? Fry me Beaver no. 4?! Fry me many Beavers... End';
$sentences = preg_split('/(?<=[.?!])\s+(?=[a-z])/i', $str);
print_r($sentences);

Output:

Array
(
    [0] => Fry me a Beaver.
    [1] => Fry me a Beaver!
    [2] => Fry me a Beaver?
    [3] => Fry me Beaver no. 4?!
    [4] => Fry me many Beavers...
    [5] => End
)

Explanation:

Put simply, we split on runs of whitespace (\s+), subject to two conditions:

  1. (?<=[.?!]) Positive lookbehind assertion: the whitespace must be preceded by a period, question mark, or exclamation mark.

  2. (?=[a-z]) Positive lookahead assertion: the whitespace must be followed by a letter (the i modifier makes the class match uppercase letters too). This is a workaround for the no. 4 problem, since the space after "no." is followed by a digit and therefore is not a split point.
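
To make the two assertions concrete, here is a small sketch that runs the same pattern on a few inputs. The second and third input strings are made up for illustration and are not from the question. They show where the workaround stops helping: a sentence that starts with a digit is not split off, and an abbreviation followed by a word (the `Mr.` case raised in the comments) is split.

```php
<?php
// Same pattern as above: whitespace preceded by . ? or ! and followed by a letter.
$pattern = '/(?<=[.?!])\s+(?=[a-z])/i';

// The space after "no." is followed by a digit, so the lookahead rejects it.
$a = preg_split($pattern, 'Fry me Beaver no. 4?! End');
print_r($a); // two elements: "Fry me Beaver no. 4?!" and "End"

// Hypothetical input: a sentence starting with a digit is NOT split off.
$b = preg_split($pattern, 'I ate two. 4 were left. Done');
print_r($b); // two elements: "I ate two. 4 were left." and "Done"

// Hypothetical input: an abbreviation followed by a letter IS split.
$c = preg_split($pattern, 'Mr. Smith left. Bye');
print_r($c); // three elements: "Mr.", "Smith left.", "Bye"
```

So the pattern fits the example text, but for input with abbreviations or sentences that begin with digits or quotes you would probably need extra rules or a dedicated parser.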

HamZa