I want to break the following string into sentences:

library(NLP) # NLP_0.1-7  
string <- as.String("Mr. Brown comes. He says hello. i give him coffee.")

Here are two different ways to do it. The first comes from the openNLP package:

library(openNLP) # openNLP_0.2-5  

sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = "en")
boundaries_sentences <- annotate(string, sentence_token_annotator)
string[boundaries_sentences]

[1] "Mr. Brown comes."   "He says hello."     "i give him coffee."  

The second comes from the stringi package:

library(stringi) # stringi_0.5-5  

stri_split_boundaries(string, opts_brkiter = stri_opts_brkiter('sentence'))

[[1]]  
 [1] "Mr. "                              "Brown comes. "                    
 [3] "He says hello. i give him coffee."

With the second approach I have to post-process the sentences to remove the extra spaces, or split the resulting strings into sentences again. Can I adjust the stringi call to improve the quality of the result?

On large data, openNLP is much slower than stringi.
Is there a way to combine the speed of stringi with the quality of openNLP?

MWiesner
SRRussel
  • If you don't get an answer here, you may have luck on the [corpus linguistics with R forum](https://groups.google.com/forum/#!forum/corpling-with-r) – drammock Aug 07 '15 at 22:48
  • I opened this as an issue on **stringi**'s GitHub page as well: https://github.com/Rexamine/stringi/issues/184 – Tyler Rinker Aug 10 '15 at 23:18
  • openNLP and stringi differ in how they detect sentence boundaries. stringi seems to work with a set of rules, while openNLP works with a model produced by a learning process. But I still don't see where the bottleneck lies... – SRRussel Aug 13 '15 at 14:36

2 Answers

Text boundary (in this case, sentence boundary) analysis in ICU (and thus in stringi) is governed by the rules described in Unicode UAX #29; see also the ICU User Guide on the topic. We read:

[The Unicode rules] cannot detect cases such as “...Mr. Jones...”; more sophisticated tailoring would be required to detect such cases.

In other words, this cannot be done without a custom dictionary of non-stop words, which in fact is implemented in openNLP. A few possible scenarios to incorporate stringi for performing this task would therefore include:

  1. Use stri_split_boundaries and then write a function that decides which incorrectly split tokens should be joined.
  2. Manually insert non-breaking spaces into the text (possibly after the dots following etc., Mr., i.e., and so on). Note that this is in fact required when preparing documents in LaTeX; otherwise you get overly large spaces between words.
  3. Incorporate a custom non-stop word list into a regex and apply stri_split_regex.

and so on.
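
The first scenario could be sketched roughly as follows. This is only an illustration, not part of stringi: the `abbreviations` vector and the `split_sentences` helper are hypothetical names, and the abbreviation list is an assumption that would need extending for a real corpus.

```r
library(stringi)

# Hypothetical abbreviation list (an assumption); extend it for your own data.
abbreviations <- c("Mr.", "Mrs.", "Dr.", "etc.", "i.e.", "e.g.")

# Split with the ICU sentence rules, then glue any piece that ends in a
# known abbreviation back onto the piece that follows it.
split_sentences <- function(text, abbr = abbreviations) {
  pieces <- stri_split_boundaries(
    text, opts_brkiter = stri_opts_brkiter(type = "sentence")
  )[[1]]
  sentences <- character(0)
  buffer <- ""
  for (piece in pieces) {
    buffer <- paste0(buffer, piece)
    # Last whitespace-delimited token of what has been collected so far
    last_word <- stri_extract_last_regex(stri_trim_both(buffer), "\\S+$")
    if (!(last_word %in% abbr)) {
      # Not an abbreviation: the buffer is a finished sentence
      sentences <- c(sentences, stri_trim_both(buffer))
      buffer <- ""
    }
  }
  # Flush whatever is left if the text ended on an abbreviation
  if (nzchar(buffer)) sentences <- c(sentences, stri_trim_both(buffer))
  sentences
}

split_sentences("Mr. Brown comes. He says hello. i give him coffee.")
```

This only repairs over-splitting: "Mr. " is rejoined with "Brown comes.", and the trailing spaces are trimmed. The "hello. i" case stays unsplit, because the Unicode rules do not break before a lowercase letter, so it would still need separate handling.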

gagolews

This may be a viable regex solution:

string <- "Mr. Brown comes. He says hello. i give him coffee."
stringi::stri_split_regex(string, "(?<!\\w\\.\\w.)(?<![A-Z][a-z]\\.)(?<=\\.|\\?|\\!)\\s")

## [[1]]
## [1] "Mr. Brown comes."   "He says hello."     "i give him coffee."

Performs less well on:

string <- "Mr. Brown comes! He says hello. i give him coffee.  i will got at 5 p. m. eastern time.  Or somewhere in between"
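
One possible way to handle such a string is to put an explicit abbreviation list into the lookbehind instead of relying on letter patterns. The sketch below is an assumption-laden illustration: `abbr` and `split_with_abbr` are hypothetical names, and the list (including the single letters "p" and "m") is tailored to this example only.

```r
library(stringi)

# Hypothetical abbreviation stems (written without the trailing dot).
abbr <- c("Mr", "Mrs", "Dr", "etc", "p", "m")

split_with_abbr <- function(text, abbr_list = abbr) {
  pattern <- paste0(
    "(?<!\\b(?:", paste(abbr_list, collapse = "|"), ")\\.)",  # not right after a listed abbreviation
    "(?<=[.?!])",                                             # right after sentence-ending punctuation
    "\\s+"                                                    # the whitespace run to split on
  )
  stri_split_regex(text, pattern)[[1]]
}

split_with_abbr("Mr. Brown comes! He says hello. i give him coffee.  i will got at 5 p. m. eastern time.  Or somewhere in between")
```

The trade-off is that a sentence that genuinely ends in one of the listed abbreviations (for example "... at 5 p. m. He left.") will not be split there, so the list has to be chosen with the data in mind.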
Tyler Rinker