I have a long text like this:

text = 'Quisiera yo detectar los puntos... pero solo los puntos aislados. Los puntos suspensivos no los quiero detectar. A eso me refiero.'

and I want to get this output:

phrases = ['Quisiera yo detectar los puntos... pero solo los puntos aislados.',
' Los puntos suspensivos no los quiero detectar.',
' A eso me refiero.']

The problem is the three dots in the first phrase. I can't find a regex that distinguishes them from the ordinary single-dot separator. Is there a way to achieve this with a regex?

Diego Buendia

3 Answers

You want to handle .. (or ..., etc.) differently from a single period, treating it as ordinary text rather than as a separator:

(?:[^.]|\.{2,})+\.

Explanation:

  • (?:[^.]|\.{2,})+ will match any string that consists of non-. characters or groups of 2 or more .s
  • \. requires a period, of course

Here's a demo.
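
As a minimal sketch of how this pattern could be applied in Python (the variable names are just for illustration), re.findall returns each phrase together with its closing period, and leading spaces are kept as in the output the question asks for:

```python
import re

text = ('Quisiera yo detectar los puntos... pero solo los puntos aislados. '
        'Los puntos suspensivos no los quiero detectar. A eso me refiero.')

# A phrase is a run of non-dot characters or groups of 2+ dots,
# followed by one closing dot.
pattern = re.compile(r'(?:[^.]|\.{2,})+\.')

phrases = pattern.findall(text)
for phrase in phrases:
    print(repr(phrase))
```

Note that the leading space of the second and third phrases is part of each match; call str.strip() on the results if that is not wanted.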

elixenide
  • This approach works nicely with `re.findall`. For example: `re.findall(r'(?:[^.]|\.{2,})+\.', text)` – benvc Apr 12 '19 at 20:52
  • This is my preferred solution, since it does not make assumptions about the characters after the period. Therefore, if space is missing or if there is some type of punctuation, the regular expression still works. – Diego Buendia Apr 14 '19 at 09:09

You can use a positive lookbehind to split only on whitespace that follows a single period, i.e. a period that is not itself preceded by another dot. This approach ignores any sequence of 2 or more dots.

For example:

import re

s = 'Quisiera yo detectar los puntos... pero solo los puntos aislados. Los puntos suspensivos no los quiero detectar. A eso me refiero.'

sentences = re.split(r'(?<=[^.]\.)\s', s)
print(sentences)
# ['Quisiera yo detectar los puntos... pero solo los puntos aislados.', 'Los puntos suspensivos no los quiero detectar.', 'A eso me refiero.']
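
One assumption built into this split is that whitespace follows every sentence-ending period. The short check below uses a made-up sample string (not from the question) in which one space is missing, and compares the result with the findall pattern from the other answer:

```python
import re

# Hypothetical input: no space between "dos." and "Tres".
sample = 'Uno... dos.Tres. Cuatro.'

# Splits only on whitespace that follows a single period.
parts = re.split(r'(?<=[^.]\.)\s', sample)
print(parts)

# For comparison: matching phrases instead of splitting on whitespace.
found = re.findall(r'(?:[^.]|\.{2,})+\.', sample)
print(found)
```

Here the split leaves 'dos.Tres.' glued together, while the findall variant separates it, so which one fits depends on how clean the input is.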
benvc

Try this...

import re

text = 'Quisiera yo detectar los puntos... pero solo los puntos aislados. Los puntos suspensivos no los quiero detectar. A eso me refiero.'

pattern = r"(?<=\.)\s(?=[A-Z])"
re.split(pattern, text)

The result should be...

['Quisiera yo detectar los puntos... pero solo los puntos aislados.',
 'Los puntos suspensivos no los quiero detectar.',
 'A eso me refiero.']
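
A caveat worth checking for Spanish text: [A-Z] only covers unaccented ASCII capitals, so a sentence starting with an accented capital or an inverted mark is not split off. The sketch below uses a made-up sample, and the widened character class is only one possible adjustment, not part of the original answer:

```python
import re

# Hypothetical sample: the second sentence starts with an accented capital.
sample = 'Llegó tarde. Él no dijo nada. Nadie habló.'

# Original pattern: period, whitespace, then an ASCII capital.
strict = re.split(r'(?<=\.)\s(?=[A-Z])', sample)
print(strict)

# Assumed variant: also accept accented capitals and the opening marks ¿ and ¡.
relaxed = re.split(r'(?<=\.)\s(?=[A-ZÁÉÍÓÚÜÑ¿¡])', sample)
print(relaxed)
```

With the original pattern the first two sentences stay joined; the widened class splits all three.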

My answer is based on this SO answer.

Update:
Looking through other answers under the regex tag, I came across this meta discussion as well as this answer. My answer did not come from an innate knowledge of regular expressions but from about 17 minutes of googling different search terms and poking around Stack Overflow. In the time it took me to craft it, the other two answers showed up.
I realized that my answer was more of a "show me the code" answer than a "teach a man to fish" one. Bottom line: when I'm in acute need of help, I want someone to show me the code. Being able to google for solutions to problems is an important skill, but also a terrible drug. Hopefully my solution helped, but I would also strongly recommend checking out the links in my update, if only for perspective on the state of the regex tag and on making Stack Overflow more meaningful.

VanBantam