>>> import re
>>> sentence = "Thomas Jefferson began building Monticello at the age of 26."
>>> tokens1 = re.split(r"([-\s.,;!?])+", sentence)
>>> tokens2 = re.split(r"[-\s.,;!?]+", sentence)
>>> tokens1
['Thomas', ' ', 'Jefferson', ' ', 'began', ' ', 'building', ' ', 'Monticello', ' ', 'at', ' ', 'the', ' ', 'age', ' ', 'of', ' ', '26', '.', '']
>>> tokens2
['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26', '']

Can you explain the purpose of ( and )?

X. Wang
  • Inside a regular expression, `(..)` is a "capture group" and `[..]` is a "character class". The documentation will explain what these mean (and how capture groups apply to split) in more detail. – user2864740 Jan 02 '18 at 01:34
  • Read the [docs](https://docs.python.org/3/library/re.html) before asking here. – user2357112 supports Monica Jan 02 '18 at 01:34
  • @user2864740 Yes, but why would simply _capturing_ `([...])` change what is matched? – Tim Biegeleisen Jan 02 '18 at 01:35
  • @TimBiegeleisen `(..)` changes the behavior of `split`. Hence the follow-on sentence. There are definitely some duplicates. – user2864740 Jan 02 '18 at 01:36
  • If you want the group to not affect the split you can use `(?:)` – mdatsev Jan 02 '18 at 01:37
  • Anyway, the documentation excerpt (already linked above) here: ["*If capturing parentheses are used* in pattern, then *the text of all groups in the pattern are also returned* as part of the resulting list."](https://docs.python.org/3/library/re.html) – user2864740 Jan 02 '18 at 01:40
  • @TimBiegeleisen It doesn't change what is matched. Exactly the same things are matched. It does, however, change what `split` returns (a very different thing). – ysth Jan 02 '18 at 01:44
  • You may find [this reference on SO](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean) helpful in future – Patrick Haugh Jan 02 '18 at 01:44
  • @ysth I am a novice in Python. In Java, there would be no difference, and perhaps the OP is also coming from a different background. – Tim Biegeleisen Jan 02 '18 at 01:50

2 Answers


`(..)` in a regex denotes a capturing group (aka "capturing parentheses"). Capturing groups are used when you want to extract values out of a match. In this case, you are using the `re.split` function, which behaves in a specific way when the pattern has capturing groups. According to the documentation:

re.split(pattern, string, maxsplit=0, flags=0)

Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.

So normally, the delimiters used to split the string are not present in the result, as in your second example. However, if you use `()`, the text captured by the groups will also appear in the result of the split. This is why you get a lot of `' '` in the first example: each one is the text captured by your group `([-\s.,;!?])`.
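A minimal sketch of the difference, using a shortened sentence of my own rather than the one from the question. The non-capturing form `(?:...)` is the one suggested in the comments on the question:

```python
import re

# Shortened example sentence (an assumption, not the one from the question).
text = "at the age of 26."

# No group: the delimiters are dropped from the result.
plain = re.split(r"[-\s.,;!?]+", text)

# Capturing group: the text captured by the group is returned as well.
captured = re.split(r"([-\s.,;!?])+", text)

# Non-capturing group: groups the pattern without changing what split returns.
non_capturing = re.split(r"(?:[-\s.,;!?])+", text)

print(plain)          # ['at', 'the', 'age', 'of', '26', '']
print(captured)       # ['at', ' ', 'the', ' ', 'age', ' ', 'of', ' ', '26', '.', '']
print(non_capturing)  # ['at', 'the', 'age', 'of', '26', '']
```

The trailing `''` appears in all three results because the sentence ends with a delimiter, so there is an empty string after the last split point.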

user2864740
Sweeper

With a capturing group (`()`) in the regex used to split a string, `split` will include the captured parts in its result.

In your case, you are splitting on one or more characters of whitespace and/or punctuation, but capturing only the last of those characters to include in the split parts, which seems like a strange thing to do. I'd have expected you to want to capture the whole separator, which would look like `r"([-\s.,;!?]+)"` (capturing one or more whitespace/punctuation characters, rather than matching one or more but capturing only the last).
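To illustrate, here is a small sketch with a made-up input whose separator is longer than one character; the sentence from the question only has single-character separators, so the two patterns would give the same result there:

```python
import re

# Hypothetical input with a multi-character separator ("... ").
text = "Wait... what?"

# Repetition outside the group: only the last separator character is captured.
last_only = re.split(r"([-\s.,;!?])+", text)

# Repetition inside the group: the whole run of separator characters is captured.
whole_run = re.split(r"([-\s.,;!?]+)", text)

print(last_only)  # ['Wait', ' ', 'what', '?', '']
print(whole_run)  # ['Wait', '... ', 'what', '?', '']
```

With the second pattern, joining the list back together reproduces the original string, which is usually what you want if you are keeping the separators at all.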

ysth