
While splitting a string, I got different results from str.split() and re.split() depending on the regex. Consider the following examples:

  1. Using str.split() gives me a list that excludes the delimiter (.).
var3 = "black.white"
print(var3.split('.'))

# result
# ['black', 'white']
  2. Using re.split() with the regex ([.] *) gives me a list that includes the delimiter (.).
import re

var3 = "black.white"
print(re.split('([.] *)', var3))

# result
# ['black', '.', 'white']
  3. Using re.split() with the regex [.] *, without the grouping parentheses (), gives me a list that excludes the delimiter (.).
import re

var3 = "black.white"
print(re.split('[.] *', var3))

# result
# ['black', 'white']

I know it has something to do with the grouping parentheses (), but I couldn't understand why. So I have these three questions:

  1. Why doesn't str.split() keep the delimiter?
  2. Why does re.split() keep the delimiter?
  3. Why do the grouping parentheses () in the regex make the difference?

Note: I am new to Python and regex.

zealous
FarhanEnzo
  • Parentheses define a *capturing group* in a regular expression, so when you put your expression in parentheses, you declare that it has to be captured. – Olvin Roght Jul 17 '20 at 06:01
  • This is just how the `re.split` API works. If you place a capture group around the delimiter, then Python will retain it. By the way, the regex pattern on which you are really splitting here is just `([.])` ... the whitespace is not being used. – Tim Biegeleisen Jul 17 '20 at 06:01
  • @TimBiegeleisen your plain words explain a lot compared to the official docs! – FarhanEnzo Jul 17 '20 at 06:41

1 Answer


1. Why string.split() doesn't keep the delimiter

Because that is how str.split is defined: it returns the substrings between the delimiters and discards the delimiters themselves. In most cases that is what you want, for example when splitting a sentence into words on whitespace.
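As a small illustration of the point above, here is a sketch of the whitespace case. The variable names are made up for the example, and the str.partition call is an addition of mine rather than something from the question: it is one built-in way to keep a single delimiter when you do want it.

```python
# Splitting on whitespace: the separators are dropped, only the words remain.
sentence = "the quick  brown fox"
words = sentence.split()  # no argument means "split on runs of whitespace"
print(words)  # ['the', 'quick', 'brown', 'fox']

# If the delimiter is needed, str.partition returns it as the middle element
# (it only splits at the first occurrence).
head, sep, tail = "black.white".partition(".")
print([head, sep, tail])  # ['black', '.', 'white']
```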

 

2. Why re.split() keeps the delimiter

It doesn't, unless the pattern contains a capture group (denoted by the parentheses). Without one, re.split behaves like str.split and drops the delimiter.
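To make the comparison concrete, here is a minimal sketch using the string from the question. The third pattern uses a non-capturing group, (?:...), which is not in the question; it is included only to show that it is the capturing, not the parentheses as such, that changes the result.

```python
import re

text = "black.white"

# No group: the delimiter is discarded, just like str.split.
no_group = re.split(r"[.] *", text)
print(no_group)  # ['black', 'white']

# Capturing group: the matched delimiter is included in the result.
capturing = re.split(r"([.] *)", text)
print(capturing)  # ['black', '.', 'white']

# Non-capturing group: parentheses for grouping only, delimiter discarded.
non_capturing = re.split(r"(?:[.] *)", text)
print(non_capturing)  # ['black', 'white']
```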

 

3. Why grouping parenthesis () in regex makes the difference

When you put something in parentheses in a regex pattern, it becomes a "capture group". Normally, this lets you match a pattern against a string and then "capture" specific parts of the match, e.g.,

re.match('h.llo, wo(.)ld!', 'hello, world!').group(1)

returns 'r', because that is the character matched by the first capture group. Group 0 is always the entire match, in this case hello, world!.
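A slightly fuller sketch of the same match, showing group 0, group 1, and the groups() tuple side by side. The variable name m is only illustrative.

```python
import re

m = re.match(r"h.llo, wo(.)ld!", "hello, world!")

print(m.group(0))  # 'hello, world!'  (the entire match)
print(m.group(1))  # 'r'              (text matched by the first capture group)
print(m.groups())  # ('r',)           (all capture groups as a tuple)
```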

 

It's simply a (convenient) documented feature of re.split: if the pattern contains capture groups, the text matched by those groups is also included in the resulting list.
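A short sketch of what that means in practice, using an input with a space after the dot (my own variation on the question's string). With one group around the whole delimiter, the pieces can be joined back into the original string; with two groups, each group's text shows up as its own list element.

```python
import re

text = "black. white"

# One group around the whole delimiter: the dot and the trailing space
# come back together as a single element.
one_group = re.split(r"([.] *)", text)
print(one_group)  # ['black', '. ', 'white']

# Because nothing was discarded, joining the pieces restores the input.
print("".join(one_group) == text)  # True

# Two groups: the text of each group is returned separately.
two_groups = re.split(r"([.])( *)", text)
print(two_groups)  # ['black', '.', ' ', 'white']
```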

Liam