
I am trying to split a comma-delimited string using the code below. The regex comes from one of my online courses. I am trying to understand how this regex with lookaround works, but I can't follow it completely. Can someone explain how it works?

I know that ?: marks a non-capturing group and ?= is a lookahead, but I'm not sure how they work in this context.

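import re

# Split on commas that are not inside a pair of double quotes.
pattern = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')
text = 'tarcac,"this is, test1","this is, test2", 123566, testdata'
results = pattern.split(text)
for r in results:
    print(r.strip())  # strip() removes the space left after each comma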

Output is

tarcac
"this is, test1"
"this is, test2"
123566
testdata
AngiSen

1 Answer

Let's break this down.

  1. We are trying to look for commas which are separating parts of a string. So first we need to look for a comma

    • ,
  2. Now we need to look ahead to make sure this comma is not within a pair of quotation marks. We look ahead with (?=...), which checks the text after the comma without consuming it

    • We are looking for the first " so we match as many non-quote characters as possible with [^"]* (inside brackets, ^" means any character that is not a quotation mark), and then we match the first quote. Together this makes:
      • [^"]*"
    • Once we have matched the first quote, we need to look for the second. There can be any number of non-quote characters ([^"]*) between them, so we repeat
      • [^"]*"
    • We want to match such pairs of quotes as many times as possible (without capturing), so we look for zero or more occurrences of quoted strings
      • (?:[^"]*"[^"]*")*
    • So far the lookahead is: (?=(?:[^"]*"[^"]*")*)
  3. Finally, still inside the lookahead, we match any remaining non-quote characters and then the end of the string (indicated by $)
    • [^"]*$

Together all of this gives ,(?=(?:[^"]*"[^"]*")*[^"]*$)
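To make the pieces easier to see, here is a sketch of the same pattern written with re.VERBOSE, which lets whitespace and comments sit inside the regex. The layout and the variable name parts are my own; the pattern itself should be equivalent to the one in the question.

```python
import re

# Same pattern as in the question, spread out so each piece can carry a comment.
pattern = re.compile(r'''
    ,                   # the comma we may split on
    (?=                 # lookahead: check what follows without consuming it
        (?:             #   non-capturing group for one pair of quotes
            [^"]*"      #     non-quote characters, then the opening quote
            [^"]*"      #     non-quote characters, then the closing quote
        )*              #   zero or more such pairs
        [^"]*$          #   then only non-quote characters up to the end
    )
''', re.VERBOSE)

text = 'tarcac,"this is, test1","this is, test2", 123566, testdata'
parts = [p.strip() for p in pattern.split(text)]
print(parts)
```

Because the lookahead consumes nothing, only the comma itself is removed by the split.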

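Essentially, it matches a comma only if every quote character " after the comma can be paired with a closing quote, that is, if an even number of quotes remains between the comma and the end of the string. A comma inside a quoted field is followed by an odd number of quotes, because that field's closing quote has no partner, so the lookahead fails there. This is why the output doesn't separate at the commas within "this is, test1" and "this is, test2".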
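The even-number-of-quotes idea can also be sketched without a regex. The function name split_outside_quotes is hypothetical, and this is only an illustration of what the lookahead checks, assuming the quotes in the input are balanced; it is not a replacement for a real CSV parser.

```python
text = 'tarcac,"this is, test1","this is, test2", 123566, testdata'

def split_outside_quotes(s):
    """Split on commas that are followed by an even number of double quotes."""
    parts, start = [], 0
    for i, ch in enumerate(s):
        # s.count('"', i + 1) counts the quotes after position i,
        # which is the same thing the lookahead verifies for each comma.
        if ch == ',' and s.count('"', i + 1) % 2 == 0:
            parts.append(s[start:i])
            start = i + 1
    parts.append(s[start:])  # the last field has no trailing comma
    return parts

fields = [p.strip() for p in split_outside_quotes(text)]
print(fields)
```

For the sample string, the commas inside the two quoted fields are followed by 3 and 1 quotes respectively, so they are skipped; every other comma is followed by 4, 2, or 0 quotes and becomes a split point.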

Jacob Boertjes