I'm trying to write a function that captures phone numbers written in a canonical form, (XXX)XXX-XXXX or XXX-XXX-XXXX, with some additional conditions. This is my approach:

import re

def parse_phone2(s):

    phone_number = re.compile(r'''^\s*\(?       # Beginning of string, ignore leading spaces
                                  ([0-9]{3})    # Area code
                                  \)?\s*|-?     # Match 0 or 1 ')' followed by 0 or more spaces or match a single hyphen
                                  ([0-9]{3})    # Three digit
                                  -?            # hyphen
                                  ([0-9]{4})    # four digits
                                  \s*$          # End of string. ignore trailing spaces''', re.VERBOSE)
    try:
        return (phone_number.match(s).groups())
    except AttributeError as e:
        raise ValueError

I was failing the test case ' (404) 555-1212 ', but another question on SO suggested replacing \)?\s*|-? with (?:\)?\s*|-?), and that works. The problem is that I don't understand the difference between the two, nor the purpose of (?:...) beyond creating non-capturing groups. The docs aren't clear enough for me either.

https://docs.python.org/3/library/re.html

Juan David
  • `(?:...)` can be used to structure your regular expression but unlike the normal parenthesis it will not create a capture group. – Klaus D. Sep 30 '17 at 20:44
  • The question mark character, ?, matches either once or zero times; you can think of it as marking something as being optional. For example, home-?brew matches either homebrew or home-brew. – Fady Saad Sep 30 '17 at 20:44
  • @FadySaad That's not my question. I'm asking why my function passes when I use `(?:\)?\s*|-?)` instead of `\)?\s*|-?` – Juan David Sep 30 '17 at 20:50

2 Answers

Consider a simpler example:

re.compile(r'(?:a|b)*')

which simply matches a (possibly empty) string of as and bs. The only difference between this and

re.compile(r'(a|b)*')

is that, with the second pattern, the matching engine also captures the last character matched by the group, for retrieval with the group method. When you need the grouping but not the captured text, using a non-capturing group is just an optimization that speeds up the match (or at least saves memory).
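A minimal sketch of that difference, using the two patterns above on the made-up input 'aab' (the variable names are only for illustration):

```python
import re

capturing = re.compile(r'(a|b)*')
non_capturing = re.compile(r'(?:a|b)*')

m1 = capturing.match('aab')
m2 = non_capturing.match('aab')

# Both patterns consume exactly the same text.
print(m1.group(0))  # aab
print(m2.group(0))  # aab

# Only the capturing version records a group. A repeated group keeps
# the text from its final iteration, here the trailing 'b'.
print(m1.groups())  # ('b',)
print(m2.groups())  # ()
```

Both versions group a|b so that * applies to the whole alternation; only the bookkeeping differs.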

chepner

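The part you replaced contains an alternation token, |. Alternation matches either everything before the token or everything after it, and it has the lowest precedence of any regex operator. Splitting a regex across lines, as you've done here with re.VERBOSE, isn't grouping, so the | doesn't just separate what's before and after it on the same line: it splits the whole pattern into two alternatives, one made of everything on the lines before it and one made of everything on the lines after it.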
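To see this in practice, here is a small sketch that uses the question's pattern with the comments and whitespace stripped out (the name `ungrouped` is only for illustration). The left alternative succeeds on its own, so the match ends right after the area code:

```python
import re

# The question's pattern, written on one line. The top-level | splits it into
#   left:  ^\s*\(?([0-9]{3})\)?\s*
#   right: -?([0-9]{3})-?([0-9]{4})\s*$
ungrouped = re.compile(r'^\s*\(?([0-9]{3})\)?\s*|-?([0-9]{3})-?([0-9]{4})\s*$')

m = ungrouped.match(' (404) 555-1212 ')

# Only the left alternative took part in the match.
print(repr(m.group(0)))  # ' (404) '
print(m.groups())        # ('404', None, None)
```

Note that match does not return None here, so the function in the question raises no ValueError; it returns a tuple with two None entries instead.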

Grouping should instead be done by surrounding the alternatives in parentheses, but by default this also "captures" the group, meaning the matched text is returned as one of the groups when you call groups(). To specify that it should not be captured, add ?: right after the opening parenthesis.
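Putting that together, a sketch of the question's function with the alternation wrapped in a non-capturing group might look like this. The module-level constant name and the error message are my own choices, not something from the question:

```python
import re

# Same pattern as in the question, except that the separator alternatives
# are wrapped in (?:...) so the | only applies inside the parentheses.
PHONE_NUMBER = re.compile(r'''
    ^\s*\(?            # start of string, leading spaces, optional opening parenthesis
    ([0-9]{3})         # area code
    (?:\)?\s*|-?)      # optional closing parenthesis plus spaces, or an optional hyphen
    ([0-9]{3})         # three digits
    -?                 # optional hyphen
    ([0-9]{4})         # four digits
    \s*$               # trailing spaces, end of string
''', re.VERBOSE)


def parse_phone2(s):
    match = PHONE_NUMBER.match(s)
    if match is None:
        raise ValueError(f'not a phone number: {s!r}')
    return match.groups()


print(parse_phone2(' (404) 555-1212 '))  # ('404', '555', '1212')
print(parse_phone2('404-555-1212'))      # ('404', '555', '1212')
```

Because the group is non-capturing, groups() still returns exactly three items; with plain parentheses the separator would show up as a fourth.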

glennsl