I'm trying to learn Python from "Automate the Boring Stuff with Python" and I came across a program that I don't fully understand.

import re

phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?                # area code
    (\s|-|\.)?                        # separator
    (\d{3})                           # first 3 digits
    (\s|-|\.)                         # separator
    (\d{4})                           # last 4 digits
    (\s*(ext|x|ext.)\s*(\d{2,5}))?    # extension
    )''', re.VERBOSE)



matches = []
for groups in phone_regex.findall(text):
    print('here')
    phone_number = '-'.join([groups[1], groups[3], groups[5]])
    if groups[8] != '':
        phone_number += ' x' + groups[8]
    matches.append(phone_number)

This is obviously not the whole program, but I don't understand how there can be a groups[8] when there are only 6 groups in the regex. Also, I know that groups[0] is supposedly the first group, but I don't really understand how. Does it just work like that: when there's one big tuple containing multiple tuples, the big one is considered the first one when indexed?

Also, how exactly does the for loop here work? What is it looping over? I thought something like groups = groups[0:] was necessary for the iterations to actually differ from one another in cases like this...

Thanks in advance.

1 Answer

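Note that some capturing groups contain nested capturing groups, which are counted too, and that the outermost parentheses wrapping the whole pattern form a capturing group as well. findall returns one tuple per match, so groups[0] is that outermost group (the whole match), and the numbers below are positions in that tuple.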

Hence:

  • (\d{3}|\(\d{3}\))? - is group 1.
  • \(\d{3}\) - is not a nested group, since both parentheses are escaped and match literal parentheses.
  • The separator, first 3 digits, second separator and last 4 digits are groups 2 through 5.
  • (\s*(ext|x|ext.)\s*(\d{2,5}))? - is group 6 (with 2 embedded groups).
  • (ext|x|ext.) - is group 7.
  • (\d{2,5}) - is group 8.
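
One way to double-check the count is to ask the compiled pattern itself. The sketch below rebuilds the pattern from the question and compares the regex engine's own group numbers with the tuple positions that findall returns. The sample string is made up for illustration.

```python
import re

phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?                # area code
    (\s|-|\.)?                        # separator
    (\d{3})                           # first 3 digits
    (\s|-|\.)                         # separator
    (\d{4})                           # last 4 digits
    (\s*(ext|x|ext.)\s*(\d{2,5}))?    # extension
    )''', re.VERBOSE)

# Every unescaped "(" opens a capturing group, including the outermost one.
print(phone_regex.groups)        # 9

sample = '123-456-9876 ext 22'   # made-up sample text
m = phone_regex.search(sample)

# Match.group(n) uses regex numbering: 0 is the whole match, groups start at 1.
print(m.group(9))                # 22

# findall returns plain tuples holding groups 1..9, indexed from 0,
# so regex group 9 sits at tuple position 8.
first = phone_regex.findall(sample)[0]
print(len(first), first[8])      # 9 22
```

So the "missing" groups come from the outermost parentheses and the two groups nested inside the extension part.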

To see what has been captured by each group, run:

text = '123-456-9876 ext 22, (123)-456-9876'
for groups in phone_regex.findall(text):
    print(f'Whole match: {groups[0]}')
    n = 1
    for grp in groups[1:]:
        print(f'{n}: {grp:5}', end=', ')
        n += 1
    print()

The result I got was:

Whole match: 123-456-9876 ext 22
1: 123  , 2: -    , 3: 456  , 4: -    , 5: 9876 , 6:  ext 22, 7: ext  , 8: 22   , 
Whole match: (123)-456-9876
1: (123), 2: -    , 3: 456  , 4: -    , 5: 9876 , 6:      , 7:      , 8:      , 

To answer the question in your comment: I think you should avoid constructs like groups = groups[1:]. It is syntactically correct, but it rebinds the loop variable and so tampers with the result that was found. The for loop advances on its own: on each iteration, groups is bound to the next tuple returned by findall. You can read groups[1:], as in my code above, but don't assign anything back to groups.
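
To illustrate how the loop moves forward without any help, here is a rough sketch. It uses a made-up list of tuples (the name results and its contents are hypothetical) standing in for what findall returns, and then spells out approximately what the for statement does behind the scenes.

```python
# Hypothetical stand-in for what findall returns: a list of tuples.
results = [('123-456-9876', '123', '456', '9876'),
           ('(415)-555-0000', '(415)', '555', '0000')]

seen = []
for groups in results:
    # On each pass, `groups` is rebound to the next tuple in the list.
    seen.append(groups[0])
    print(groups[0])

# Roughly what the for statement does behind the scenes:
it = iter(results)
manual = []
while True:
    try:
        groups = next(it)      # fetch the next tuple
    except StopIteration:      # raised when the list is exhausted
        break
    manual.append(groups[0])

print(seen == manual)          # True
```

Each iteration sees a different tuple because the iterator hands out the next element, not because the loop body changes anything.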

Valdi_Bo
  • I understand now, thanks a lot! I do have another question though, about the for loop. Shouldn't there be something like group = group[1:] (I know this isn't the correct syntax, but you get the point), which would actually advance the loop? The way it is now, each iteration seems to be the same, since nothing changes... – WilliamFrog8 Jun 02 '20 at 16:37