
I'm working my way through a beginner's Python book and there are two fairly simple things I don't understand. I was hoping someone here might be able to help.

The example in the book uses regular expressions to take in email addresses and phone numbers from a clipboard and output them to the console. The code looks like this:

#! python3
# phoneAndEmail.py - Finds phone numbers and email addresses on the clipboard.

import pyperclip, re

# Create phone regex.
phoneRegex = re.compile(r'''(
(\d{3}|\(\d{3}\))?              #[1] area code
(\s|-|\.)?                      #[2] separator
(\d{3})                         #[3] first 3 digits
(\s|-|\.)                       #[4] separator
(\d{4})                         #[5] last 4 digits
(\s*(ext|x|ext.)\s*(\d{2,5}))?  #[6] extension
)''', re.VERBOSE)

# Create email regex.
emailRegex = re.compile(r'''(
[a-zA-Z0-9._%+-]+   
@                   
[a-zA-Z0-9.-]+      
(\.[a-zA-Z]{2,4})   
)''', re.VERBOSE)

# Find matches in clipboard text.
text = str(pyperclip.paste())           
matches = []                             

for groups in phoneRegex.findall(text):  
    phoneNum = '-'.join([groups[1], groups[3], groups[5]])
    if groups[8] != '':
        phoneNum += ' x' + groups[8]
    matches.append(phoneNum)

for groups in emailRegex.findall(text):
    matches.append(groups[0])           

# Copy results to the clipboard.
if len(matches) > 0:                    
    pyperclip.copy('\n'.join(matches))
    print('Copied to Clipboard:')
    print('\n'.join(matches))
else:
    print('No phone numbers or email addresses found')

Okay, so firstly, I don't really understand the phoneRegex object. The book mentions that adding parentheses will create groups in the regular expression.

If that's the case, are my assumed index values in the comments wrong and should there really be two groups in the index marked one? Or if they're correct, what does groups[7,8] refer to in the matching loop below for phone numbers?

Secondly, why does the emailRegex use a mixture of lists and tuples, while the phoneRegex uses mainly tuples?

Edit 1

Thanks for the answers so far, they've been helpful. Still kind of confused on the first part though. Should there be eight indexes like rock321987's answer or nine like sweaver2112's one?

Edit 2

Answered, thank you.

rsylatian
  • Note that in the `#[1] area code` part the *inner* parentheses are backslash-escaped. That means they won't act as grouping meta-characters, but will instead match literal parentheses. – Lukas Graf May 24 '16 at 19:48
  • For future reference: http://stackoverflow.com/questions/4736/learning-regular-expressions – Fabian N. May 24 '16 at 19:50
  • Thanks Fabian, I'll have a look at that now. Lukas, I don't really understand what you mean by grouping meta-characters. Is that what's causing the formatting to turn red, or is it something that will affect the regular expression itself? – rsylatian May 24 '16 at 20:19
  • There are 9 capture groups since there are 9 pairs of unescaped round brackets. Group 0 is the whole match, groups with IDs from 1 to 9 are those capture groups. – Wiktor Stribiżew May 24 '16 at 20:32
  • It's 9. Actually, I missed the initial one. – rock321987 May 24 '16 at 20:39

3 Answers

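Every opening parenthesis ( marks the beginning of a capture group, and groups can be nested. They are numbered by the position of their opening parenthesis, counting from left to right: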

(                               #[1] around whole pattern
(\d{3}|\(\d{3}\))?              #[2] area code
(\s|-|\.)?                      #[3] separator
(\d{3})                         #[4] first 3 digits
(\s|-|\.)                       #[5] separator
(\d{4})                         #[6] last 4 digits
(\s*(ext|x|ext.)\s*(\d{2,5}))?  #[7,8,9] extension
)
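
One way to double-check the count is to ask Python itself. This is only a minimal sketch: it recompiles the pattern from the question with the numbering above, and `sample` is a made-up string, not something from the book. Note that `findall()` returns tuples containing groups 1 to 9, so tuple index `i` holds regex group `i + 1`, which is why the book's loop reads `groups[1]`, `groups[3]`, `groups[5]` and `groups[8]`.

```python
import re

phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?              # group 2: area code
    (\s|-|\.)?                      # group 3: separator
    (\d{3})                         # group 4: first 3 digits
    (\s|-|\.)                       # group 5: separator
    (\d{4})                         # group 6: last 4 digits
    (\s*(ext|x|ext.)\s*(\d{2,5}))?  # groups 7, 8, 9: extension
    )''', re.VERBOSE)

# Number of capture groups in the compiled pattern.
print(phone_regex.groups)

sample = 'Call 415-555-1234 ext 99 today'  # hypothetical input
rows = phone_regex.findall(sample)
for row in rows:
    # Each row is a tuple of groups 1..9 (group 0, the whole match, is not included).
    for i, value in enumerate(row):
        print(f'groups[{i}] (regex group {i + 1}): {value!r}')
```

The escaped `\(` and `\)` in the area-code line do not add groups; only the unescaped parentheses are counted.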

In Python you could use named groups here, (?P<groupname>pattern), along with non-capturing parentheses, (?:pattern), which group without capturing anything. And remember to capture quantified constructs rather than quantify captured constructs, so that an optional part still yields a group (possibly empty) instead of a group that never takes part in the match:

(?P<areacode>(?:\d{3}|\(\d{3}\))?)
(?P<separator>(?:\s|-|\.)?)
(?P<exchange>\d{3})
(?P<separator2>\s|-|\.)
(?P<lastfour>\d{4})
(?P<extension>(?:\s*(?:ext|x|ext.)\s*(?:\d{2,5}))?)
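
For what it's worth, here is a rough sketch of how a named-group version might be used from Python, where the named-group syntax is `(?P<name>...)`. Two things are my own assumptions rather than part of the pattern above: the `extension` group here captures only the digits, and the dot in `ext\.` is escaped. The `sample` string is also made up.

```python
import re

named_phone_regex = re.compile(r'''
    (?P<areacode>(?:\d{3}|\(\d{3}\))?)
    (?P<separator>(?:\s|-|\.)?)
    (?P<exchange>\d{3})
    (?P<separator2>\s|-|\.)
    (?P<lastfour>\d{4})
    (?:\s*(?:ext|x|ext\.)\s*(?P<extension>\d{2,5}))?   # assumption: capture digits only
    ''', re.VERBOSE)

sample = 'Office: (415) 555-1234 x42'  # hypothetical input
found = []
for match in named_phone_regex.finditer(sample):
    # groupdict() maps each group name to its text (None if the group took no part).
    parts = match.groupdict()
    number = '-'.join([parts['areacode'], parts['exchange'], parts['lastfour']])
    if parts['extension']:
        number += ' x' + parts['extension']
    found.append(number)
    print(number)
```

With names, the loop no longer depends on counting parentheses, so adding or removing a group does not shift the indexes.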
Scott Weaver
  • I could be wrong but shouldn't your indexes be zero based? – Jinjubei May 24 '16 at 19:51
  • I looked it up, @Jinjubei: "Group 0 is always present; it’s the whole RE, so match object methods all have group 0 as their default argument. Later we’ll see how to express groups that don’t capture the span of text that they match." – Scott Weaver May 24 '16 at 19:52
(                               #[1] around whole pattern
(\d{3}|\(\d{3}\))?              #[2] area code
(\s|-|\.)?                      #[3] separator
(\d{3})                         #[4] first 3 digits
(\s|-|\.)                       #[5] separator
(\d{4})                         #[6] last 4 digits
(\s*(ext|x|ext.)\s*(\d{2,5}))?  #[7] extension
    <---------->   <------->
      ^^               ^^
      ||               ||
      [8]              [9]
)

Second Question

You are mixing up Python syntax with regex syntax. Inside a regular expression:

[] is a character class (not a list)

() is a capturing group (not a tuple)

So whatever appears inside them has nothing to do with Python lists or tuples. Regex can be considered a language of its own, and (), [], etc. are part of that language.
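
A small illustration of the difference, using a made-up `text` string. The only actual Python list and tuples involved are the ones `re.findall()` hands back as its return value:

```python
import re

text = 'cat cot cut'  # hypothetical input

# [aou] is a character class: it matches exactly one character out of a, o, u.
whole_words = re.findall(r'c[aou]t', text)
print(whole_words)

# (...) is a capture group: with one group, findall() returns what it captured.
vowels = re.findall(r'c([aou])t', text)
print(vowels)

# With two or more groups, findall() returns a Python list of Python tuples.
pairs = re.findall(r'(c)([aou])t', text)
print(pairs)
print(type(pairs).__name__, type(pairs[0]).__name__)
```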

rock321987

For the first part of your question, see sweaver2112's answer.

For the second part: both regexes use square brackets and parentheses (what you are calling lists and tuples). In regex, \d is the same as [0-9] for ASCII text; it's just easier to write. In the same vein, they could have written \w, which is shorthand for [a-zA-Z0-9_], but that wouldn't account for characters such as . and -, so it's simpler to spell out [a-zA-Z0-9.-].
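
A quick way to see how the shorthand classes compare, as a sketch only. The `sample` string is invented, and `re.ASCII` is passed so that `\w` is limited to `[a-zA-Z0-9_]` (without that flag, `\w` and `\d` also match non-ASCII letters and digits in Python 3):

```python
import re

sample = 'user.name-99_x'  # hypothetical input

# \d and [0-9] pick out the same characters on ASCII text.
digits_short = re.findall(r'\d', sample)
digits_class = re.findall(r'[0-9]', sample)
print(digits_short, digits_class)

# \w covers letters, digits and underscore, but not '.' or '-'.
word_runs = re.findall(r'\w+', sample, re.ASCII)
print(word_runs)

# The explicit class from the email regex allows '.' and '-', but not '_'.
class_runs = re.findall(r'[a-zA-Z0-9.-]+', sample)
print(class_runs)
```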

Jinjubei