
I'm new to learning regex, and I came across a problem that I solved, although I'm not sure why it was a problem and would just like to learn a bit more!

I'm using Python for my regex statement. The relevant portion of text to be captured is below (I've changed the exact numbers, but this is what it looks like):

Evaluation Type: InterimContract Percent Complete: 30%Period of Performance Being Assessed: 05/27/2013 -

I'm looking to capture Interim and 05/27/2013. The regex that I was using that did NOT work was

match = re.search(
    r"Evaluation Type:[\s\n]*(.*?)[\s\n]*Contract Percent[.]*"
    r"Period of Performance Being Assessed:[\s\n]*(.*?)[\s\n]*-",
    page_content)

The code that does work is

match = re.search(
    r"Evaluation Type:[\s\n]*(.*?)[\s\n]*Contract Percent.*"
    r"Period of Performance Being Assessed:[\s\n]*(.*?)[\s\n]*-",
    page_content)

As you may notice, the difference is that I removed the square brackets around the . at the end of line 2.

I understand that the brackets weren't actually needed (just helped me visualize it as I'm creating the regex) but I'm not sure why they broke it. I was getting no match with the first set of code, while a perfect match with the second. I'm sure it's some simple little thing, but I couldn't find what would be breaking from my searches online (although it could be that I don't understand enough in depth to know what I'm looking for)

Mark R
  • Characters inside square brackets are treated as literals. Hence, `[.]*` means a literal dot, zero or more times, while `.*` means _any_ character zero or more times. It should be clear to you that the former won't match your input, while the latter (apparently) does. – Tim Biegeleisen Aug 08 '17 at 14:35
  • Your RE is overcomplicated, all of your square brackets are not needed and don't actually do what you intended them to do. In fact, this RE - `Type:\s*(.*?)\s*Contract.*Assessed:\s*(.*?)\s*-` will give you the exact same result. – Dror Av. Aug 08 '17 at 14:43
  • @WiktorStribiżew, it would be more helpful if you include that link as a comment rather than marking it as a duplicate. That compilation of links isn't very helpful unless you know what you are looking for (which beginners like me don't). If I was simply asking for regex code I could understand marking as a duplicate, but I was looking for an explanation which is something that that link can't help with (I didn't know which link would help me solve my problem because I didn't really understand what the problem was) – Mark R Aug 08 '17 at 14:54
  • You are looking for the details on the 2 regex patterns. Just go to http://regex101.com, and you will see immediately where the problem is. That post provides the link to regex101 and many other posts that help. I have added another close reason that deals with your case exactly (the link is also present in the *What does this regex mean* post). – Wiktor Stribiżew Aug 08 '17 at 14:58
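As a quick check of the shorter pattern suggested in the comments, here is a minimal sketch that runs both the asker's working pattern and the simplified one against the sample line from the question. The variable names (`page_content`, `original`, `simplified`) are only for illustration, and the real page may contain whitespace or newlines that the sample line does not.

```python
import re

# Sample line copied from the question; the real page content may differ.
page_content = (
    "Evaluation Type: InterimContract Percent Complete: 30%"
    "Period of Performance Being Assessed: 05/27/2013 -"
)

# The asker's working pattern, written as raw strings.
original = re.search(
    r"Evaluation Type:[\s\n]*(.*?)[\s\n]*Contract Percent.*"
    r"Period of Performance Being Assessed:[\s\n]*(.*?)[\s\n]*-",
    page_content,
)

# The shorter pattern suggested in the comments.
simplified = re.search(
    r"Type:\s*(.*?)\s*Contract.*Assessed:\s*(.*?)\s*-",
    page_content,
)

print(original.groups())
print(simplified.groups())
```

On this sample both patterns capture the same two groups. `\s` already matches newlines, which is why `[\s\n]` can be reduced to `\s`.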

1 Answer

  • `[.]*` means zero or more literal dots.
  • `.*` means zero or more of any character except a newline.

A dot inside a character class loses its special meaning.
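To see the difference on the text from the question, here is a minimal sketch that runs both patterns against the sample line. The names `text`, `broken`, and `working` are only for illustration; the patterns themselves are the asker's, written as raw strings.

```python
import re

# Sample line copied from the question.
text = (
    "Evaluation Type: InterimContract Percent Complete: 30%"
    "Period of Performance Being Assessed: 05/27/2013 -"
)

# [.]* can only consume literal dots, so it cannot skip over
# " Complete: 30%" to reach "Period of Performance ...".
broken = re.search(
    r"Evaluation Type:[\s\n]*(.*?)[\s\n]*Contract Percent[.]*"
    r"Period of Performance Being Assessed:[\s\n]*(.*?)[\s\n]*-",
    text,
)

# .* consumes any characters (except newline), so it bridges the gap.
working = re.search(
    r"Evaluation Type:[\s\n]*(.*?)[\s\n]*Contract Percent.*"
    r"Period of Performance Being Assessed:[\s\n]*(.*?)[\s\n]*-",
    text,
)

print(broken)            # None: no match
print(working.group(1))  # Interim
print(working.group(2))  # 05/27/2013
```

If the text between the two labels spanned several lines, `.*` would also need the `re.DOTALL` flag, since by default the dot does not match a newline.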

glibdud
Toto
  • Oh that makes so much more sense now! Does this apply only to the dot or are there other characters that lose their meaning (such as ^)? – Mark R Aug 08 '17 at 14:38
  • @glibdud: Thanks for editing ;) – Toto Aug 08 '17 at 14:41
  • @MarkR See the [re documentation](https://docs.python.org/3.6/library/re.html#regular-expression-syntax); specifically, the explanation of the `[]` special characters. – glibdud Aug 08 '17 at 14:43
  • @MarkR: Many special characters change their meaning inside a character class. `^` is one of them: outside a class it means start of string, but as the first character of a class it negates the contents, so `[^abc]` means any character that IS NOT a, b, or c. – Toto Aug 08 '17 at 14:43
  • Oh great, now I understand better what the documentation was saying. It is just a bit hard to understand when first trying it out – Mark R Aug 08 '17 at 14:49
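The behaviour of `.` and `^` described in these comments can be tried out directly. This is a small sketch using made-up test strings, not anything from the original page:

```python
import re

# Outside a character class, "." matches any character except newline.
any_char = re.findall(r".", "a.b")

# Inside a character class, "." is just a literal dot.
literal_dot = re.findall(r"[.]", "a.b")

# Outside a class, "^" anchors the match at the start of the string.
anchored = re.findall(r"^abc", "abcabc")

# As the first character of a class, "^" negates the class.
negated = re.findall(r"[^abc]", "abcxyz")

# Anywhere else in a class, "^" is a literal caret.
literal_caret = re.findall(r"[a^]", "a^b")

print(any_char)       # ['a', '.', 'b']
print(literal_dot)    # ['.']
print(anchored)       # ['abc']
print(negated)        # ['x', 'y', 'z']
print(literal_caret)  # ['a', '^']
```

A few characters stay special inside a class, such as `\`, `]`, and `-` between two characters; the `re` documentation linked above lists the details.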