
I am trying to match the letters 'C' or 'c' as they appear in a file.

They must be standalone and NOT followed by a '+' or '.'.

The following two patterns give me the same result using Regex101, but I get a different result in the Dataquest IDE and my home PC.

The two patterns are:

pattern = r'\b[Cc]\b(?!\+|\.)'  
pattern = r"\b[Cc]\b[^.+]"

The problem line in question is: (Line 223 from the hacker_news.csv file)

MemSQL (YC W11) Raises $36M Series C

On my home PC and in Dataquest's IDE, the regex using the negative lookahead matches that line. The other regex does not.

On Regex101 they both match that line.

I am NOT supposed to match it.

I wrote the lookahead regex, which fails in Dataquest's IDE. The non-lookahead version is their answer, which passes.

I think they should both yield the same result, but they do not.

I am running Python 3.7.6.

What am I missing?

MarkS
  • `\b[Cc]\b[^.+]` won't match on regex101 either. You probably have a line break after the last `C`. See this: https://regex101.com/r/S5NH9m/3 – anubhava Dec 19 '19 at 19:56
  • I do not understand why that affects the behavior. Can you please elaborate? – MarkS Dec 19 '19 at 19:57
  • On regex101, if there is a line break after last `C` then `[^.+]` will match that line break. In your code though there is no line break hence match fails. – anubhava Dec 19 '19 at 20:05
  • The [^.+] is NOT matching that line, which is correct. Why is the lookahead regex matching it? – MarkS Dec 19 '19 at 20:12
  • Because lookahead is zero width assertion. It doesn't match, only asserts. – anubhava Dec 19 '19 at 20:17
  • I don't believe those are equal. `(?!\+|\.)` translates to "not a plus sign **or** not a period" so only one condition needs to match. Since a period is not a plus sign it passes the first condition. Whereas `[^.+]` translates to "not a period **and** not a plus sign". To make the former equivalent to the latter, I believe you need `(?![.+])` – MonkeyZeus Dec 19 '19 at 20:28
  • One important aspect of all this is that you need to figure out which version of regex Dataquest's IDE uses and which version Python 3.7.6 uses. If the regex versions are different then it's understandable to see different results for the same regex. – MonkeyZeus Dec 19 '19 at 20:33
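
The line-break point raised in the comments can be checked locally. This is a small sketch, assuming the regex101 test string ended with a newline while the string tested in Python did not; the variable names are only for illustration.

```python
import re

# The line from the question, with no trailing newline.
line = "MemSQL (YC W11) Raises $36M Series C"

char_class = re.compile(r"\b[Cc]\b[^.+]")

# Without a trailing newline there is nothing for [^.+] to consume.
no_newline = char_class.search(line)

# With a trailing newline (assumed to be what regex101 saw), the negated
# class consumes the newline, because "\n" is neither "." nor "+".
with_newline = char_class.search(line + "\n")

print(no_newline)            # None
print(repr(with_newline.group()))  # 'C\n'
```

If that assumption holds, the two tools were given slightly different input, which would explain the different results without any difference in regex flavor.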

1 Answer


`(?!\+|\.)` is a negative lookahead. It doesn't add any characters to the match; it only asserts that the current position is not followed by `.` or `+`. In your input string, the `C` at the end is followed by nothing at all, so the assertion holds and the match succeeds.

`[^.+]` matches a single character that is not a `.` or a `+`. It has to consume a character, and there are none after the `C`, so the match fails.
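
A quick way to see the difference is to run both patterns from the question against the problem line. This is a minimal sketch using Python's standard `re` module; the variable names are illustrative, and the string is assumed to have no trailing newline.

```python
import re

# Line 223 from the question, without a trailing newline.
line = "MemSQL (YC W11) Raises $36M Series C"

lookahead = re.compile(r"\b[Cc]\b(?!\+|\.)")
char_class = re.compile(r"\b[Cc]\b[^.+]")

# The lookahead only asserts; end of string is "not followed by + or .".
lookahead_match = lookahead.search(line)

# The negated class must consume one character, and none is left.
char_class_match = char_class.search(line)

print(lookahead_match is not None)   # True
print(char_class_match is not None)  # False
```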

CAustin
  • If I may, how would I need to modify the lookahead regex to match the behavior of the other? There is simply nothing after the trailing 'C'. – MarkS Dec 19 '19 at 20:17
  • I guess you could just add a `.` at the end of it so that it matches one additional character, like `\b[Cc]\b(?!\+|\.)`. Why not just use the pattern that's already working though? – CAustin Dec 19 '19 at 20:20
  • I am already using that pattern. It matches but I don't want it to. To answer your question, just trying to understand exactly what is going on. – MarkS Dec 19 '19 at 20:22
  • Sorry, I meant to write `\b[Cc]\b(?!\+|\.).` (with a dot at the end). Also, since you're only selecting one character, you should just use a character class instead of alternation: `\b[Cc]\b(?![+.]).` – CAustin Dec 19 '19 at 20:34
  • Your first new pattern, ```\b[Cc]\b(?!\+|\.).``` replicates the behavior of the non-lookahead regex. THAT is what I was looking for. – MarkS Dec 19 '19 at 20:37
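
To compare the patterns discussed in these comments side by side, here is a rough sketch. The sample strings other than the line from the question are made up for illustration, and `matched_text` is a hypothetical helper, not something from the thread.

```python
import re

# First sample is the line from the question; the rest are invented examples.
samples = [
    "MemSQL (YC W11) Raises $36M Series C",
    "Learn C today",
    "C++ tips",
    "vitamin c.",
]

patterns = {
    "char_class": r"\b[Cc]\b[^.+]",
    "lookahead_alt_dot": r"\b[Cc]\b(?!\+|\.).",
    "lookahead_class_dot": r"\b[Cc]\b(?![+.]).",
}

def matched_text(pattern, text):
    """Return the matched substring, or None if the pattern does not match."""
    m = re.search(pattern, text)
    return m.group() if m else None

results = {
    name: [matched_text(p, s) for s in samples]
    for name, p in patterns.items()
}

for name, found in results.items():
    print(name, found)
```

On these samples all three patterns agree. One difference remains: in Python, `.` does not match a newline unless `re.DOTALL` is set, while `[^.+]` does, so the patterns diverge on input such as `"Series C\n"`.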