3

I want to extract 'Physics' from the text below, but I am getting the value 'None'.

Can you please explain what the error in my code is?

My logic for the regex was as follows:

--> ^[A-Z] - for matching a word whose first character is a CAPITAL LETTER.

--> [a-z]+ - for matching the one or more lowercase characters that follow

import re
text = "111   PCM   Physics"
print(re.search(r'^[A-Z][a-z]+', text))
Omkar
Regneva
  • Change your regex to `[A-Z][a-z]+$` – Pushpesh Kumar Rajwanshi May 06 '19 at 10:08
  • @PushpeshKumarRajwanshi Can you please explain why my regex logic is wrong? – Regneva May 06 '19 at 10:10
  • `^` anchors the match to the *beginning* of the string. If you want to match `Physics` here (without matching `PCM`, mind you), use `[A-Za-z]+$` – zero May 06 '19 at 10:10
  • You've used the start anchor `^`, which means the match must begin at the very start of the line: one capital letter for `[A-Z]`, followed by one or more lowercase letters for `[a-z]+`. But the word you want is at the end of the line, not the start. So use the end-of-line anchor `$` instead; that way the pattern matches a word that starts with a capital letter and ends at the end of the line. So you're a little mistaken about how `^` works. Hope it is clear now. Let me know if you have any further queries. – Pushpesh Kumar Rajwanshi May 06 '19 at 10:13

3 Answers

1

If you want a regex pattern to find the last capitalized word in the text, then use this:

[A-Z][a-z]+$

That being said, there is a caveat with re.match: it only matches at the beginning of the string. Since we also use the end anchor $, the pattern has to account for the entire input string, so we should use this code:

import re
text = "111   PCM   Physics"
# .* consumes the leading text; the group captures the final capitalized word
m = re.match(r'^.*([A-Z][a-z]+)$', text)
print(m.group(1))
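
As a quick check of the re.match caveat, here is a minimal sketch that applies the bare pattern with both functions. The variable names are only illustrative. re.search scans the whole string, so it should not need the leading `.*`:

```python
import re

text = "111   PCM   Physics"
pattern = r'[A-Z][a-z]+$'

# re.match only tries position 0, where the text starts with "1", so it fails.
match_result = re.match(pattern, text)

# re.search scans forward and finds the capitalized word that ends the string.
search_result = re.search(pattern, text)

print(match_result)           # None
print(search_result.group())  # Physics
```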

But note that we could just as easily have used re.split here, splitting the input text on whitespace:

parts = re.split(r'\s+', text)
print(parts[2])
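
A possible variant without regex, assuming the wanted word is always the last whitespace-separated token (an assumption, not something the question states). Indexing with -1 avoids hardcoding the position:

```python
text = "111   PCM   Physics"

# str.split() with no arguments splits on runs of whitespace.
parts = text.split()

print(parts)      # ['111', 'PCM', 'Physics']
print(parts[-1])  # Physics
```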
Tim Biegeleisen
1

The code below worked for me to extract "Physics" from the given text.

text = "111 PCM Physics"
if "Physics" in text:
    print("Yes, Physics is present in the given text")
    s = text.find("Physics")
    print(text[s:s+7])  # 7 is the length of "Physics"
else:
    print("No, Physics is not present in the given text")
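
The same idea can be sketched as a small helper that avoids the hardcoded length. The function name `extract_word` is hypothetical, chosen only for this illustration:

```python
def extract_word(text, word):
    """Return `word` sliced out of `text`, or None if it is absent.

    Hypothetical helper for illustration only.
    """
    start = text.find(word)  # str.find returns -1 when the word is absent
    if start == -1:
        return None
    return text[start:start + len(word)]

print(extract_word("111 PCM Physics", "Physics"))  # Physics
print(extract_word("111 PCM Physics", "Maths"))    # None
```

Note that this only works when the word to extract is already known, which is a narrower task than matching any capitalized word.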
Omkar
1

This happens because the ^ in the pattern requires [A-Z] to match at the very start of the string. In the given input, text = "111 PCM Physics", the string starts with 111, so the pattern cannot match the word Physics.

As per documentation,

^ Matches at the beginning of lines.

This means that when a pattern starts with ^, the regex engine only accepts a match that begins at the start of the string (or of a line, in multiline mode); the rest of the pattern must match immediately after that point. For example, r'^[A-Z][a-z]+' matches only when the text itself begins with a capitalized word such as Physics or Ankit.
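
To illustrate the behaviour of ^, here is a minimal sketch. The sample strings other than the one from the question are made up for the demonstration:

```python
import re

pattern = r'^[A-Z][a-z]+'

# The string itself starts with a capitalized word, so ^ can match.
starts_with_word = re.search(pattern, "Physics 111 PCM")

# The string starts with digits, so ^ prevents any match.
starts_with_digits = re.search(pattern, "111   PCM   Physics")

# With re.MULTILINE, ^ also matches right after each newline.
multiline = re.search(pattern, "111 PCM\nPhysics", re.MULTILINE)

print(starts_with_word.group())  # Physics
print(starts_with_digits)        # None
print(multiline.group())         # Physics
```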

You could consider the pattern below, without the caret symbol. It will match capitalized words anywhere in the input text.

pattern = r'[A-Z][a-z]+'
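
A minimal sketch of how this pattern might be used. The second input string is an assumed example, not from the question:

```python
import re

text = "111   PCM   Physics"
pattern = r'[A-Z][a-z]+'

# re.search returns the first capitalized word; "PCM" is skipped because
# no capital letter in it is followed by a lowercase letter.
first = re.search(pattern, text)
print(first.group())  # Physics

# re.findall collects every capitalized word in the text.
all_words = re.findall(pattern, "111 PCM Physics Chemistry")
print(all_words)  # ['Physics', 'Chemistry']
```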
Swadhikar