3

I am using a regex to extract acronyms (only specific types) from text in Python.

  • ABC (all caps within round brackets or square brackets or between word endings)
  • A.B.C (same as above but having only one '.' in between)
  • A&B&C (same as above but having only one '&' in between)

So far I am using

import re
text = "My name is STEVE. My friend works at (I.A.). Indian Army(IA). B&W also B&&W Also I...A"
re.findall('\\b[A-Z][A-Z.&]{2,7}\\b', text)

Output is: ['STEVE', 'I.A', 'B&W', 'B&&W', 'I...A']
I want to exclude B&&W and I...A, but include the IA from (IA).

I am aware of the links below, but I have been unable to apply them correctly. Kindly help.

Extract acronyms patterns from string using regex

Finding Acronyms Using Regex In Python

RegEx to match acronyms

Prince
  • Possible duplicate of [Learning Regular Expressions](https://stackoverflow.com/questions/4736/learning-regular-expressions) – Biffen Nov 05 '18 at 07:40
  • That is very broad thread @Biffen My problem is very specific – Prince Nov 05 '18 at 08:17

2 Answers

5

What you want is a capital followed by a bunch of capitals, with optional dots or ampersands in between.

re.findall('\\b[A-Z](?:[\\.&]?[A-Z]){1,7}\\b', text)

Breaking it down:

  • All backslashes are doubled because they need escaping in a regular (non-raw) Python string literal
  • \b word boundary
  • [A-Z] capital
  • (?: opening a non-capturing group
  • [\.&] character class containing . and &
  • ? optional
  • [A-Z] followed by another capital
  • ) closing non-capturing group of an optional . or &, followed by a capital
  • {1,7} repeating that group 1 to 7 times
  • \b word boundary

We want a non-capturing group since re.findall returns groups (if present).
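As a rough illustration of that point, here is a minimal sketch that reuses the sample text from the question. The variable names are made up for this example, and the capturing variant is a hypothetical pattern shown only for contrast. It also writes the pattern as a raw string, which avoids the doubled backslashes.

```python
import re

text = ("My name is STEVE. My friend works at (I.A.). Indian Army(IA). "
        "B&W also B&&W Also I...A")

# A raw string avoids doubling the backslashes; both spellings are the same pattern.
assert '\\b[A-Z](?:[.&]?[A-Z]){1,7}\\b' == r'\b[A-Z](?:[.&]?[A-Z]){1,7}\b'

# Non-capturing group: findall returns each whole match.
non_capturing = re.findall(r'\b[A-Z](?:[.&]?[A-Z]){1,7}\b', text)
print(non_capturing)  # ['STEVE', 'I.A', 'IA', 'B&W']

# Capturing group (for contrast): findall returns only the group's
# last repetition for each match, not the whole match.
capturing = re.findall(r'\b[A-Z]([.&]?[A-Z]){1,7}\b', text)
print(capturing)  # ['E', '.A', 'A', '&W']
```

With a single capturing group, re.findall returns a list of that group's contents, which is why only fragments of each acronym come back in the second call.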

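Note that [A-Z] only covers ASCII capitals. There are better ways of matching capitals that work across all of Unicode, for example \p{Lu} in the third-party regex module; the built-in re module does not support \p{...} classes.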

Note that this also matches B&WW and B&W.W, since the pattern does not require the same separator (or any separator at all) between every pair of letters. If you want that, the expression gets a bit more complex (though not much).
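A quick check of that behaviour, as a sketch only: the sample string and variable names below are invented for illustration and are not from the question.

```python
import re

# Same pattern as above, written as a raw string.
pattern = r'\b[A-Z](?:[.&]?[A-Z]){1,7}\b'

# Hypothetical samples with missing or mixed separators.
samples = "B&W B&WW B&W.W A.T&T"

mixed = re.findall(pattern, samples)
print(mixed)  # ['B&W', 'B&WW', 'B&W.W', 'A.T&T']
```

Each separator is optional and chosen independently per letter, so inconsistent forms pass just as easily as consistent ones.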

SQB
  • Hi @SQB, it just catches the E from 'STEVE' and the A from 'Army(IA)'. I think it's only catching the last character of each match within the {1,7} repetition. Output: `[('E', ''), ('A', '')]` – Prince Nov 05 '18 at 06:43
  • Ah, of course. The grouping is interfering. `re.findall` returns groups if present, so I changed the group to non-capturing. I also changed another group to a character class as it should've been in the first place. – SQB Nov 05 '18 at 07:30
  • This looks good. Thanks @SQB. Just to clarify: the **?:** in `(?:[` makes the group non-capturing, and the **?** in `&]?[A` means 0 or 1 matches. Am I correct? Could you also suggest some good references for mastering regex? – Prince Nov 05 '18 at 08:24
2

I suggest

\b[A-Z](?=([&.]?))(?:\1[A-Z])+\b

See the regex demo

Pattern details

  • \b - word boundary
  • [A-Z] - an uppercase letter
  • (?=([&.]?)) - a positive lookahead that contains a capturing group that captures into Group 1 an optional & or . char
  • (?:\1[A-Z])+ - one or more occurrences of
    • \1 - same char captured into Group 1 (so, you won't get A.T&W)
    • [A-Z] - an uppercase letter
  • \b - word boundary.

Python demo:

import re
rx = r"\b[A-Z](?=([&.]?))(?:\1[A-Z])+\b"
s = "My name is STEVE. My friend works at (I.A.). Indian Army(IA). B&W also B&&W Also I...A"
print( [x.group() for x in re.finditer(rx, s)] )
# => ['STEVE', 'I.A', 'IA', 'B&W']
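The demo uses re.finditer with .group() rather than re.findall because the pattern contains a capturing group, and re.findall would return only that group's contents. The sketch below illustrates the difference; the variable names are illustrative, and the last line checks the mixed-separator case mentioned in the pattern details.

```python
import re

rx = r"\b[A-Z](?=([&.]?))(?:\1[A-Z])+\b"
s = ("My name is STEVE. My friend works at (I.A.). Indian Army(IA). "
     "B&W also B&&W Also I...A")

# findall returns Group 1 (the captured separator) for each match.
groups_only = re.findall(rx, s)
print(groups_only)  # ['', '.', '', '&']

# finditer gives match objects, so .group() yields the whole match.
full_matches = [m.group() for m in re.finditer(rx, s)]
print(full_matches)  # ['STEVE', 'I.A', 'IA', 'B&W']

# With mixed separators the match stops where the separator changes.
mixed = [m.group() for m in re.finditer(rx, "A.T&W")]
print(mixed)  # ['A.T']
```

So for a mixed form such as A.T&W, the pattern does not reject the text outright; it matches the consistent prefix A.T and stops there.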
Wiktor Stribiżew