Regex to extract acronyms

Question

I am using a regex to extract acronyms (only specific types) from text in Python.

ABC (all caps, within round or square brackets, or between word boundaries)
A.B.C (same as above, but with exactly one '.' between letters)
A&B&C (same as above, but with exactly one '&' between letters)

So far I am using

import re
text = "My name is STEVE. My friend works at (I.A.). Indian Army(IA). B&W also B&&W Also I...A"
re.findall('\\b[A-Z][A-Z.&]{2,7}\\b', text)

Output: ['STEVE', 'I.A', 'B&W', 'B&&W', 'I...A']
I want to exclude B&&W and I...A, but include (IA).

I am aware of the below links but I am unable to use them correctly. Kindly help.

Extract acronyms patterns from string using regex

Finding Acronyms Using Regex In Python

RegEx to match acronyms

Possible duplicate of [Learning Regular Expressions](https://stackoverflow.com/questions/4736/learning-regular-expressions) – Biffen Nov 05 '18 at 07:40
That is a very broad thread, @Biffen. My problem is very specific. – Prince Nov 05 '18 at 08:17

SQB · Answer 1 · 2018-11-05T09:34:10.527

5

What you want is a capital followed by a bunch of capitals, with optional dots or ampersands in between.

re.findall('\\b[A-Z](?:[\\.&]?[A-Z]){1,7}\\b', text)

Breaking it down:

All backslashes are doubled because they need escaping in a regular (non-raw) Python string literal
\b word boundary
[A-Z] capital
(?: opening a non-capturing group
[\.&] character class containing . and &
? optional
[A-Z] followed by another capital
) closing non-capturing group of an optional . or &, followed by a capital
{1,7} repeating that group 1 - 7 times
\b word boundary

We want a non-capturing group since re.findall returns groups (if present).
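
To see why the group has to be non-capturing, here is a small sketch comparing the two forms on the sample text from the question. The capturing variant is written only for illustration; it is not necessarily the exact pattern from the earlier revision of this answer. Raw strings are used so the backslashes do not need doubling.

```python
import re

text = "My name is STEVE. My friend works at (I.A.). Indian Army(IA). B&W also B&&W Also I...A"

# Capturing group: findall returns the group's last repetition, not the whole match.
capturing = re.findall(r'\b[A-Z]([.&]?[A-Z]){1,7}\b', text)
print(capturing)      # ['E', '.A', 'A', '&W']

# Non-capturing group: findall returns the full matches.
non_capturing = re.findall(r'\b[A-Z](?:[.&]?[A-Z]){1,7}\b', text)
print(non_capturing)  # ['STEVE', 'I.A', 'IA', 'B&W']
```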

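Note that [A-Z] only matches ASCII capitals. There are better ways of matching capitals that work across all of Unicode, for example \p{Lu} in the third-party regex module (the standard re module does not support \p{...}).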

This does match B&WW and B&W.W, since we do not enforce the use of the (same) character every time. If you want that, the expression gets a bit more complex (though not much).
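
A quick check of that behaviour, as a minimal sketch (the input string here is made up for illustration):

```python
import re

# Same pattern as above, written as a raw string.
pattern = r'\b[A-Z](?:[.&]?[A-Z]){1,7}\b'

# Separators are optional and may differ within one match.
mixed = re.findall(pattern, "B&WW and B&W.W")
print(mixed)  # ['B&WW', 'B&W.W']
```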

edited Nov 05 '18 at 09:34

answered Nov 05 '18 at 06:36


Hi @SQB, it just catches the E from 'STEVE' and the A from 'Army(IA)'. I think it's just catching the last character of each match within length {1, 7}. Output: `[('E', ''), ('A', '')]` – Prince Nov 05 '18 at 06:43
Ah, of course. The grouping is interfering. `re.findall` returns groups if present, so I changed the group to non-capturing. I also changed another group to a character class as it should've been in the first place. – SQB Nov 05 '18 at 07:30
This looks good, thanks @SQB. Just to clarify: the **?:** in `(?:[` makes the group non-capturing, and the **?** in `&]?[A` means 0 or 1 matches. Am I correct? Could you also suggest some good references for mastering regex? – Prince Nov 05 '18 at 08:24

score 2 · Accepted Answer · answered Nov 05 '18 at 08:24

I suggest

\b[A-Z](?=([&.]?))(?:\1[A-Z])+\b

See the regex demo

Pattern details

\b - word boundary
[A-Z] - an uppercase letter
(?=([&.]?)) - a positive lookahead that contains a capturing group that captures into Group 1 an optional & or . char
(?:\1[A-Z])+ - one or more occurrences of
- \1 - same char captured into Group 1 (so, you won't get A.T&W)
- [A-Z] - an uppercase letter
\b - word boundary.

Python demo:

import re
rx = r"\b[A-Z](?=([&.]?))(?:\1[A-Z])+\b"
s = "My name is STEVE. My friend works at (I.A.). Indian Army(IA). B&W also B&&W Also I...A"
print( [x.group() for x in re.finditer(rx, s)] )
# => ['STEVE', 'I.A', 'IA', 'B&W']
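
The demo uses re.finditer rather than re.findall because the pattern contains a capturing group. As a small sketch of the difference on the same input, findall would return what Group 1 captured (the separator) instead of the acronym itself:

```python
import re

rx = r"\b[A-Z](?=([&.]?))(?:\1[A-Z])+\b"
s = "My name is STEVE. My friend works at (I.A.). Indian Army(IA). B&W also B&&W Also I...A"

# findall returns Group 1 for each match: the separator, or '' when there is none.
separators = re.findall(rx, s)
print(separators)  # ['', '.', '', '&']

# finditer gives match objects, so group() yields the whole acronym.
acronyms = [m.group() for m in re.finditer(rx, s)]
print(acronyms)    # ['STEVE', 'I.A', 'IA', 'B&W']
```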
