How to write a regex pattern to extract the location from ATM withdrawal transactions in a bank statement

Question

I want to write a regex pattern that extracts the address or location from a narration string, for a dataset of 350k records.

txn_add <- data.frame(NARRATION=c("$ $ $ +YBL PATAUDI CHOWK \\ $",
                  "$ $ -ATM CASH 83181 + MAIN BHAWANA ROAD NEW DELHI $",
                  "$ $ [5839/P1TNDE06/+RAGHUBARPURA $",
                  "$ MAXIMUMOUTFITS PRIVATE LIMITED } $ ATDELHIIN- $ $ /5631 $",
                  "$ ATM CASH-N4077800-+SPRINGFIELDCOLONYFFAR IDABADHRIN-04/06/18 $ /5631 ( $ $ VERIFICATION $"))

I ran the following regex pattern:

gsub(".*[:|+]([^.]+)[$|\\|\\/].*", "\\1", txn_add$NARRATION)

And I got this output:

[1] "YBL PATAUDI CHOWK  "                                                  
[2] " MAIN BHAWANA ROAD NEW DELHI "                                        
[3] "RAGHUBARPURA "                                                        
[4] "$ MAXIMUMOUTFITS PRIVATE LIMITED } $ ATDELHIIN- $ $ /5631 $"          
[5] "SPRINGFIELDCOLONYFFAR IDABADHRIN-04/06/18 $ /5631 ( $ $ VERIFICATION "

This output is not correct, because I have to implement some conditions. The address can start with:

1. '+'
2. '@'
3. ' AT '
4. ':'
5. <P|S><SBI><P|S>              # EXACT TEXT PRECEDED AND FOLLOWED BY PUNCTUATION OR SPACE
6. <NNN> FOLLOWED BY <P|S|A>    # 3 DIGITS FOLLOWED BY EITHER PUNCTUATION OR SPACE OR ALPHA

And end with:

1. -
2. /
3. $
4. \
5. <NNNNNNN>     # A run of digits

It can contain:

Letters, digits, dot (.), dash (-), space ( ), comma (,), underscore (_), parentheses (()), at sign (@), hash (#), ampersand (&), and semicolon (;)

The goal is to extract the address from the transaction. The desired output is:

[1] "YBL PATAUDI CHOWK"                                                  
[2] "MAIN BHAWANA ROAD NEW DELHI "                                        
[3] "RAGHUBARPURA "                                                        
[4] "DELHIIN"          
[5] "SPRINGFIELDCOLONYFFAR IDABADHRIN"

I am not able to get the desired output. What can I try next?

The fourth bird · Answer 1 · 2021-01-28T08:57:16.253

You might use a capture group

(?:[+@:]|\bAT(?!M))\s*([A-Z]+(?:\s+[A-Z]+)*)

Explanation

(?: Non capture group
- [+@:] Match one of + @ :
- | Or
- \bAT(?!M) Match AT at a word boundary, not directly followed by M (so ATM is skipped)
) Close group
\s* Match 0+ whitespace chars
( Capture group 1
- [A-Z]+(?:\s+[A-Z]+)* Match one or more words of chars A-Z, separated by 1+ whitespace chars
) Close group 1
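
To make the word boundary and the negative lookahead concrete, here is a small check in R. The three test strings are fragments of the sample narrations, and the variable names are only illustrative.

```r
# The "AT" start marker on its own; perl = TRUE is needed for the lookahead.
at_marker <- "\\bAT(?!M)"

# Fragments taken from the sample narrations in the question.
fragments <- c("ATM CASH", "ATDELHIIN-", "PATAUDI CHOWK")

# TRUE only where "AT" starts a word and is not directly followed by "M".
at_hits <- grepl(at_marker, fragments, perl = TRUE)
print(at_hits)
```

Only the second fragment should be accepted: "ATM" is rejected by the lookahead, and the "AT" inside "PATAUDI" does not sit at a word boundary.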

Using sub, with .* on both sides of the group so that the whole string is replaced by the captured text:

sub(".*(?:[+@:]|\\bAT(?!M))\\s*([A-Z]+(?:\\s+[A-Z]+)*).*", "\\1", txn_add$NARRATION, perl=TRUE)
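
One thing to watch for on 350k records: sub returns the input unchanged when the pattern does not match, so unmatched narrations would silently look like locations. A minimal sketch of one way to guard against that follows. The helper name extract_location, the LOCATION column and the last test narration are hypothetical and not part of the original answer.

```r
txn_add <- data.frame(
  NARRATION = c("$ $ $ +YBL PATAUDI CHOWK \\ $",
                "$ $ -ATM CASH 83181 + MAIN BHAWANA ROAD NEW DELHI $",
                "$ $ [5839/P1TNDE06/+RAGHUBARPURA $",
                "$ MAXIMUMOUTFITS PRIVATE LIMITED } $ ATDELHIIN- $ $ /5631 $",
                "$ ATM CASH-N4077800-+SPRINGFIELDCOLONYFFAR IDABADHRIN-04/06/18 $ /5631 ( $ $ VERIFICATION $"),
  stringsAsFactors = FALSE
)

pattern <- ".*(?:[+@:]|\\bAT(?!M))\\s*([A-Z]+(?:\\s+[A-Z]+)*).*"

# Hypothetical helper: returns capture group 1, or NA where the pattern
# does not match (sub() alone would return the original string there).
extract_location <- function(x, pattern) {
  matched <- grepl(pattern, x, perl = TRUE)
  out <- sub(pattern, "\\1", x, perl = TRUE)
  out[!matched] <- NA_character_
  out
}

txn_add$LOCATION <- extract_location(txn_add$NARRATION, pattern)
print(txn_add$LOCATION)

# A made-up narration with no start marker comes back as NA.
no_marker <- extract_location("$ CASH DEPOSIT 1234 $", pattern)
print(no_marker)
```

For the five sample narrations this should give the locations without surrounding spaces, for example "DELHIIN" for the fourth row.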

edited Jan 28 '21 at 08:57

answered Jan 28 '21 at 08:11

Thanks sir, this is running fine for some of the cases, but it is not able to handle cases like: "$ USHAHIGHWAYFILLINGSFARIDABADIN-23/08 /18 $ /5631 (Ref# $ VERIFICATION $"; for this the desired output is " USHAHIGHWAYFILLINGSFARIDABADIN" – Piyush Sharma Jan 29 '21 at 07:45
@PiyushSharma Try it like this `.*?(?|(?:[+@:]| AT(?!M\b))\s*([A-Z]+(?:\s+[A-Z]+)*)|\b([A-Z]+(?:\s+[A-Z]+)*)-\d).*` See https://ideone.com/fKlnyN – The fourth bird Jan 29 '21 at 10:33
@PiyushSharma Did that work out? – The fourth bird Jan 30 '21 at 10:41
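
For reference, the pattern suggested in the comments can be run in R roughly as sketched below. It assumes perl = TRUE, because the branch-reset group (?| ... ) is PCRE syntax. The variable names are illustrative, and the pattern is split with paste0 only so that each branch can carry a comment.

```r
# The narration from the comment that the first pattern could not handle.
narration <- "$ USHAHIGHWAYFILLINGSFARIDABADIN-23/08 /18 $ /5631 (Ref# $ VERIFICATION $"

# (?| ... | ... ) is a branch-reset group: the capture group in each
# alternative is numbered 1, so "\\1" works whichever branch matched.
pattern2 <- paste0(
  ".*?(?|",
  "(?:[+@:]| AT(?!M\\b))\\s*([A-Z]+(?:\\s+[A-Z]+)*)",  # branch 1: explicit start marker
  "|",
  "\\b([A-Z]+(?:\\s+[A-Z]+)*)-\\d",                      # branch 2: words followed by "-" and a digit
  ").*"
)

location <- sub(pattern2, "\\1", narration, perl = TRUE)
print(location)
```

The second branch has no start marker at all; it relies on the text being followed by a dash and a digit (here the date "-23/08"), which is why it picks up this narration.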
