I know it can't be perfect, but I'm not very good with regex and I'm having trouble getting a better match rate.
I have a file with over 9 million rows, and the addresses are very inconsistent. I was wondering if I could get some help from people here who are better at this than I am. Any help would be greatly appreciated.
This is what I have so far. I thought the best approach would be to match the pattern from the end of the string, since APT, BX, PO BOX, etc. could be at the start of the string.
/(\d+\-\d+\s+|\d+-\D+|APT\s\D|APT\s\d+|APT\s\D\d+|APT\s\D\s\d+|SPACE\s\d+|POBOX\s\d+|BX|UNIT\s\d+|\d+-\d+|\d+)\s(.+)\s{2,}(\D+)\s(\D{2})$/
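To keep track of the match rate while changing the pattern, a small harness along these lines might help. It is a Python sketch only: `match_rate` and the two sample rows are made up for illustration, and the pattern is the one above without the surrounding slashes. Note that a match only means the row fit the pattern, not that the captured groups are correct.

```python
import re

# The pattern from above, minus the surrounding / delimiters.
ADDRESS_RE = re.compile(
    r"(\d+\-\d+\s+|\d+-\D+|APT\s\D|APT\s\d+|APT\s\D\d+|APT\s\D\s\d+|SPACE\s\d+"
    r"|POBOX\s\d+|BX|UNIT\s\d+|\d+-\d+|\d+)\s(.+)\s{2,}(\D+)\s(\D{2})$"
)


def match_rate(pattern, lines):
    """Return (fraction of non-empty lines matched, list of unmatched lines)."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    misses = [row for row in rows if not pattern.search(row)]
    rate = 1 - len(misses) / len(rows) if rows else 0.0
    return rate, misses


# Hypothetical sample rows, not taken from the real file.
sample = [
    "123 MAIN ST      SPRINGFIELD IL",
    "P O BOX 45      SPRINGFIELD IL",
]
rate, misses = match_rate(ADDRESS_RE, sample)
print(f"matched {rate:.0%}; first misses: {misses[:5]}")
```

Passing an open file object instead of `sample` would work the same way, since the function only iterates over lines. Looking at the returned misses is usually the quickest way to spot the next pattern to add.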
These are the patterns I can see so far. The long runs of spaces are exactly as they appear in the file. I have tried splitting on two or more spaces as a separate step, as well as within the regex I have so far.
F_NAME L_NAME FOR F_NAME L_NAME ADDRESS ZIP CITY STATE
ADDRESS CITY STATE
ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S CITY STATE
APT # ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S CITY STATE
P O BOX # ADDRESS CITY STATE
APT DIGIT# ADDRESS CITY STATE
SPACE DIGIT ADDRESS CITY STATE
UNIT # ADDRESS CITY STATE
SP DIGIT ADDRESS CITY STATE
DIGITS-DIGITS ADDRESS CITY STATE
BX DIGIT ADDRESS CITY STATE
ADDRESS APT # CITY STATE
ADDRESS UNIT # CITY STATE
ADDRESS P O BOX DIGIT CITY STATE
P O B O X DIGIT CITY STATE
P O BOX DIGIT CITY STATE
ADDRESS SPACE/SP/SPC/UNIT DIGIT CITY STATE
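Given the patterns listed above, one possible alternative to a single large regex is to parse in stages: split on runs of two or more spaces, take the city and state from the end, then look for a unit or box designator at either end of what is left. The following is a rough Python sketch under several assumptions: the text is uppercase, the state is two letters, and at least two spaces separate the address from the city. The names `parse_line` and `split_unit` are hypothetical, the designator list is incomplete, and the rows with names, a ZIP, or a leading DIGITS-DIGITS prefix are not handled.

```python
import re

# Assumed designators, taken from the pattern list; extend as new ones turn up.
DESIGNATOR = r"(?:APT|UNIT|SPACE|SPC|SP|BX|P\s?O\s?B\s?O\s?X)"
# A unit value is a number (optionally with letters, e.g. 4B) or a letter plus digits.
VALUE = r"(?:\d[\w-]*|[A-Z]\d*)"
UNIT = rf"{DESIGNATOR}(?:\s*#\s*|\s+){VALUE}"

UNIT_AT_START = re.compile(rf"^(?P<unit>{UNIT})(?:\s+(?P<rest>.*))?$")
UNIT_AT_END = re.compile(rf"^(?:(?P<rest>.*?)\s+)?(?P<unit>{UNIT})$")
CITY_STATE = re.compile(r"^(?P<city>.+?)\s+(?P<state>[A-Z]{2})$")


def split_unit(street):
    """Pull an apartment/unit/box designator off either end of the street part."""
    for pattern in (UNIT_AT_START, UNIT_AT_END):
        m = pattern.match(street)
        if m:
            return m.group("unit"), (m.group("rest") or "").strip()
    return None, street


def parse_line(line):
    """Split one row into street, unit, city and state; None if it does not fit."""
    fields = re.split(r"\s{2,}", line.strip())
    # The state may share a field with the city ("CITY ST") or sit in its own field.
    if len(fields) >= 3 and re.fullmatch(r"[A-Z]{2}", fields[-1]):
        city, state, head = fields[-2], fields[-1], fields[:-2]
    else:
        m = CITY_STATE.match(fields[-1])
        if not m or len(fields) < 2:
            return None
        city, state, head = m.group("city"), m.group("state"), fields[:-1]
    unit, street = split_unit(" ".join(head))
    return {"street": street, "unit": unit, "city": city, "state": state}


# Hypothetical row, not taken from the real file.
print(parse_line("123 MAIN ST APT 4      SPRINGFIELD IL"))
```

Keeping the stages separate means rows that return `None` can be logged and inspected, and each stage can be tightened independently as new variations appear.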