Search for pattern in string and add characters if found

Question

I am working on some address cleaning/geocoding software, and I recently ran into a specific address format that is causing some problems for me.

My external geocoding module is having trouble finding addresses such as 30 w 60th new york (30 w 60th street new york is the proper format of the address).

Essentially what I would need to do is parse the string and check the following:

Are there any numbers followed by th, st, nd, or rd (plus a space following them)? E.g. 33rd, 34th, 21st, 24th.
If so, is the word street following it?

If yes, do nothing.

If no, add the word street immediately after the specific pattern.

Would regex be the best way to approach this situation?

Further Clarification: I am not having any issues with other address suffixes, such as avenue, road, etc etc etc. I have analyzed very large data sets (I'm running about 12,000 addresses/day through my application), and instances where street is left out is what is causing the biggest headaches for me. I have looked into address parsing modules, such as usaddress, smartystreets, and others. I really just need to come up with a clean (hopefully regex?) solution to the specific problem that I have described.

I'm thinking something along the lines of:

Converting the string to a list.
Find the index of the element in the list that meets the criteria that I've explained.
Check to see if the next element is street. If so, do nothing.
If not, reconstruct the list as list[:index + 1] + ['street'] + list[index + 1:]. (index would be the position of the target word, i.e. 47th or whatever is in the string.)
Join the list back into a string.
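
The steps above can be sketched roughly as follows. This is only a minimal sketch under a few assumptions: the function name `add_street_by_tokens` and the `suffixes` parameter are made up for illustration, addresses are assumed to be lowercase, and numbers are assumed to be 1 to 4 digits long (as mentioned in the comments). Note that splitting and re-joining also collapses repeated whitespace.

```python
import re

# A whole token such as "60th", "21st", "2nd" or "1643rd" (1-4 digits assumed).
ORDINAL_TOKEN = re.compile(r'^\d{1,4}(?:st|nd|rd|th)$')


def add_street_by_tokens(address, suffixes=('street',)):
    """Hypothetical helper: insert 'street' after the first ordinal token."""
    tokens = address.split()
    for i, token in enumerate(tokens):
        if ORDINAL_TOKEN.match(token):
            # Only the first ordinal token is considered.
            is_last = i + 1 >= len(tokens)
            if is_last or tokens[i + 1] not in suffixes:
                tokens.insert(i + 1, 'street')
            break
    return ' '.join(tokens)
```

Passing something like `suffixes=('street', 'avenue', 'road')` would leave addresses with those words after the number untouched.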

I'm not exactly the best with regex, so I'm looking for some input.

Thanks.

Yes those are scenarios that are causing errors too. I should have included that. There are just more instances with `th`, but I would definitely need to handle those cases as well. — Harrison, Aug 10 '16 at 18:26
Don't try to do this by yourself. There is only suffering and heartache down this road. You've already seen the problem with only accounting for "th" when "st", "nd", and "rd" have to be accounted for, but what about only talking about "street" when you could also have "avenue", "boulevard", "way", "road", and a thousand others plus all the (possibly misspelled and abbreviated) variations? Get a nice address validator like the post office provides and let _it_ give you suggestions for replacements. — Two-Bit Alchemist, Aug 10 '16 at 18:28
From the inputs that I am getting `street` is the only address suffix that I have to account for. Frequently customers are inputting things such as `10 east 42nd` and it's assumed they mean `10 east 42nd street`. I've been running my application on roughly 6,000 different addresses, twice per day, and from the data, cases where `street` is left out is the only thing giving me a major headache. — Harrison, Aug 10 '16 at 18:31

score 2 · Accepted Answer · edited Jul 27 '20 at 10:48

It seems that you're looking for a regexp. =P

Here is some code I built specially for you:

import re


def check_th_add_street(address):
    # compile regexp rule
    has_th_st_nd_rd = re.compile(r"(?P<number>[\d]{1,3}(st|nd|rd|th)\s)(?P<following>.*)")
    
    # first check if the address has number followed by something like 'th, st, nd, rd'
    has_number = has_th_st_nd_rd.search(address)
    if has_number is not None:
        # then check if not followed by 'street'
        if re.match('street', has_number.group('following')) is None:
            # then add the 'street' word
            new_address = re.sub(r'(?P<number>[\d]{1,3}(st|nd|rd|th)\s)', r'\g<number>street ', address)
            return new_address
        else:
            return True # the format is good (followed by 'street')
    else:
        return True # there is no number like 'th, st, nd, rd'

I'm a Python learner, so please let me know whether it solves your issue.

Tested on a small list of addresses.

Hope it helps or leads you to a solution.

Thank you !

EDIT

Improved to leave the address alone if the number is followed by "avenue" or "road" as well as "street":

import re


def check_th_add_street(address):
    # compile regexp rule
    has_th_st_nd_rd = re.compile(r'(?P<number>[\d]{1,3}(th|st|nd|rd)\s)(?P<following>.*)')

    # first check if the address has number followed by something like 'th, st, nd, rd'
    has_number = has_th_st_nd_rd.search(address)
    if has_number is not None:
        # check if followed by "avenue" or "road" or "street"
        if re.match(r'(avenue|road|street)', has_number.group('following')):
            return True # do nothing
        # else add the "street" word
        else:
            # then add the 'street' word
            new_address = re.sub(r'(?P<number>[\d]{1,3}(st|nd|rd|th)\s)', r'\g<number>street ', address)
            return new_address
    else:
        return True # there is no number like 'th, st, nd, rd'

RE-EDIT

I made some improvements for your needs and added a usage example:

import re


# build the original address list, including badly formatted entries
address_list = [
    '30 w 60th new york',
    '30 w 60th new york',
    '30 w 21st new york',
    '30 w 23rd new york',
    '30 w 1231st new york',
    '30 w 1452nd new york',
    '30 w 1300th new york',
    '30 w 1643rd new york',
    '30 w 22nd new york',
    '30 w 60th street new york',
    '30 w 60th street new york',
    '30 w 21st street new york',
    '30 w 22nd street new york',
    '30 w 23rd street new york',
    '30 w brown street new york',
    '30 w 1st new york',
    '30 w 2nd new york',
    '30 w 116th new york',
    '30 w 121st avenue new york',
    '30 w 121st road new york',
    '30 w 123rd road new york',
    '30 w 12th avenue new york',
    '30 w 151st road new york',
    '30 w 15th road new york',
    '30 w 16th avenue new york'
]


def check_th_add_street(address):
    # compile regexp rule
    has_th_st_nd_rd = re.compile(r'(?P<number>[\d]{1,4}(th|st|nd|rd)\s)(?P<following>.*)')

    # first check if the address has number followed by something like 'th, st, nd, rd'
    has_number = has_th_st_nd_rd.search(address)
    if has_number is not None:
        # check if followed by "avenue" or "road" or "street"
        if re.match(r'(avenue|road|street)', has_number.group('following')):
            return address # return original address
        # else add the "street" word
        else:
            new_address = re.sub(r'(?P<number>[\d]{1,4}(st|nd|rd|th)\s)', r'\g<number>street ', address)
            return new_address
    else:
        return address # there is no number like 'th, st, nd, rd' -> return original address


# initialisation of the new list
new_address_list = []

# build the new clean list
for address in address_list:
    new_address_list.append(check_th_add_street(address))
    # or you could use it straight here, i.e.:
    # address = check_th_add_street(address)
    # print(address)

# use the new list to do your work
for address in new_address_list:
    print("Formatted address is : %s" % address) # or whatever you want to do with 'address'

Will output:

Formatted address is : 30 w 60th street new york
Formatted address is : 30 w 60th street new york
Formatted address is : 30 w 21st street new york
Formatted address is : 30 w 23rd street new york
Formatted address is : 30 w 1231st street new york
Formatted address is : 30 w 1452nd street new york
Formatted address is : 30 w 1300th street new york
Formatted address is : 30 w 1643rd street new york
Formatted address is : 30 w 22nd street new york
Formatted address is : 30 w 60th street new york
Formatted address is : 30 w 60th street new york
Formatted address is : 30 w 21st street new york
Formatted address is : 30 w 22nd street new york
Formatted address is : 30 w 23rd street new york
Formatted address is : 30 w brown street new york
Formatted address is : 30 w 1st street new york
Formatted address is : 30 w 2nd street new york
Formatted address is : 30 w 116th street new york
Formatted address is : 30 w 121st avenue new york
Formatted address is : 30 w 121st road new york
Formatted address is : 30 w 123rd road new york
Formatted address is : 30 w 12th avenue new york
Formatted address is : 30 w 151st road new york
Formatted address is : 30 w 15th road new york
Formatted address is : 30 w 16th avenue new york

RE-RE-EDIT

The final function, with the count parameter added to re.sub():

def check_th_add_street(address):
    # compile regexp rule
    has_th_st_nd_rd = re.compile(r'(?P<number>[\d]{1,4}(th|st|nd|rd)\s)(?P<following>.*)')

    # first check if the address has number followed by something like 'th, st, nd, rd'
    has_number = has_th_st_nd_rd.search(address)
    if has_number is not None:
        # check if followed by "avenue" or "road" or "street"
        if re.match(r'(avenue|road|street)', has_number.group('following')):
            return address # do nothing
        # else add the "street" word
        else:
            # then add the 'street' word
            new_address = re.sub(r'(?P<number>[\d]{1,4}(st|nd|rd|th)\s)', r'\g<number>street ', address, 1) # the last parameter is the maximum number of pattern occurrences to be replaced
            return new_address
    else:
        return address # there is no number like 'th, st, nd, rd'
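
For comparison, here is a more compact variant of the same idea. It is only a sketch, not the function above: the names `ORDINAL`, `STREET_TYPE` and `add_street_once` are made up for illustration. It looks at the first ordinal only, uses `\b` word boundaries instead of a trailing `\s` (so an ordinal at the very end of the string is also handled), and assumes lowercase input.

```python
import re

# First ordinal number in the address: 1-4 digits plus st/nd/rd/th as a whole word.
ORDINAL = re.compile(r'\b\d{1,4}(?:st|nd|rd|th)\b')
# A street type directly following the ordinal.
STREET_TYPE = re.compile(r'\s+(?:avenue|road|street)\b')


def add_street_once(address):
    """Hypothetical helper: add 'street' after the first ordinal if no type follows."""
    first = ORDINAL.search(address)
    if first is None:
        return address  # no ordinal number at all
    # Pattern.match(string, pos) starts matching at the end of the ordinal.
    if STREET_TYPE.match(address, first.end()):
        return address  # already followed by avenue/road/street
    return address[:first.end()] + ' street' + address[first.end():]
```

Because only the first match of `ORDINAL.search()` is used, a later ordinal such as the `2nd` in `30 w 60th 2nd floor` is left alone.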

Hmm. Does the `{1,3}` denote that it will handle numbers with a length of up to 3? Because it will have to handle numbers up to 4 digits long followed by th,st,nd, or rd. Also, I wouldn't use this as its own function. I'd just pass each address through this 1 by 1. What would the return statements change to? I want the address to always remain with the name `address`. — Harrison, Aug 10 '16 at 23:04
Yes, the `{1,3}` handles numbers with a length from 1 to 3. You can easily adapt it to `{1,4}`. — JazZ, Aug 11 '16 at 07:44
See the edited answer for a usage example and the fix to handle 4 digits. — JazZ, Aug 11 '16 at 07:51
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/120727/discussion-between-harrison-and-adrien-leber). — Harrison, Aug 11 '16 at 17:59
How can I make it so that it only matches the first occurrence of [digits]th/st/nd/rd? There are some addresses such as `30 w 60th 2nd floor`. In cases like that I only want it to work with the 60th, not the 2nd. It needs to somehow ignore all occurrences besides the first one. Is that possible? — Harrison, Aug 11 '16 at 18:17
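
The two points raised in these comments (the `{1,3}` quantifier and replacing only the first occurrence) can be checked with a small experiment. This is just an illustrative sketch; the variable names are made up, and the patterns are simplified versions of the ones in the answer.

```python
import re

up_to_three = re.compile(r'\d{1,3}(?:st|nd|rd|th)')
up_to_four = re.compile(r'\d{1,4}(?:st|nd|rd|th)')

# fullmatch() requires the whole string to fit the pattern.
fits_three = up_to_three.fullmatch('1300th') is not None
fits_four = up_to_four.fullmatch('1300th') is not None

# search() is unanchored, so {1,3} can still find the tail of a longer number.
partial = up_to_three.search('1300th').group()

# Without count every ordinal is rewritten; count=1 stops after the first one.
ordinal = re.compile(r'(\d{1,4}(?:st|nd|rd|th)\s)')
all_replaced = ordinal.sub(r'\1street ', '30 w 60th 2nd floor')
first_only = ordinal.sub(r'\1street ', '30 w 60th 2nd floor', count=1)
```

Here `fits_three` is False and `fits_four` is True, while `partial` is `'300th'`. `all_replaced` ends up as `30 w 60th street 2nd street floor`, and `first_only` as `30 w 60th street 2nd floor`.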

Darth Futuza · Answer 2 · 2018-08-29T19:58:08.697

1

While you could certainly use regex for this sort of problem, I can't help but think that there's most likely a Python library out there that has already solved this problem for you. I've never used these, but some quick searching turns up the following:

https://github.com/datamade/usaddress

https://pypi.python.org/pypi/postal-address

https://github.com/SwoopSearch/pyaddress

PyParsing also has an address sample here you might look at: http://pyparsing.wikispaces.com/file/view/streetAddressParser.py

You might also take a look at this former question: is there a library for parsing US addresses?

Any reason you can't just use a 3rd party library to solve the problem?

edit: Pyparsing moved their url: https://github.com/pyparsing/pyparsing

edited Aug 29 '18 at 19:58

answered Aug 10 '16 at 18:33

Darth Futuza

I've looked into these. These external libraries aren't very accurate because of the poor quality of our data (addresses). I work for an international logistics company. Because different countries use different address formats, the quality of the addresses we get is usually very poor, mainly due to people not being familiar with the formats. I have to do a lot of cleaning and address formatting before I can send it to my geolocating module. – Harrison Aug 10 '16 at 18:36
Just how bad are the average data sets? Your example in your original question is a fairly common issue most of these libraries should be able to handle. – Darth Futuza Aug 10 '16 at 18:40
I gave an extremely simplified instance. I've been working on this project for just over 2 months. It's a large scale application. I've looked into all of those external modules related to addresses. I just need to handle these cases. My application is functioning with about 96% accuracy with data sets of ~6,000. These instances are what are causing me issues though. – Harrison Aug 10 '16 at 18:42
In that case you may as well go ahead and just write out the regex. It'll need to be long and complicated to handle all edge cases, but I don't really think you're going to find a better way than just writing the regex, especially if you want higher than 96% accuracy. I also like the suggestions from the accepted answer here if you haven't already looked: http://stackoverflow.com/questions/16413/parse-usable-street-address-city-state-zip-from-a-string – Darth Futuza Aug 10 '16 at 18:54
Be warned: if you go the regex route, you're going to be looking at something like [this](http://usaddress.codeplex.com/SourceControl/changeset/view/8750#159596). Also, the top answer here has some good points: http://stackoverflow.com/questions/11160192/how-to-parse-freeform-street-postal-address-out-of-text-and-into-components Parsing addresses is hard, and I'm not sure you're going to get a simple solution to this. – Darth Futuza Aug 10 '16 at 18:58
It's not so much parsing an address as expected though. It's just matching the pattern of `numbers followed by th/st/rd/nd` and checking if the next word == 'street'. – Harrison Aug 10 '16 at 19:02
Pyparsing is no longer hosted on wikispaces.com. Go to https://github.com/pyparsing/pyparsing – PaulMcG Aug 27 '18 at 13:12

score 0 · Answer 3 · answered Aug 10 '16 at 18:33

You could possibly do this by turning each one of those strings into lists, and looking for certain groups of characters in those lists. For example:

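def check_th(address):
    # scan for "th" and return the two-digit number right before it
    addressList = list(address)
    for charIndex, character in enumerate(addressList):
        # enumerate gives the position of this character; list.index()
        # would always return the position of the first 't' in the address
        if character == 't' and addressList[charIndex + 1:charIndex + 2] == ['h']:
            numberList = addressList[max(charIndex - 2, 0):charIndex]
            # skip words such as "north" where "th" does not follow two digits
            if len(numberList) == 2 and all(x.isdigit() for x in numberList):
                return int(''.join(numberList))
    return None # no two-digit number followed by "th" was found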

This looks very messy, but it should get the job done, as long as the number is two digits long. However, if there are many things you need to look for, you should probably look for a more convenient and simpler way to do this.

answered Aug 10 '16 at 18:33

Sid

You could also modify this general format to get numbers of other length, or the word "street". – Sid Aug 10 '16 at 18:34
The thing is, the numbers won't always be 2 digits long. They can range from 1-4 in length. – Harrison Aug 10 '16 at 18:37
@Harrison Then, you should find the "th", "st", "nd", or "rd", and find the length of the number from the distance to the nearest space before it. – Sid Aug 10 '16 at 18:40
@Harrison However, I would recommend trying to find a cleaner solution than this; it's not that this won't work, it's just that it will get messy and complex very quickly. – Sid Aug 10 '16 at 18:41
I'm thinking something along the lines of converting the string to a list and using regex to find the position of, for example, `42nd`, then checking to see if the element in the next index is `street`, and if it's not, reconstructing the list and adding `street` after it, then joining it back into a string. – Harrison Aug 10 '16 at 18:51
@Harrison You can use listName.index(character) to find the index, and if the next set of seven characters is not " street" (notice the space, it is important to leave that in), then a new string could be created that appends the characters ' ', 's', 't', 'r', 'e', 'e', 't' into the list after the number, and then continues with the rest of the previous list. – Sid Aug 10 '16 at 18:58
How could I find the index? I can't just search for the index of "th " because there are other words that end in `th`. The part I'm not sure about is matching the pattern of `numbers followed by th/st/nd/rd`. – Harrison Aug 10 '16 at 19:00
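
One possible way to get that index without being fooled by words like "north" is to let a regex require digits directly before the suffix and then read the position from the match object. This is a minimal sketch; `find_ordinal_span` is a made-up name, and the 1 to 4 digit limit is taken from the comments above.

```python
import re

# Digits immediately followed by st/nd/rd/th, as a whole word.
ORDINAL = re.compile(r'\b\d{1,4}(?:st|nd|rd|th)\b')


def find_ordinal_span(address):
    """Hypothetical helper: (start, end) of the first ordinal number, or None."""
    match = ORDINAL.search(address)
    if match is None:
        return None
    # start() is the index of the first digit, end() is one past the suffix.
    return match.start(), match.end()
```

With the returned pair, `address[start:end]` is the ordinal itself (for example `60th`), and `address[end:]` is whatever follows it, which can then be compared against " street".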

Sid · Answer 4 · 2016-08-10T19:24:02.900

To check and add the word street, the following function should work as long as the street number comes before its name:

def check_add_street(address):
    suffixes = ('th', 'st', 'nd', 'rd')
    newIndex = None

    # find the first ordinal suffix that directly follows a digit;
    # scanning by position avoids list.index(), which always returns
    # the first occurrence of a character
    for i in range(1, len(address) - 1):
        if address[i:i + 2] in suffixes and address[i - 1].isdigit():
            newIndex = i + 1  # position of the last character of the suffix
            break

    # no ordinal number found: return the address unchanged
    if newIndex is None:
        return address

    rest = address[newIndex + 1:]

    # already followed by the word "street": nothing to add
    if rest == ' street' or rest.startswith(' street '):
        return address

    # otherwise insert " street" right after the suffix
    return address[:newIndex + 1] + ' street' + rest

This will add the word "street" if it is not already present, given that the format that you gave above is consistent.

Wow this looks really good. Will this work for any length of numbers? Also, the format might not always be exactly the same. That's why I think regex might be the best way to solve this. — Harrison, Aug 10 '16 at 19:21
@Harrison It should work for any number length, as long as the first 'st', 'nd', 'rd', or 'th' that shows up is the one that you want (i.e. the one after the number). As long as you follow one of the two formats you posted in the question, it should work. — Sid, Aug 10 '16 at 19:27
