0

I know there are many questions asked on this topic. I am trying to parse and fetch street addresses from HTML pages. The format of these pages does not follow any pattern. Can someone help me come up with a regex that would match a street address, irrespective of the number of tags between its parts? Are there any other ways to do this besides regular expressions?

Nemin
  • 1,607
  • 6
  • 22
  • 37

2 Answers

2

Before you get all traditional, let me share my experience. I've parsed over 1 million web pages this way in Java. When I need small pieces out of a page, a regex works well when paired with a replace to strip tags. In my experience it is also faster, especially when using Java's replaceAll() function to strip tags. Build a fork-join pool for both approaches and test some parsing; you won't believe your eyes. I've added the tag-stripping part at the end. This is not the full regex but a starting point, since building one takes some trial and error. As I understand the question, you have a bunch of pages with no clear route to the address.

So, yes, there are ways. What follows is a bit of an introduction to thinking about this in regex.

Words and groups of words always follow some pattern; otherwise they wouldn't be readable. Still, there are several things to note. Addresses can vary greatly, so it is important to keep building out your regex as you meet new formats. Also, if you have access to a CAS engine, run everything you extract through it. It standardizes your addresses.

First, have you tried XML parsing? It will narrow everything down and can help get rid of tags before you format. You need to narrow everything. If you are using Java or Python, run this step in a ForkJoinPool or a multiprocessing Pool.

Your process should be:

  1. Narrow if possible
  2. Execute a regex that exploits formatting
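As a rough illustration of running the narrowing step in a ForkJoinPool, here is a minimal sketch. The class name, the `narrow` helper (which simply strips tags and collapses whitespace), and the pool size are all assumptions made for the example, not part of the original answer; a real narrowing step would likely be site-specific.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.Future;

class ParallelNarrowing {
    // Hypothetical per-page narrowing step: drop tags, collapse whitespace.
    static String narrow(String html) {
        return html.replaceAll("<.*?>", " ").replaceAll("\\s+", " ").trim();
    }

    // Narrows every page on a ForkJoinPool; results keep the input order.
    static List<String> narrowAll(List<String> pages, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String page : pages) {
                tasks.add(() -> narrow(page));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Whether this beats a single thread depends on page size and count, so it is worth measuring on your own data.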

Lastly, here is a regex cheat sheet.

Keep in mind that I don't know what websites you are using or their formats. I have personally had to pull this data with different per-site regexes, but that was for odd formats and other issues with websites that run like databases of a certain variety.

That said, an address typically consists of a number, then a street name and apartment number made up of pretty much anything, then city, state, and zip code. Basically it is \d+ followed by any combination of letters and numbers.

So (in Java, with double backslashes) to start you off:

[\\d]+[A-Za-z0-9\\s,\\.]+
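A minimal sketch of how this starter pattern might be applied with `java.util.regex`. The class and method names are made up for illustration, and the sample text is invented. Note that the pattern is deliberately loose: it matches any digit run followed by letters, digits, whitespace, commas, or periods, so expect false positives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AddressCandidateFinder {
    // Starter pattern from the answer: digits, then letters/digits/space/comma/period.
    static final Pattern CANDIDATE = Pattern.compile("[\\d]+[A-Za-z0-9\\s,\\.]+");

    // Returns every candidate match, trimmed of surrounding whitespace.
    static List<String> findCandidates(String text) {
        List<String> candidates = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            candidates.add(m.group().trim());
        }
        return candidates;
    }

    public static void main(String[] args) {
        String text = "Visit us at 123 Main St, Springfield, IL 62704 (open daily)";
        // prints [123 Main St, Springfield, IL 62704]
        System.out.println(findCandidates(text));
    }
}
```

The match stops at the opening parenthesis only because `(` is not in the character class; plain prose following an address would be swallowed too.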

If you are not using XML and want to narrow your search by anchoring on surrounding markers while excluding them from the match, use:

(?<=start)[\\d]+[A-Za-z0-9\\s,\\.]+?(?=end)

HTML pages always seem to have tags, so that would be something like:

(?<=>)[\\d]+[A-Za-z0-9\\s,\\.]+?(?=<) 
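To see what the tag-bounded version does, here is a small sketch under assumed names and an invented HTML snippet. The lookbehind `(?<=>)` requires the match to start right after a `>`, and the lookahead `(?=<)` stops the lazy match at the next `<`, so only text nodes that begin with a digit are picked up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TextNodeAddressFinder {
    // Pattern from the answer: digit-led text sitting directly between '>' and '<'.
    static final Pattern BETWEEN_TAGS =
            Pattern.compile("(?<=>)[\\d]+[A-Za-z0-9\\s,\\.]+?(?=<)");

    static List<String> find(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = BETWEEN_TAGS.matcher(html);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<div><p>123 Main St, Springfield, IL 62704</p>"
                + "<p>Call 555-1234</p></div>";
        // The phone number is skipped because its text node starts with "Call".
        System.out.println(find(html));
    }
}
```

One limitation worth keeping in mind: an address split across several elements (for example by a line-break tag) will not match as a single unit, because the lazy match cannot cross a `<`.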

You may be able to use the zip code as the end of the match, particularly if it is a multi-part zip code.

[\\d]+[A-Za-z0-9\\s,\\.]+?[\\d\\-]+
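The sketch below compares this pattern with a stricter variant. The stricter one is an assumption added for illustration: it requires the match to end on a US-style five-digit ZIP with an optional four-digit extension, which will not suit other countries. Class name and sample text are also invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZipAnchoredFinder {
    // The answer's pattern: ends on any run of digits and hyphens.
    static final Pattern LOOSE =
            Pattern.compile("[\\d]+[A-Za-z0-9\\s,\\.]+?[\\d\\-]+");
    // Assumed stricter variant: must end on a US ZIP or ZIP+4.
    static final Pattern STRICT =
            Pattern.compile("\\d+[A-Za-z0-9\\s,\\.]+?\\d{5}(?:-\\d{4})?");

    static List<String> findAll(Pattern pattern, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Office: 123 Main St, Springfield, IL 62704-1234. Phone 555 0100";
        System.out.println(findAll(LOOSE, text));  // also picks up "555 0100"
        System.out.println(findAll(STRICT, text)); // only the address
    }
}
```

The loose form accepts any two digit runs separated by allowed characters, which is why the phone fragment slips through; anchoring on a concrete ZIP shape trades recall for precision.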

As a final note, you can chain regexes together as alternatives with a pipe delimiter, e.g.:

(?<=start)[\\d]+[A-Za-z0-9\\s,\\.]+?[\\d\\-]+|(?<=start)[A-Za-z0-9\\s,\\.]+?(?=end)

If this is not narrow enough there are several additional steps:

  1. Compare your results (average word length, etc.) and throw out any significant outliers.
  2. Write a formatter script per site to do cleanup, using single or multiple threads to replace what you don't need.

You will probably need to strip out the HTML as well. Run this regex in a replace statement to do that:

<.*?>
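A short sketch of that replace in Java, with assumed class and method names. Two choices here are mine rather than the answer's: tags are replaced with a space (so words on either side of a tag do not fuse together), and a second variant adds the `(?s)` flag, because by default `.` does not match a newline and a tag that spans lines would otherwise survive.

```java
class TagStripper {
    // The answer's regex as-is; misses tags that contain a line break.
    static String stripTags(String html) {
        return html.replaceAll("<.*?>", " ").replaceAll("\\s+", " ").trim();
    }

    // Variant with DOTALL enabled so '.' also matches newlines inside a tag.
    static String stripTagsDotAll(String html) {
        return html.replaceAll("(?s)<.*?>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<a\nhref='x'>123 <b>Main</b> St</a>";
        System.out.println(stripTags(html));       // the multi-line <a ...> tag is left behind
        System.out.println(stripTagsDotAll(html)); // 123 Main St
    }
}
```

Neither variant understands comments, scripts, or a `>` inside an attribute value, so treat this as a quick cleanup for small fragments rather than a general HTML parser.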

If you have trouble, use something like my regex tester (the website is not my own) to build your regex.

Andrew Scott Evans
  • 893
  • 11
  • 24
  • Unfortunately this answer, though detailed, makes *a lot* of assumptions. – Matt Dec 07 '13 at 04:38
  • Also, a public service announcement: [please don't use a regex to strip HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)... – Matt Dec 07 '13 at 04:48
  • ... 'course now I feel bad because I didn't talk about why this answer is good! And it's Christmastime! I'm not going to be that grinch. So: your answer is basically spot-on until the 6th paragraph where you start describing address formats. That's where things get complicated. But your advice about using XML and threading can be very effective for more accurate, efficient results, when done properly. And props for the effort in describing the regex thinking process. – Matt Dec 07 '13 at 06:37
  • Sorry to say it, ladies and gentlemen, but Java is here to save the day. Over one million pages served. I've been building similar regexes for a while, though I would need to study addresses a bit to build something more proper. The answer is simple: use java.net or java.nio and let your regexes rip on the strings. It was actually running faster than XML and 100% efficient. The old XML v. HTML thing is pretty dead at such a small level, I guess. It knocked my skepticism out of the park. I've parsed over 1 million pages no problem this way. My job in part is to build databases based on web data. – Andrew Scott Evans Dec 07 '13 at 08:48
2

Having worked on this problem quite extensively at SmartyStreets, I will tell you "NO" to parsing/finding street addresses with a regex.

Addresses are not a regular language and cannot be matched by a regular expression.

To solve the problem, we developed an API which actually finds and extracts addresses, with notably high accuracy. It's free for low-volume use. (It was not an easy problem to solve.) You can try it for free on the homepage demo. And no, this is not a solicitation. If you want to learn more about street addresses in any amount of detail from very basic to very technical, just email us because we want to educate the community about addresses.

To extract addresses, there are regular expressions under the hood, but results are biased strongly toward those which actually verify, meaning which actually exist. In other words, this is a parser performing complex operations to find and match addresses.

This answer to a very similar question is related, and you may find it useful. The other answers highlight some important points about the difficulties and solutions for parsing street addresses...


Martijn Pieters
  • 889,049
  • 245
  • 3,507
  • 2,997
Matt
  • 19,570
  • 12
  • 62
  • 104