
Is there a library that specializes in parsing such data?

brian d foy

3 Answers


You could use something like Google Maps. Geocode the address and, if successful, Google's API will return an XML representation of the address with all of the elements separated (and corrected or completed).

EDIT:

I'm being voted down and not sure why. Parsing addresses can be a little difficult. Here's an example of using Google to do this:

http://blog.nerdburn.com/entries/code/how-to-parse-google-maps-returned-address-data-a-simple-jquery-plugin

I'm not saying this is the only way or necessarily the best way. Just a way to parse addresses on a web site.
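For what it's worth, here is a rough sketch in Perl of what the geocoding call might look like. The endpoint, the key parameter, and the exact response shape are assumptions on my part, so check Google's current documentation and terms of service before relying on them:

    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use URI::Escape qw(uri_escape);
    use XML::Simple qw(XMLin);
    use Data::Dumper;

    # A made-up address standing in for one scraped from a page
    my $address = '1600 Amphitheatre Parkway, Mountain View, CA 94043';

    # Endpoint and 'key' parameter are assumptions; check Google's docs
    my $url = 'https://maps.googleapis.com/maps/api/geocode/xml'
            . '?address=' . uri_escape($address)
            . '&key=YOUR_API_KEY';

    my $xml = get($url) or die "geocoding request failed\n";
    my $response = XMLin($xml);

    if ( $response->{status} eq 'OK' ) {
        # Google may return several matches; take the first
        my $best = ref $response->{result} eq 'ARRAY'
                 ? $response->{result}[0]
                 : $response->{result};
        print "Matched: $best->{formatted_address}\n";

        # The address comes back split into named components
        # (street number, route, locality, postal code, ...)
        print Dumper( $best->{address_component} );
    }

The nerdburn link above shows the same idea done client-side with jQuery.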

marcc
  • Up-voted you. Treating "in-the-cloud" services like the Google Maps API as a library (which is what the poster asked for) is valid, IMHO. – Chris Simmons Sep 10 '09 at 04:46
  • maybe the downvotes are for not addressing getting the addresses from the html page in the first place? just a guess. – ysth Sep 11 '09 at 06:38
  • Probably downvoted (I didn't, by the way) because it's against Google's TOS to do that unless you are displaying a map to the user. – Matt Feb 28 '12 at 06:06

There are two parts to this: extracting the complete address from the page, and parsing that address into something you can use (storing the various parts in a database, for example).

For the first part you will need a heuristic, most likely country-dependent: for US addresses, [A-Z][A-Z],?\s*\d\d\d\d\d should give you the end of an address, provided the two letters turn out to be a state. Finding the beginning of the string is left as an exercise.
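A minimal sketch of that heuristic follows; the sample text is invented, and a real version would also check the captured letters against the actual list of state codes:

    use strict;
    use warnings;

    my $text = 'Visit us at 123 Main St, Springfield, IL 62701 today!';

    # Two capitals (a candidate state code), an optional comma,
    # then a five-digit ZIP marks the likely end of an address
    if ( $text =~ /\b([A-Z][A-Z]),?\s*(\d\d\d\d\d)\b/ ) {
        my ( $state, $zip ) = ( $1, $2 );

        # $-[0] is the offset where the match starts; scanning
        # backwards from there to find where the address begins
        # is the exercise mentioned above
        print "candidate state: $state, ZIP: $zip, match at $-[0]\n";
    }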

The second part can be done either through a call to Google Maps or, as usual in Perl, with a CPAN module: Lingua::EN::AddressParse (test it on your data to see whether it works well enough for you).
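Here is a quick sketch of the CPAN route; the constructor options and the exact component names should be double-checked against the module's documentation for your installed version:

    use strict;
    use warnings;
    use Lingua::EN::AddressParse;
    use Data::Dumper;

    my $parser = Lingua::EN::AddressParse->new(
        country    => 'US',
        auto_clean => 1,
    );

    # parse() returns a non-zero error status on failure
    my $error = $parser->parse('123 Main St Springfield IL 62701');

    if ( $error == 0 ) {
        # components() breaks the address into named fields
        my %parts = $parser->components;
        print Dumper( \%parts );
    }
    else {
        warn "could not parse that address\n";
    }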

In any case this is a difficult task, and you will most likely never get it 100% right, so plan for manually checking the addresses before using them.

mirod

You don't need regular expressions (yet) or a general parser like pyparsing (at all). Look at something like Beautiful Soup, which will parse even bad HTML into something like a tree of tags. From there, look at the page source to find out which tags to drill down through to reach the data, then use Beautiful Soup's search methods (recent versions also support CSS selectors) to find those nodes and loop directly over the tags you're interested in. Once you have the raw text, a quick regex can pull out the pieces. This will be more flexible and more future-proof, and possibly less head-exploding, than trying to do the whole job in pure regular expressions.
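Since the rest of this thread leans Perl, here is the same idea sketched with HTML::TreeBuilder rather than Beautiful Soup. The tag name and class are invented for illustration; inspect the real page source to see what actually wraps the addresses:

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    # Slurp the page source (however you fetched it) and parse
    # even sloppy HTML into a tree of element nodes
    my $html = do { local $/; <STDIN> };
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # The tag and class here are made up for illustration
    for my $node ( $tree->look_down( _tag => 'div', class => 'address' ) ) {
        my $text = $node->as_text;

        # Final cleanup with a quick regex, as suggested above
        print "$text\n" if $text =~ /\d\d\d\d\d/;
    }

    $tree->delete;   # HTML::Element trees need explicit cleanup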

Lee B