I've been tasked with building a parser for a particular web page, so that our employees can bulk-import user data into the web site a customer has with our company.
I've used HtmlAgilityPack to parse the page, and I've mapped each table row and table data cell to properties in my Map class.
However, one column is causing me a lot of grief: the Address column is the thorn in my side for an assortment of reasons.
Sample Data:
6313 SW 203rd Ave <br> Portland, OR 97224
16600 Lomita Way <br> El Dorado Hills, CA 95762
PO Box #42 <br> Hampton Bays, NY 11946
Each of those addresses is wrapped like so (obviously the addresses vary with the customer whose users we are importing):
<tr>
<td> 6313 SW 203rd Ave <br> Portland, OR 97224 </td>
</tr>
I'm trying to implement a regular expression that splits this in the right places, so that each part can be assigned to the corresponding property:
public string Unit { get; set; }
public string Street { get; set; }
public string City { get; set; }
public string State { get; set; }
public string Zip { get; set; }
However, the addresses don't provide much to anchor on:
Issue One: If I anchor on the <br>, I'm only separating the two lines; it doesn't fully split them into the proper segments.
Issue Two: Anchoring on the comma alone has the same problem.
Issue Three: If I anchor on numeric values for the Zip, the pattern may be invalid for Canada and may split incorrectly depending on the street name.
What is the best way to separate items for an address? With Regex?
`, detect that and use as indicator of street address, then take the rest, split in `,` to get your city, then space to get state and zip? Obviously your first part of * * will give you the unit and street by taking first space. – Jason Apr 25 '14 at 20:09
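The steps suggested in the comment could be sketched roughly as follows. This is only a minimal illustration, not a complete address parser. `ParsedAddress` and `AddressSplitter` are hypothetical names introduced here (the five properties mirror the ones in the question), and treating the first token as the unit only when it starts with a digit is an assumption added so that lines like `PO Box #42` stay intact.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical holder for the five properties listed in the question.
public class ParsedAddress
{
    public string Unit { get; set; } = "";
    public string Street { get; set; } = "";
    public string City { get; set; } = "";
    public string State { get; set; } = "";
    public string Zip { get; set; } = "";
}

public static class AddressSplitter
{
    // Matches <br>, <br/> and <br /> in any letter case.
    private static readonly Regex LineBreak =
        new Regex(@"<br\s*/?>", RegexOptions.IgnoreCase);

    // cellHtml is the inner HTML of one <td>, e.g.
    // " 6313 SW 203rd Ave <br> Portland, OR 97224 ".
    public static ParsedAddress Split(string cellHtml)
    {
        var address = new ParsedAddress();

        // Step 1: split on the first <br> into street line and city line.
        string[] lines = LineBreak.Split(cellHtml.Trim(), 2);
        string streetLine = lines[0].Trim();
        string cityLine = lines.Length > 1 ? lines[1].Trim() : "";

        // Step 2: the first space separates unit from street.
        // Assumption: only do this when the line starts with a digit.
        int firstSpace = streetLine.IndexOf(' ');
        if (firstSpace > 0 && char.IsDigit(streetLine[0]))
        {
            address.Unit = streetLine.Substring(0, firstSpace);
            address.Street = streetLine.Substring(firstSpace + 1).Trim();
        }
        else
        {
            address.Street = streetLine;
        }

        // Step 3: the last comma separates the city from state and zip.
        int comma = cityLine.LastIndexOf(',');
        if (comma < 0)
        {
            address.City = cityLine;
            return address;
        }
        address.City = cityLine.Substring(0, comma).Trim();
        string rest = cityLine.Substring(comma + 1).Trim();

        // Step 4: the first space separates the state from the postal code.
        // Everything after it is kept, so a code containing a space survives.
        int space = rest.IndexOf(' ');
        if (space < 0)
        {
            address.State = rest;
        }
        else
        {
            address.State = rest.Substring(0, space);
            address.Zip = rest.Substring(space + 1).Trim();
        }
        return address;
    }
}
```

Calling `AddressSplitter.Split` on the first sample row should give Unit `6313`, Street `SW 203rd Ave`, City `Portland`, State `OR` and Zip `97224`. Because the postal code is simply "everything after the state", no numeric pattern is involved, which sidesteps Issue Three for codes that contain letters or a space. The sketch still assumes a single-token state or province abbreviation and a single `<br>`, so rows that break those assumptions would need separate handling.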