
How can I extract the city from an address using a regular expression? For example, 'Houston' from the format '113 victoria st, Houston, TX'.

Wiktor Stribiżew
Dejia Lu
  • [a-zA-Z0-9\s]+,\s+([a-zA-Z\s]+),[a-zA-Z\s]+ – Almog Feb 07 '20 at 00:15
  • If the data you're working on is clean, regex can work well. The comment above should provide one. If the data is not clean (for example, OCR'd from handwritten files), you might want to look into an NLP library. https://github.com/openvenues/libpostal is one example. – bbbbbb Feb 07 '20 at 00:20
  • Obligatory mention if you're trying to process addresses with a simple regex: [Falsehoods programmers believe about addresses](https://gist.github.com/almereyda/85fa289bfc668777fe3619298bbf0886) – Nate Eldredge Feb 07 '20 at 02:37

2 Answers


In this case:

113 victoria st, Houston, TX

the city is whatever is between the penultimate comma (optionally followed by one or more spaces) and the final comma.

The final comma is the one that precedes the two capital letters (indicating the state).

So:

.+\,\s*([^\,]+)\,\s*[A-Z]{2}$

contains a capture group from which $1 will be your city name.


Explanation of RegEx:

  • .+ - One or more of any character
  • \, - Followed by a comma
  • \s* - Followed by zero or more whitespace characters
  • [^\,]+ - Followed by one or more characters which are not a comma
  • \, - Followed by a comma
  • \s* - Followed by zero or more whitespace characters
  • [A-Z]{2} - Followed by 2 capital letters
  • $ - End of the string

Because [^\,]+ has parentheses around it, this is what gets captured and returned as $1.
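
As a rough illustration, here is how that pattern could be applied in Python. The question does not name a language, so Python's re module is an assumption, and extract_city is a hypothetical helper name. In Python the capture is read with match.group(1) rather than $1.

```python
import re

# The pattern from this answer, copied verbatim (the backslashes before
# the commas are harmless but not required).
CITY_PATTERN = re.compile(r".+\,\s*([^\,]+)\,\s*[A-Z]{2}$")


def extract_city(address):
    """Hypothetical helper: return capture group 1 (the city), or None if no match."""
    match = CITY_PATTERN.match(address)
    return match.group(1) if match else None


print(extract_city("113 victoria st, Houston, TX"))
```

Because .+ is greedy, an address with extra comma-separated parts before the city still yields the last part before the state.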

Rounin

Using lookahead and lookbehind will give you what you need:

(?<=\,\s)[a-zA-Z\s]*(?=\,\s+[A-Z]{2}$)

(?<=\,\s): The text we are looking for should be preceded by a comma and a whitespace character

[a-zA-Z\s]*: The text we are looking for should consist of letters and whitespace characters only

(?=\,\s+[A-Z]{2}$): The text we are looking for should be followed by a comma, whitespace, and two capital letters at the end of the string

Have a look at this Regex Sandbox snippet: https://regexr.com/4tpq8
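
A minimal sketch of the same lookaround pattern in Python, assuming the re module (the question does not specify a language; extract_city_lookaround is a hypothetical name). Since the lookbehind and lookahead consume no characters, the whole match is the city, so no capture group is needed.

```python
import re

# The lookaround pattern from this answer, copied verbatim.
# Python requires fixed-width lookbehinds; (?<=\,\s) is exactly two characters.
LOOKAROUND_PATTERN = re.compile(r"(?<=\,\s)[a-zA-Z\s]*(?=\,\s+[A-Z]{2}$)")


def extract_city_lookaround(address):
    """Hypothetical helper: return the whole match (the city), or None if no match."""
    match = LOOKAROUND_PATTERN.search(address)
    return match.group(0) if match else None


print(extract_city_lookaround("113 victoria st, Houston, TX"))
```

Note that the character class only allows letters and whitespace, so a city name containing other characters, such as "St. Louis", would not match with this pattern.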

Aydin4ik