Regex to match (extract) only words from address string

Time:09-12

I have an input address list like this:

St. Washington, 80
7-th mill B.O., 34
Pr. Lakeview, 17
Pr. Harrison, 15 k.1
St. Hillside Avenue, 26

How can I match only the words in these addresses and get a result like this:

Washington
mill
Lakeview
Harrison
Hillside Avenue

The pattern (\w+) doesn't help in my case.

CodePudding user response:

It's difficult to know what a "perfect" solution looks like here, as such input might contain all sorts of unexpected edge cases. However, here's my initial attempt, which at least handles all five of your examples correctly:

(?<= )[a-zA-Z][a-zA-Z ]*(?=,| )

Explanation:

  • (?<= ) is a look-behind for a space. I chose this rather than the more standard \b "word boundary" because, for example, you don't want the th in 7-th or the O in B.O. to be counted as a "word".
  • [a-zA-Z][a-zA-Z ]* is matching letters and spaces only, where the first matched character must be a letter. (You could also equivalently make the regex case-insensitive with the /i option, and just use a-z here.)
  • (?=,| ) is a look-ahead for a comma or space. Again I chose this rather than the more standard \b "word boundary" because, for example, you don't want the B in B.O. to be counted as a "word".
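
To check the pattern against the five sample addresses, here is a minimal sketch. The question doesn't say which language is in use, so Python's re module is an assumption on my part, and the names PATTERN, NAIVE and extract_words are purely illustrative. It also runs a plain \b-based pattern on one address to show why the look-arounds were chosen instead.

```python
import re

# The pattern from the answer: a run of letters/spaces that starts with a
# letter, is preceded by a space, and is followed by a comma or a space.
PATTERN = re.compile(r"(?<= )[a-zA-Z][a-zA-Z ]*(?=,| )")

# For contrast only: the "more standard" word-boundary version.
NAIVE = re.compile(r"\b[a-zA-Z]+\b")

addresses = [
    "St. Washington, 80",
    "7-th mill B.O., 34",
    "Pr. Lakeview, 17",
    "Pr. Harrison, 15 k.1",
    "St. Hillside Avenue, 26",
]


def extract_words(address):
    """Return every substring of `address` matched by PATTERN."""
    return PATTERN.findall(address)


results = [extract_words(a) for a in addresses]
for address, words in zip(addresses, results):
    print(f"{address!r} -> {words}")

# The word-boundary version also picks up the fragments we want to skip.
print(NAIVE.findall("7-th mill B.O., 34"))  # ['th', 'mill', 'B', 'O']
```

Each address yields exactly one match here (Washington, mill, Lakeview, Harrison, Hillside Avenue). Real address data will likely need more rules, for example a word at the very start of the string is never matched because nothing precedes it.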