I have an input address list like this:
St. Washington, 80
7-th mill B.O., 34
Pr. Lakeview, 17
Pr. Harrison, 15 k.1
St. Hillside Avenue, 26
How can I match only the words from these addresses and get a result like this:
Washington
mill
Lakeview
Harrison
Hillside Avenue
The pattern (\w )
doesn't help in my case.
CodePudding user response:
It's difficult to know what a "perfect" solution looks like here, since input like this can contain all sorts of unexpected edge cases. However, here's my initial attempt, which at least handles all five of the examples you have given correctly:
(?<= )[a-zA-Z][a-zA-Z ]*(?=,| )
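As a quick check, here is a minimal sketch that applies the pattern to the five sample lines. The thread does not name a language, so the use of Python's re module and the names PATTERN, addresses and results are assumptions made for illustration only.

```python
import re

# The pattern from the answer, unchanged.
PATTERN = re.compile(r"(?<= )[a-zA-Z][a-zA-Z ]*(?=,| )")

# The five sample addresses from the question.
addresses = [
    "St. Washington, 80",
    "7-th mill B.O., 34",
    "Pr. Lakeview, 17",
    "Pr. Harrison, 15 k.1",
    "St. Hillside Avenue, 26",
]

# Keep the first match on each line, or None if a line has no match.
results = []
for line in addresses:
    match = PATTERN.search(line)
    results.append(match.group(0) if match else None)

for name in results:
    print(name)
```

With the sample input this prints Washington, mill, Lakeview, Harrison and Hillside Avenue, one per line.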
Explanation:
(?<= )
is a look-behind for a space. I chose this rather than the more standard \b "word boundary" because, for example, you don't want the th in 7-th or the O in B.O. to be counted as a "word".
[a-zA-Z][a-zA-Z ]*
matches letters and spaces only, where the first matched character must be a letter. (You could equivalently make the regex case-insensitive with the /i option and just use a-z here.)
(?=,| )
is a look-ahead for a comma or space. Again, I chose this rather than the more standard \b "word boundary" because, for example, you don't want the B in B.O. to be counted as a "word".
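To see the difference the look-arounds make, the sketch below runs a \b-based variant next to the look-around pattern on the trickiest sample line. It assumes Python's re module; the \b variant and the names WITH_BOUNDARIES, WITH_LOOKAROUNDS, naive and careful are hypothetical and appear here only for comparison.

```python
import re

line = "7-th mill B.O., 34"

# Hypothetical variant that swaps both look-arounds for \b word boundaries.
WITH_BOUNDARIES = re.compile(r"\b[a-zA-Z][a-zA-Z ]*\b")

# The look-around pattern from the answer, unchanged.
WITH_LOOKAROUNDS = re.compile(r"(?<= )[a-zA-Z][a-zA-Z ]*(?=,| )")

naive = WITH_BOUNDARIES.findall(line)
careful = WITH_LOOKAROUNDS.findall(line)

print(naive)    # ['th mill B', 'O']
print(careful)  # ['mill']
```

The \b variant starts matching at the th in 7-th and runs on through the B in B.O., because a hyphen or a full stop next to a letter also counts as a word boundary. Requiring a space before the match and a comma or space after it rules those fragments out.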