I'm trying to correct address data where the street address and city/town information were appended without a space.
The easiest way to identify split points is by looking for a road type (e.g. STREET, ST.) followed by a word, for example:
1201 WEST FRONT STREETCHESTER PA 19013-3496
However, some splits occur on other words, such as SOUTH. We don't need to detect these for the moment.
703 6TH STREET SOUTHTEXAS CITY TX 77590
The following regex,
(ST(?:REET)?)\.?([A-Z]{3,})
works well for most examples, but in the second example it matches the word STREET as ST followed by REET. IIRC regexes are greedy by default, so I don't understand why this is happening. My understanding was that the first capture group should consume all of STREET and prevent the second from matching. I've tried rewriting the regex as
(STREET|ST)\.?([A-Z]{3,})
but this doesn't change anything.
Are there any ways of rewriting the regex, or any compilation flags, that can help?
Solution
For those interested, using Michal's regex as a starting point, the final regex I used was:
\s((?:(?!STREET|STATE)ST|STREET|LANE|LN|(?!DRIVE)DR|DRIVE|ROAD|RD|[0-9] |(?!AVENUE)AVE|AVENUE|BOULEVARD|BLVD|HWY|HIGHWAY|WEST|EAST|(?!NORTHEAST|NORTHWEST)NORTH|(?!SOUTHEAST|SOUTHWEST)SOUTH|N\.|S\.|W\.|E\.)\.?)(?=[A-Z]{3,})
It handles Street, Lane, Drive, Avenue, Boulevard, Highway, and splits on cardinal directions for the EPA's TSCA data.
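As a rough usage sketch, the pattern can be applied with Python's re.sub. The regex is copied exactly as printed above and only split across string literals for readability. The helper name split_address and the replacement string are assumptions made for illustration, not part of the original post. Because the leading \s is consumed by the match, the replacement puts a space back before the captured word and adds the missing one after it.

```python
import re

# Final regex from the post, copied as printed (adjacent literals are concatenated).
SPLIT_RE = re.compile(
    r"\s((?:(?!STREET|STATE)ST|STREET|LANE|LN|(?!DRIVE)DR|DRIVE|ROAD|RD|[0-9] |"
    r"(?!AVENUE)AVE|AVENUE|BOULEVARD|BLVD|HWY|HIGHWAY|WEST|EAST|"
    r"(?!NORTHEAST|NORTHWEST)NORTH|(?!SOUTHEAST|SOUTHWEST)SOUTH|"
    r"N\.|S\.|W\.|E\.)\.?)(?=[A-Z]{3,})"
)


def split_address(address: str) -> str:
    """Hypothetical helper: re-insert the missing space after the matched word."""
    # " \1 " restores the whitespace eaten by \s and adds the missing space.
    return SPLIT_RE.sub(r" \1 ", address)


print(split_address("1201 WEST FRONT STREETCHESTER PA 19013-3496"))
print(split_address("703 6TH STREET SOUTHTEXAS CITY TX 77590"))
```

With the two sample addresses from the question, this splits STREETCHESTER after STREET and SOUTHTEXAS after SOUTH.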
CodePudding user response:
You could use pattern:
( (?!STREET)ST|STREET)(?=[A-Z])
Explanation:
(...) - capturing group
the leading space - matches a space literally
(?!...) - negative lookahead assertion
STREET - matches STREET literally
ST - matches ST literally
| - alternation operator
(?=...) - positive lookahead assertion
[A-Z] - character class, matches any character in the range A-Z, i.e. any uppercase English letter
The replacement pattern would be \1 followed by a space, i.e. the first capturing group (either ST or STREET) followed by a space.
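A minimal sketch of how this pattern and replacement could be used from Python follows. The function name add_space and the address "12 MAIN STCHESTER PA" are made up for illustration; the other two addresses come from the question.

```python
import re

# Pattern from this answer: " ST" (but not " STREET") or "STREET",
# each only when directly followed by an uppercase letter.
MICHAL_RE = re.compile(r"( (?!STREET)ST|STREET)(?=[A-Z])")


def add_space(address: str) -> str:
    """Hypothetical helper: append a space after the captured road type."""
    return MICHAL_RE.sub(r"\1 ", address)


print(add_space("1201 WEST FRONT STREETCHESTER PA 19013-3496"))
print(add_space("12 MAIN STCHESTER PA"))
print(add_space("703 6TH STREET SOUTHTEXAS CITY TX 77590"))  # left unchanged
```

Note that the lookahead only requires a single following letter, so a word such as STATE that follows a space would also be split after ST.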
CodePudding user response:
The reason that
(ST(?:REET)?)\.?([A-Z]{3,})
and
(STREET|ST)\.?([A-Z]{3,})
both match the single word STREET is that the regex engine only has to find some way to match the whole pattern, and the dot is optional.
As the dot is optional, the regex can match either STREET[A-Z]{3,} or ST[A-Z]{3,}
The first cannot match the word STREET on its own, as there must be 3 or more characters after STREET. The second can: when the greedy attempt fails, the engine backtracks, matches only ST, and lets [A-Z]{3,} consume REET.
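The backtracking can be observed directly with Python's re module. This is only a small demonstration using the second address from the question; the variable names are arbitrary.

```python
import re

address = "703 6TH STREET SOUTHTEXAS CITY TX 77590"

# Both variants first try STREET, fail on the following space,
# then backtrack to ST and let [A-Z]{3,} take REET.
optional_group = re.search(r"(ST(?:REET)?)\.?([A-Z]{3,})", address)
alternation = re.search(r"(STREET|ST)\.?([A-Z]{3,})", address)

print(optional_group.groups())  # ('ST', 'REET')
print(alternation.groups())     # ('ST', 'REET')
```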
What you could do is match either STREET followed by 3 or more characters, or ST. (with a mandatory dot) followed by 3 or more characters, so that the word STREET on its own is not matched:
(STREET|ST\.)([A-Z]{3,})
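A short sketch of this pattern used with re.sub, assuming a replacement of the two groups separated by a space. The helper name split_street and the "12 MAIN ..." addresses are hypothetical.

```python
import re

# STREET or "ST." (dot required), followed by 3 or more uppercase letters.
FIXED_RE = re.compile(r"(STREET|ST\.)([A-Z]{3,})")


def split_street(address: str) -> str:
    """Hypothetical helper: put a space between the road type and the town."""
    return FIXED_RE.sub(r"\1 \2", address)


print(split_street("1201 WEST FRONT STREETCHESTER PA 19013-3496"))
print(split_street("12 MAIN ST.CHESTER PA"))
print(split_street("703 6TH STREET SOUTHTEXAS CITY TX 77590"))  # left unchanged
```

The trade-off is that an abbreviation written without the dot, such as STCHESTER, is no longer matched.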
Other possible options:
1.) Match ST and optionally REET, but reject the match when ST is followed by REET and a word boundary (i.e. the complete word STREET):
\bST(?!REET\b)(?:REET)?
2.) Using the regex module from PyPI, with an optional non-capturing group and a possessive quantifier followed by a non-word boundary:
\bST(?:REET)?+\B
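The two options can be compared with a small script. This is a sketch under a few assumptions: the possessive pattern is written with the ?+ quantifier that the description calls for, and it is compiled with the standard re module, which accepts possessive quantifiers from Python 3.11 onward (the third-party regex module accepts the same syntax). On older interpreters the second check is skipped.

```python
import re
import sys

samples = [
    "1201 WEST FRONT STREETCHESTER PA 19013-3496",
    "703 6TH STREET SOUTHTEXAS CITY TX 77590",
]

# Option 1: the negative lookahead rejects ST when the whole word is STREET.
lookahead_re = re.compile(r"\bST(?!REET\b)(?:REET)?")
lookahead_hits = [lookahead_re.findall(s) for s in samples]

# Option 2: possessive ?+ keeps REET once matched, so \B cannot be
# satisfied by backtracking to ST. Needs Python 3.11+ with the re module.
possessive_hits = None
if sys.version_info >= (3, 11):
    possessive_re = re.compile(r"\bST(?:REET)?+\B")
    possessive_hits = [possessive_re.findall(s) for s in samples]

print(lookahead_hits)
print(possessive_hits)
```

Neither pattern matches the standalone word STREET in the second address. Option 1 on its own still matches a standalone ST, so in practice it would need a following condition such as the [A-Z]{3,} part of the original regex.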