Regex Splitting Incorrectly Formatted Addresses

I'm trying to correct address data where the street address and city/town information were appended without a space.

The easiest way to identify split points is by looking for a road type (e.g. STREET, ST.) followed by a word, for example:

1201 WEST FRONT STREETCHESTER PA 19013-3496

However, some splits occur on other words, such as SOUTH. We don't need to detect these for the moment.

703 6TH STREET SOUTHTEXAS CITY TX 77590

The following regex,

(ST(?:REET)?)\.?([A-Z]{3,})

works well for most examples, but in the second example it matches the word STREET itself, splitting it into ST and REET. IIRC regexes are greedy by default, so I don't understand why this is happening. My understanding was that the first capture group would consume all of STREET and so prevent the second group from matching REET. I've tried rewriting the regex as (STREET|ST)\.?([A-Z]{3,}), but this doesn't change anything.

Are there any ways of rewriting the regex, or any compilation flags, that can help?

Solution

For those interested, using Michal's regex as a starting point, the final regex I used was:

\s((?:(?!STREET|STATE)ST|STREET|LANE|LN|(?!DRIVE)DR|DRIVE|ROAD|RD|[0-9] |(?!AVENUE)AVE|AVENUE|BOULEVARD|BLVD|HWY|HIGHWAY|WEST|EAST|(?!NORTHEAST|NORTHWEST)NORTH|(?!SOUTHEAST|SOUTHWEST)SOUTH|N\.|S\.|W\.|E\.)\.?)(?=[A-Z]{3,})

It handles Street, Lane, Drive, Avenue, Boulevard, Highway, and splits on cardinal directions for the EPA's TSCA data.
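
As a rough sketch of how this regex might be applied in Python, the snippet below re-emits each match and appends the missing space. The helper name split_address and the use of re.sub with a callback are assumptions for illustration; the original solution does not say how the substitution was done.

```python
import re

# The final pattern from above, split across string literals only for readability.
SPLIT_POINT = re.compile(
    r"\s((?:(?!STREET|STATE)ST|STREET|LANE|LN|(?!DRIVE)DR|DRIVE|ROAD|RD|[0-9] |"
    r"(?!AVENUE)AVE|AVENUE|BOULEVARD|BLVD|HWY|HIGHWAY|WEST|EAST|"
    r"(?!NORTHEAST|NORTHWEST)NORTH|(?!SOUTHEAST|SOUTHWEST)SOUTH|"
    r"N\.|S\.|W\.|E\.)\.?)(?=[A-Z]{3,})"
)

def split_address(address):
    # Hypothetical helper: keep the matched text (whitespace + road type)
    # and add a space before the letters that follow it.
    return SPLIT_POINT.sub(lambda m: m.group(0) + " ", address)

print(split_address("1201 WEST FRONT STREETCHESTER PA 19013-3496"))
# 1201 WEST FRONT STREET CHESTER PA 19013-3496
print(split_address("703 6TH STREET SOUTHTEXAS CITY TX 77590"))
# 703 6TH STREET SOUTH TEXAS CITY TX 77590
```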

CodePudding user response：

You could use pattern:

( (?!STREET)ST|STREET)(?=[A-Z])

Explanation:

(...) - capturing group

(space) - matches a space literally

(?!...) - negative lookahead assertion

STREET - matches STREET literally

ST - matches ST literally

| - alternation operator

(?=...) - positive lookahead assertion

[A-Z] - character class - matches a character from the range A-Z, so any uppercase English letter

The replacement pattern would be "\1 " (note the trailing space), i.e. the first capturing group (either ST or STREET) followed by a space.
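
A minimal sketch of that substitution in Python, assuming the standard re module. The function name split_address and the third address are made up for illustration.

```python
import re

# Pattern from this answer: " ST" that does not start STREET, or STREET,
# in either case followed by an uppercase letter.
pattern = re.compile(r"( (?!STREET)ST|STREET)(?=[A-Z])")

def split_address(address):
    # "\1 " re-emits the captured road type and appends the missing space.
    return pattern.sub(r"\1 ", address)

print(split_address("1201 WEST FRONT STREETCHESTER PA 19013-3496"))
# 1201 WEST FRONT STREET CHESTER PA 19013-3496
print(split_address("703 6TH STREET SOUTHTEXAS CITY TX 77590"))
# unchanged: STREET is followed by a space, not a letter
print(split_address("12 MAIN STSPRINGFIELD IL 62701"))  # made-up address
# 12 MAIN ST SPRINGFIELD IL 62701
```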

CodePudding user response：

The reason that both (ST(?:REET)?)\.?([A-Z]{3,}) and (STREET|ST)\.?([A-Z]{3,}) match the single word STREET is that the regex engine has to match the whole pattern, and the . is optional, so the engine backtracks until the whole pattern succeeds.

As the dot is optional, the regex can match either STREET[A-Z]{3,} or ST[A-Z]{3,}.

The first cannot match the word STREET on its own, as there must be 3 or more characters after STREET, but the second can match ST followed by 3 or more characters, which matches STREET (ST followed by REET).
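
The backtracking described above can be observed directly. This is a small sketch using Python's re module; the helper name groups_for is made up for illustration.

```python
import re

second = "703 6TH STREET SOUTHTEXAS CITY TX 77590"

patterns = [
    r"(ST(?:REET)?)\.?([A-Z]{3,})",
    r"(STREET|ST)\.?([A-Z]{3,})",
]

def groups_for(pattern, text):
    # Return the captured groups of the first match, or None if nothing matches.
    m = re.search(pattern, text)
    return m.groups() if m else None

for p in patterns:
    # STREET is tried first, but no letters follow it, so the engine
    # backtracks to ST and lets [A-Z]{3,} take REET.
    print(p, groups_for(p, second))  # ('ST', 'REET') for both patterns
```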

What you could do is match either STREET followed by 3 or more characters, or ST. followed by 3 or more characters, so that the word STREET on its own is not matched:

(STREET|ST\.)([A-Z]{3,})
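
A short sketch of this pattern used for the actual split, assuming Python's re module. The replacement string, the function name split_address and the third address are illustrative assumptions. Note that, as written, the pattern requires the dot after ST, so a bare ST glued to the city name is not split.

```python
import re

pattern = re.compile(r"(STREET|ST\.)([A-Z]{3,})")

def split_address(address):
    # Put a space between the road type (group 1) and the following word (group 2).
    return pattern.sub(r"\1 \2", address)

print(split_address("1201 WEST FRONT STREETCHESTER PA 19013-3496"))
# 1201 WEST FRONT STREET CHESTER PA 19013-3496
print(split_address("703 6TH STREET SOUTHTEXAS CITY TX 77590"))
# unchanged: STREET is no longer split into ST + REET
print(split_address("12 MAIN ST.SPRINGFIELD IL 62701"))  # made-up address
# 12 MAIN ST. SPRINGFIELD IL 62701
```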

Other possible options:

1.) Match ST and optionally match REET only if REET is not followed by a word boundary:

\bST(?!REET\b)(?:REET)?
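
To illustrate what this pattern does and does not match, here is a small sketch with Python's re module. The helper road_type and the third address are made up for illustration.

```python
import re

street = re.compile(r"\bST(?!REET\b)(?:REET)?")

def road_type(address):
    # Return the matched road-type text, or None when the pattern does not match.
    m = street.search(address)
    return m.group(0) if m else None

print(road_type("1201 WEST FRONT STREETCHESTER PA 19013-3496"))  # STREET
print(road_type("703 6TH STREET SOUTHTEXAS CITY TX 77590"))      # None
print(road_type("12 MAIN STSPRINGFIELD IL 62701"))               # ST (made-up address)
```

In the second address the lookahead rejects ST because REET is followed by a word boundary, so the standalone word STREET is never broken up.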

2.) Using the PyPI regex module, with an optional non-capturing group and a possessive quantifier followed by a non-word boundary:

\bST(?:REET)?+\B
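
A hedged sketch of the possessive-quantifier idea, written as ?+ after the optional group. Python's built-in re module accepts possessive quantifiers from version 3.11 onwards, so the sketch uses re rather than the third-party regex module. For older interpreters it falls back to a lookahead-plus-backreference emulation, which is a substitute technique and not what the answer proposes. The helper road_type and the third address are made up for illustration.

```python
import re
import sys

if sys.version_info >= (3, 11):
    # Possessive optional group: once REET is taken it is never given back.
    street = re.compile(r"\bST(?:REET)?+\B")
else:
    # Emulation for older versions: capture inside a lookahead (which is atomic),
    # then consume the captured text with a backreference.
    street = re.compile(r"\bST(?=((?:REET)?))\1\B")

def road_type(address):
    # Return the matched road-type text, or None when the pattern does not match.
    m = street.search(address)
    return m.group(0) if m else None

print(road_type("1201 WEST FRONT STREETCHESTER PA 19013-3496"))  # STREET
print(road_type("703 6TH STREET SOUTHTEXAS CITY TX 77590"))      # None
print(road_type("12 MAIN STSPRINGFIELD IL 62701"))               # ST (made-up address)
```

Because REET cannot be given back, the \B after STREET fails in the second address and the engine has no way to fall back to ST followed by REET.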
