Regex: Match words at end of line but do not include X

Time:04-15

I am trying to get the span of the city name from some addresses, but I am struggling with the required regex. Examples of the address format are below.

flat 1, tower block, 34 long road, Major city

flat 1, tower block, 34 long road, town and parking space

34 short road, village on the river and carpark (7X3 8RG)

The expected text to be captured in each case is "Major city", "town" and "village on the river". The issue is that sometimes "and parking space" or a variant is included in the address. Using a regex such as "(?<=,\s)\w+" would return "town and parking space" in the case of example 2.

The city is always after the last comma of the address.

I have tried to re-work this question but have not successfully managed to exclude the "and parking space" section.

I have already created a regex that excludes the postcodes; it is mentioned here only because an ideal answer would allow that part of the regex to be bolted on the end.

How would I create a regex that starts after the last comma and runs to the end of the address but stops at any "and parking" or postcodes?

CodePudding user response:

You can capture these strings using any of the following patterns:

,\s*((?:(?!\sand\s)[^,])*)(?=[^,]*$)
,\s*([^,]*?)(?=(?:\sand\s[^,]*)?$)
.*,\s*((?:(?!\sand\s)[^,])*)
.*,\s*([^,]*?)(?=(?:\sand\s[^,]*)?$)

See this regex demo or this regex demo.
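
As a quick sanity check, the four patterns can be run side by side against the three addresses from the question. This is only a sketch: the helper name capture_all is made up for illustration, and re.search is assumed as the matching call (as in the Python snippet further down).

```python
import re

# The four patterns from the answer, copied unchanged.
patterns = [
    r',\s*((?:(?!\sand\s)[^,])*)(?=[^,]*$)',
    r',\s*([^,]*?)(?=(?:\sand\s[^,]*)?$)',
    r'.*,\s*((?:(?!\sand\s)[^,])*)',
    r'.*,\s*([^,]*?)(?=(?:\sand\s[^,]*)?$)',
]

# The three sample addresses from the question.
addresses = [
    'flat 1, tower block, 34 long road, Major city',
    'flat 1, tower block, 34 long road, town and parking space',
    '34 short road, village on the river and carpark (7X3 8RG)',
]

def capture_all(pattern, texts):
    """Return group 1 of the first match in each text (None if no match)."""
    results = []
    for text in texts:
        m = re.search(pattern, text)
        results.append(m.group(1) if m else None)
    return results

# Print one result list per pattern so they can be compared by eye.
for p in patterns:
    print(capture_all(p, addresses))
```

The first two patterns rely on the trailing lookahead to reject every comma except the last one, while the last two let the greedy .* run to the last comma before capturing.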

Details (first pattern):

  • , - a comma
  • \s* - zero or more whitespaces
  • ((?:(?!\sand\s)[^,])*) - Group 1: zero or more occurrences of any char other than a comma, as long as that char does not start a whitespace + and + whitespace sequence
  • (?=[^,]*$) - a positive lookahead requiring that only chars other than a comma remain up to the end of the string, so the match can only start at the last comma.
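
The breakdown above can also be written directly into the pattern with re.VERBOSE, which lets each piece carry its own comment. A minimal sketch; the names CITY_RX and city_of are illustrative and not part of the original answer.

```python
import re

# First pattern from the answer, spread out in verbose mode.
# Whitespace and #-comments inside the pattern are ignored by re.VERBOSE.
CITY_RX = re.compile(r"""
    ,                 # the comma in front of the city
    \s*               # zero or more whitespaces
    (                 # Group 1: the city name
      (?:
        (?!\sand\s)   # stop if whitespace + and + whitespace starts here
        [^,]          # otherwise consume one char that is not a comma
      )*
    )
    (?=[^,]*$)        # only non-comma chars may remain up to the end
""", re.VERBOSE)

def city_of(address):
    """Return the captured city, or None when the address has no comma."""
    m = CITY_RX.search(address)
    return m.group(1) if m else None

print(city_of('flat 1, tower block, 34 long road, town and parking space'))
```

Because the negative lookahead requires whitespace on both sides of "and", a name that merely contains those letters (for example "Sandbach") is captured whole.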

In Python, you would use

m = re.search(r'.*,\s*([^,]*?)(?=(?:\sand\s[^,]*)?$)', text)
if m:
    print(m.group(1))

See the demo:

import re
texts = ['flat 1, tower block, 34 long road, Major city',
'flat 1, tower block, 34 long road, town and parking space',
'34 short road, village on the river and carpark (7X3 8RG)']
rx = re.compile(r'.*,\s*([^,]*?)(?=(?:\sand\s[^,]*)?$)')
for text in texts:
    m = rx.search(text)
    if m:
        print(m.group(1))

Output:

Major city
town
village on the river
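
The question also asks for a way to bolt a postcode pattern onto the end. One possible sketch is to add it as a second alternative inside the trailing lookahead. The asker's own postcode regex is not shown, so POSTCODE below is a hypothetical stand-in that simply matches a trailing parenthesised chunk; swap in the real pattern as needed.

```python
import re

# Hypothetical stand-in for the asker's postcode regex (an assumption):
# optional whitespace followed by a parenthesised chunk such as "(7X3 8RG)".
POSTCODE = r'\s*\([^(),]*\)'

# Fourth pattern from the answer, with the postcode added as a second
# optional tail inside the lookahead.
CITY_RX = re.compile(
    rf'.*,\s*([^,]*?)(?=(?:\sand\s[^,]*|{POSTCODE}\s*)?$)'
)

def city_of(address):
    """Return the city after the last comma, without trailing extras."""
    m = CITY_RX.search(address)
    return m.group(1) if m else None

for address in [
    'flat 1, tower block, 34 long road, Major city',
    '34 short road, village on the river and carpark (7X3 8RG)',
    '34 short road, village (7X3 8RG)',  # postcode without an "and" clause
]:
    print(city_of(address))
```

Keeping the tail alternatives inside one lookahead means the capture group itself stays unchanged, so further stop conditions can be added as extra alternatives.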

CodePudding user response:

I would do:

import re 

exp = ['flat 1, tower block, 34 long road, Major city',
'flat 1, tower block, 34 long road, town and parking space',
'34 short road, village on the river and carpark (7X3 8RG)']

for e in (re.split(r',\s*', x)[-1] for x in exp):
    print(re.sub(r'(?:\s+and car.*)|(?:\s+and parking.*)','',e))

Prints:

Major city
town
village on the river

Works like this:

  1. Split the string on ,\s* and take the last portion;
  2. Remove anything from the end of that string that starts with the specified (?:\s+and car.*)|(?:\s+and parking.*)

You can easily add additional clauses to remove with this approach.
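
One way to keep those clauses manageable is to build the removal pattern from a list. This is a sketch only: the names EXTRAS, TAIL_RX and city_of are illustrative, and 'garage' is a made-up extra clause that does not appear in the question.

```python
import re

# Words that may follow "and" at the end of an address.
# 'car' and 'parking' come from the answer; 'garage' is a hypothetical addition.
EXTRAS = ['car', 'parking', 'garage']

# Whitespace, "and", whitespace, one of the extras, then the rest of the string.
TAIL_RX = re.compile(
    r'\s+and\s+(?:' + '|'.join(map(re.escape, EXTRAS)) + r').*'
)

def city_of(address):
    """Take the part after the last comma and strip a trailing 'and ...' clause."""
    last_part = re.split(r',\s*', address)[-1]
    return TAIL_RX.sub('', last_part)

for address in [
    'flat 1, tower block, 34 long road, town and parking space',
    '34 short road, village on the river and carpark (7X3 8RG)',
    '1 mews, hamlet and garage',
]:
    print(city_of(address))
```

Adding a new clause is then a one-word change to EXTRAS rather than another alternation written by hand.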