Regex: Find text after last match of lookahead assertion until first match of lookahead assertion

Time:11-09

I have a text like this:

Hans Wurst, geboren 25.01.1987, zuletzt tätig als Metzger, Breslauer Str. 15, 02708 Löbau

from which I would like to get the street name ("Breslauer Str. 15"). So I used a regex like the one below:

(?<=, )(.+?)(?=,[\s]?[0-9]{5})

But this is too greedy and matches:

geboren 25.01.1987, zuletzt tätig als Metzger, Breslauer Str. 15

How can I make this less greedy, so that it essentially starts at the last occurrence of a comma that still satisfies the lookahead assertion (?=,[\s]?[0-9]{5})?

CodePudding user response:

You can use

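import re

text = "Hans Wurst, geboren 25.01.1987, zuletzt tätig als Metzger, Breslauer Str. 15, 02708 Löbau"
m = re.search(r'.*,\s*([^,]*),\s*[0-9]{5}', text, re.DOTALL)
if m:
    print(m.group(1))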


PS: a small recommendation: if you can solve this with plain Python (regex is another language), do so. A workaround could be the following:

text = "Hans Wurst, geboren 25.01.1987, zuletzt tätig als Metzger, Breslauer Str. 15, 02708 Löbau"

print(text.split(', ')[-2])
# 'Breslauer Str. 15'

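This is more Pythonic, easier to understand, and likely faster, provided the fields are always separated by ", " and the street is always the second-to-last field.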
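To illustrate where the two approaches differ, here is a minimal sketch. The function names and the extra trailing field are made up for illustration, and the regex is a slightly stricter variant of the one above (it requires at least one character in the street field and a word boundary after the postal code).

```python
import re

def street_by_split(text):
    # Relies on fields being separated by exactly ", "
    # and on the street being the second-to-last field.
    return text.split(', ')[-2]

def street_by_regex(text):
    # Takes the comma-free field directly before a five-digit postal code.
    m = re.search(r',\s*([^,]+),\s*[0-9]{5}\b', text)
    return m.group(1) if m else None

text = "Hans Wurst, geboren 25.01.1987, zuletzt tätig als Metzger, Breslauer Str. 15, 02708 Löbau"
print(street_by_split(text))    # Breslauer Str. 15
print(street_by_regex(text))    # Breslauer Str. 15

# A hypothetical extra trailing field shifts the split index
# but does not affect the regex match.
longer = text + ", Deutschland"
print(street_by_split(longer))  # 02708 Löbau
print(street_by_regex(longer))  # Breslauer Str. 15
```

So the split version is fine as long as the record layout is fixed; the regex version depends on the postal code rather than on the field position.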

CodePudding user response:

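Simply add [^,].* at the beginning of your regex pattern. [^,].* means one character that is not a comma, followed by any characters, as many as possible. Because .* is greedy, the rest of the pattern then matches at the last possible position.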


Python demo:

import re

s = 'Hans Wurst, geboren 25.01.1987, zuletzt tätig als Metzger, Breslauer Str. 15, 02708 Löbau'
m = re.search(r'[^,].*(?<=, )(.+)(?=,[\s]?[0-9]{5})', s)
if m:
    print(m.group(1))

Output:

Breslauer Str. 15
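As a rough illustration of why the prefix helps, the sketch below runs the same core pattern with and without it. The variable names `core`, `without_prefix` and `with_prefix` are only used for this example.

```python
import re

s = 'Hans Wurst, geboren 25.01.1987, zuletzt tätig als Metzger, Breslauer Str. 15, 02708 Löbau'
core = r'(?<=, )(.+)(?=,[\s]?[0-9]{5})'

# Without a prefix, the engine reports the leftmost match,
# so the group starts right after the first ", ".
without_prefix = re.search(core, s).group(1)
print(without_prefix)
# geboren 25.01.1987, zuletzt tätig als Metzger, Breslauer Str. 15

# With the greedy prefix, .* first runs to the end of the string and
# backtracks only as far as needed, so the group starts after the last
# ", " from which the lookahead can still succeed.
with_prefix = re.search(r'[^,].*' + core, s).group(1)
print(with_prefix)
# Breslauer Str. 15
```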

CodePudding user response:

In your pattern, you can just change (.+?) to [^,]+. The dot can also match a comma and therefore matches too much, whereas the negated character class cannot match a comma.

As you use lookarounds, you can omit the capture group.

See a regex demo for pattern (?<=, )[^,]+(?=,\s?[0-9]{5})
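A minimal sketch of using the lookaround-only pattern from Python: since there is no capture group, the street is the whole match. The second record is invented purely for illustration.

```python
import re

# Lookbehind and lookahead only assert context; they are not part of the match.
pattern = re.compile(r'(?<=, )[^,]+(?=,\s?[0-9]{5})')

s = "Hans Wurst, geboren 25.01.1987, zuletzt tätig als Metzger, Breslauer Str. 15, 02708 Löbau"
m = pattern.search(s)
if m:
    print(m.group())  # whole match: Breslauer Str. 15

# With no groups in the pattern, findall returns the whole matches.
records = [
    s,
    "Erika Mustermann, geboren 12.08.1964, Heidestraße 17, 51147 Köln",  # made-up record
]
streets = [pattern.findall(r) for r in records]
print(streets)  # [['Breslauer Str. 15'], ['Heidestraße 17']]
```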


But as you already use a capture group, you can turn the lookarounds into regular matches instead, which makes the pattern a bit more performant.

Note that the \s does not have to be in a character class.

, ([^,]+),[\s]?[0-9]{5}\b

The pattern matches:

  • , Match a comma and a space literally
  • ([^,]+) Capture group 1, match 1+ chars other than ,
  • ,\s? Match a comma and optional whitespace char
  • [0-9]{5}\b Match 5 digits and a word boundary to prevent a partial match


import re

s = "Hans Wurst, geboren 25.01.1987, zuletzt tätig als Metzger, Breslauer Str. 15, 02708 Löbau"
pattern = r", ([^,]+),[\s]?[0-9]{5}\b"
m = re.search(pattern, s)
if m:
    print(m.group(1))

Output

Breslauer Str. 15