I have a text like this:
Hans Wurst, geboren 25.01.1987, zuletzt tätig als Metzger, Breslauer Str. 15, 02708 Löbau
from which I would like to get the street name ("Breslauer Str. 15"). So I used a regex like the one below:
(?<=, )(.+?)(?=,[\s]?[0-9]{5})
But this matches too much and gives me:
geboren 25.01.1987, zuletzt tätig als Metzger, Breslauer Str. 15
How can I make this less greedy so that it essentially takes the last occurrence of a comma, taking into account the lookahead assertion (?=,[\s]?[0-9]{5})?
CodePudding user response:
You can use
import re

# text is the input string from the question
m = re.search(r'.*,\s*([^,]*),\s*[0-9]{5}', text, re.DOTALL)
if m:
    print(m.group(1))
PS: A small recommendation: if you can solve this using plain Python (regex is a language of its own), do it. A workaround could be the following:
text = "Hans Wurst, geboren 25.01.1987, zuletzt tätig als Metzger, Breslauer Str. 15, 02708 Löbau"
print(text.split(', ')[-2])
# 'Breslauer Str. 15'
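This is more Pythonic, easier to understand, and likely faster, provided the fields are always separated by ', ' and the street is the second-to-last field.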
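The split approach assumes a fixed field layout, so it may be worth guarding against records with too few fields. A minimal sketch of such a guard follows; the helper name street_from_record is hypothetical and not part of the original answer:

```python
def street_from_record(record):
    """Return the second-to-last comma-separated field, or None if there are too few fields."""
    # rsplit with maxsplit=2 splits only at the last two commas
    parts = record.rsplit(",", 2)
    if len(parts) < 3:
        return None
    # strip the whitespace that follows each comma
    return parts[-2].strip()


text = "Hans Wurst, geboren 25.01.1987, zuletzt tätig als Metzger, Breslauer Str. 15, 02708 Löbau"
print(street_from_record(text))
print(street_from_record("no commas here"))
```

For the sample text this prints Breslauer Str. 15, and None for the input without commas.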
CodePudding user response:
Simply add [^,].* at the beginning of your regex pattern. The pattern [^,].* means 'a character that is not a comma' followed by any character any number of times.
Python demo:
import re
s = 'Hans Wurst, geboren 25.01.1987, zuletzt tätig als Metzger, Breslauer Str. 15, 02708 Löbau'
m = re.search(r'[^,].*(?<=, )(.+)(?=,[\s]?[0-9]{5})', s)
if m:
    print(m.group(1))
Output:
Breslauer Str. 15
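To see why the prefix helps, it can be useful to compare the whole match with the capture group. The sketch below assumes the same sample string: the greedy .* consumes as much as possible and then backtracks, so the capture group starts after the last ', ' that still allows the lookahead to succeed.

```python
import re

s = 'Hans Wurst, geboren 25.01.1987, zuletzt tätig als Metzger, Breslauer Str. 15, 02708 Löbau'
m = re.search(r'[^,].*(?<=, )(.+)(?=,[\s]?[0-9]{5})', s)

# The overall match starts at the first character of the string...
print(m.group(0))
# ...while group 1 holds only the last field before the postal code
print(m.group(1))
```

Here group(0) runs from "Hans" up to "Str. 15", whereas group(1) contains only the street.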
CodePudding user response:
In your pattern, you can just change (.+?) to [^,]+. The dot can also match a comma and will match too much; the negated character class cannot match a comma in this case.
As you use lookarounds, you can omit the capture group.
The pattern then becomes (?<=, )[^,]+(?=,\s?[0-9]{5})
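The difference between the lazy dot and the negated character class can be checked side by side. This is a small sketch using the sample string from the question; the variable names are only illustrative.

```python
import re

s = "Hans Wurst, geboren 25.01.1987, zuletzt tätig als Metzger, Breslauer Str. 15, 02708 Löbau"

# Lazy dot: the match starts at the first ", " and may cross later commas
lazy_dot = re.search(r"(?<=, )(.+?)(?=,\s?[0-9]{5})", s)
# Negated class: the match cannot cross a comma, so only the last field fits
negated = re.search(r"(?<=, )[^,]+(?=,\s?[0-9]{5})", s)

print(lazy_dot.group(1))
print(negated.group(0))
```

The first line printed is the over-long match from the question; the second is Breslauer Str. 15. A lazy quantifier only keeps the end of the match as short as possible; it does not move the start, and the regex engine always reports the leftmost match.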
But as you already make use of a capture group, you can turn the lookarounds into plain matches instead, which makes the pattern a bit more performant.
Note that the \s does not have to be in a character class.
, ([^,]+),\s?[0-9]{5}\b
The pattern matches:
", " matches a comma and a space literally
([^,]+) is capture group 1 and matches one or more characters other than a comma
,\s? matches a comma and an optional whitespace character
[0-9]{5}\b matches 5 digits and a word boundary to prevent a partial match
import re
s = "Hans Wurst, geboren 25.01.1987, zuletzt tätig als Metzger, Breslauer Str. 15, 02708 Löbau"
pattern = r", ([^,]+),\s?[0-9]{5}\b"
m = re.search(pattern, s)
if m:
    print(m.group(1))
Output
Breslauer Str. 15