Extract the first complete address from string

Time:06-29

I am unsure how to tell a regular expression in Python to stop after finding the first match.

Apparently you can make a regex lazy (see "RegEx - stop after first match"). I tried placing (.*?) at the end of my expression, but that just broke it. I just want it to stop after finding the first complete address and return that.

Sample code with data: https://regexr.com/6okuv

In the sample data all addresses are accepted by the expression except "Hindenburgdamm 27, Hygiene-Institut" where it should stop after "27" and return "Hindenburgdamm 27" and "Peschkestr. 5a/Holsteinische Str. 44" where it should stop after "5a" and return "Peschkestr. 5a".

Regex expression : 
^([A-Za-zÄäÖöÜüß\s\d.-]+?)\s*([\d\s]+(?:\s?[-+/]\s?\d+)?\s*[A-Za-z]?-?[A-Za-z]?)?$

Sample data:
Berliner Str. 74
Hindenburgdamm 27, Hygiene-Institut
Peschkestr. 5a/Holsteinische Str. 44
Lankwitzer Str. 13-17a
Fidicinstr. 15A
Haudegen Weg 15/17
Johanna-Stegen-Strasse 14a-d
Friedrichshaller Str. 7
Südwestkorso 9

CodePudding user response:

You could make the pattern a bit more specific for the digits and the trailing characters, require at least one digit, and use a case-insensitive match:

^([A-ZÄäÖöÜüß.\s-]+?)\s*(\d+(?:[/-]\d+)?(?:[A-Z](?:-[A-Z])?)?)\b

Explanation

  • ^ Start of string
  • ([A-ZÄäÖöÜüß.\s-]+?) Capture group 1: match 1+ of the listed characters, as few as possible
  • \s* Match optional whitespace chars
  • ( Capture group 2
    • \d+ Match 1+ digits
    • (?:[/-]\d+)? Optionally match / or - and 1+ digits
    • (?:[A-Z](?:-[A-Z])?)? Optionally match A-Z followed by an optional - and A-Z
  • ) Close group 2
  • \b A word boundary
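Since the question is about Python, here is a minimal sketch of how the pattern above might be applied with the standard re module. The helper name first_address is hypothetical, and re.IGNORECASE stands in for the case-insensitive match mentioned above.

```python
import re

# Pattern from the answer above; IGNORECASE lets [A-Z] also match a-z.
ADDRESS_RE = re.compile(
    r"^([A-ZÄäÖöÜüß.\s-]+?)\s*(\d+(?:[/-]\d+)?(?:[A-Z](?:-[A-Z])?)?)\b",
    re.IGNORECASE,
)


def first_address(text):
    """Return the leading 'street number' part of text, or None if absent.

    Hypothetical helper for illustration only.
    """
    m = ADDRESS_RE.match(text)
    return m.group(0) if m else None


samples = [
    "Berliner Str. 74",
    "Hindenburgdamm 27, Hygiene-Institut",
    "Peschkestr. 5a/Holsteinische Str. 44",
    "Johanna-Stegen-Strasse 14a-d",
]

for s in samples:
    print(repr(first_address(s)))
```

For the two problem lines from the question this yields "Hindenburgdamm 27" and "Peschkestr. 5a". If the street and the house number are needed separately, m.group(1) and m.group(2) hold them.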


If you want a match only and don't need the capture groups you can omit them.

Note that the leading character class contains chars like ., - and \s. If the match should not start with any of these characters, you can start with a character class without them, followed by an optionally repeated character class that does include them, so that at least 1 character is still matched.

^[A-ZÄäÖöÜüß][A-ZÄäÖöÜüß.\s-]*?\s*\d+(?:[/-]\d+)?(?:[A-Z](?:-[A-Z])?)?\b
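As a rough sketch, the group-free variant could be run over a multi-line block of text with re.findall. The re.MULTILINE flag (so that ^ matches at every line start) and the variable names are assumptions added for illustration. Because the character class contains \s, a line without any house number could let a match run on into the next line, so this is only suitable for data shaped like the sample.

```python
import re

# Match-only variant (no capture groups), so findall returns whole matches.
# MULTILINE is an assumption: it makes ^ match at the start of every line.
ADDRESS_RE = re.compile(
    r"^[A-ZÄäÖöÜüß][A-ZÄäÖöÜüß.\s-]*?\s*\d+(?:[/-]\d+)?(?:[A-Z](?:-[A-Z])?)?\b",
    re.IGNORECASE | re.MULTILINE,
)

text = "\n".join([
    "Berliner Str. 74",
    "Hindenburgdamm 27, Hygiene-Institut",
    "Peschkestr. 5a/Holsteinische Str. 44",
])

# At most one match per line: after the first address on a line,
# ^ cannot match again until the next line begins.
addresses = ADDRESS_RE.findall(text)
print(addresses)
# ['Berliner Str. 74', 'Hindenburgdamm 27', 'Peschkestr. 5a']
```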


CodePudding user response:

You can try this pattern

^([A-Za-zÄäÖöÜüß\s\d.-]+?\s[0-9a-zA-ZÄäÖöÜüß-]+?)[\s\/,]?

In any case, if you don't expect to match the full line, don't use $, since it requires the match to reach the end of the line.
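The effect of the trailing $ can be seen in a small sketch. For illustration it reuses the group-free core of the pattern from the first answer (not the pattern from this answer), once with $ appended and once without; the variable names are made up.

```python
import re

# Core pattern borrowed from the first answer, used here only to
# illustrate the effect of anchoring at the end with $.
CORE = r"^[A-ZÄäÖöÜüß.\s-]+?\s*\d+(?:[/-]\d+)?(?:[A-Z](?:-[A-Z])?)?\b"

anchored = re.compile(CORE + r"$", re.IGNORECASE)  # must reach the end
unanchored = re.compile(CORE, re.IGNORECASE)       # may stop early

line = "Hindenburgdamm 27, Hygiene-Institut"

anchored_match = anchored.match(line)
unanchored_match = unanchored.match(line)

print(anchored_match)             # None: ", Hygiene-Institut" blocks $
print(unanchored_match.group(0))  # Hindenburgdamm 27
```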