What would be the right regex to capture Dutch postal code with some data 'dirtiness'?-CodePudding

I'm having trouble extracting some data properly and into a SQL db, which i'm hoping someone can help with. The data is scraped from a website and represents Dutch postal codes and cities

The official Dutch postal codes are composed of 4 digits, a space in between, and a 2 capital letter addition. For example

1234 AA
or
5523 ZB

The website combines that with the name of the city in one HTML string. In other words this is what it looks like ideally:

1234 AA Amsterdam

The first problem that pops up is that the websites allows users to ignore the addition when plugging in a postal code

1234 Amsterdam

And the second problem is that some of the cities have names that do not begin with a standard Capital letter:

5555 AD 's-Hertogenbosch
or
2244 's-Gravenzande

I'm essentially trying to split everything, regardless of what special conditions pop up

What i would like to get in the end is each entry separated into 3 parts The postal 4 digit code, the postal addition (if any), and the name of the city (including the weird symbols)

For example: Column 1:

5555

Column 2:

AD or null

Column 3:

's-Hertogenbosch

I'm fairly new to Regex, so the only thing i could come up with was

r"^([^(a-z)]*).*"g

which admittedly is terrible.

Any help on the answer or resources that might help me get to one would be greatly appreciated!

CodePudding user response：

Maybe this can help you, I have tested it a little bit with the list data. I split x on spaces and then check if the second part has 2 capital letters. with the regex pattern [A-Z]{2}. [A-Z] Means any capital letter, and {2} means that exactly 2 of those character must be found.

But in AAA also 2 of capital letters can be found.

The regex function search returns a Match object when a match has been found and otherwise None. If you turn it into a boolean with bool(), that's either True or False. This way you can add the check on the length with the len-function to it.

import re

data = ['1234 AA Amsterdam', '1234 Amsterdam',
        '1234 Den Haag', '1234 AB Den Haag',
        "1234 AA 's Gravenhage", "1234 AA 's-Gravenhage",
        "1234 's Gravenhage", "1234 's-Gravenhage",
        "1234 De Bilt", '1234 AA De Bilt']

for x in data:
    y = x.split(' ')
    col1 = y[0]
    if bool(re.search('[A-Z]{2}',y[1])) and len(y[1])==2:
        col2 = y[1]
        col3 = ' '.join(y[2:])
    else:
        col2 = None
        col3 = ' '.join(y[1:])
    
    
    print(x)
    print(col1)
    print(col2)
    print(col3)
    print('----')

I tested the pattern with:

import re

tests = ['AX', 'Ax', 'ax', 'aX', 'AAA']
for test in tests:
    print(bool(re.search('[A-Z]{2}',test)) and len(test)==2)

This is a nice summary for regex in Python: https://www.w3schools.com/python/python_regex.asp

CodePudding user response：

You could use 3 groups to match 4 digits, then optionally 2 uppercase chars and capture the rest in the last group.

\b(?P<digit>\d{4})(?: (?P<addition>[A-Z]{2}))? (?P<city>\S.*)

The pattern matches:

\b A word boundary
(?P<digit>\d{4}) Capture 4 digits
(?: (?P<addition>[A-Z]{2}))? Optionally match a space and capture 2 uppercase chars A-Z
(?P<city>\S.*) Capture a space, a non whitespace char and the rest of the line

Regex demo

For example

import re

pattern = r"\b(?P<digit>\d{4})(?: (?P<addition>[A-Z]{2}))? (?P<city>\S.*)"

s = ("1234 AA 5523 ZB\n"
            "1234 AA Amsterdam\n"
            "1234 Amsterdam\n"
            "5555 AD 's-Hertogenbosch\n"
            "2244 's-Gravenzande")

print([m.groupdict() for m in re.finditer(pattern, s)])

Output

[{'addition': 'AA', 'city': '5523 ZB', 'digit': '1234'},
 {'addition': 'AA', 'city': 'Amsterdam', 'digit': '1234'},
 {'addition': None, 'city': 'Amsterdam', 'digit': '1234'},
 {'addition': 'AD', 'city': "'s-Hertogenbosch", 'digit': '5555'},
 {'addition': None, 'city': "'s-Gravenzande", 'digit': '2244'}]