I'm having trouble extracting some data properly and into a SQL db, which i'm hoping someone can help with. The data is scraped from a website and represents Dutch postal codes and cities
The official Dutch postal codes are composed of 4 digits, a space in between, and a 2 capital letter addition. For example
1234 AA
or
5523 ZB
The website combines that with the name of the city in one HTML string. In other words this is what it looks like ideally:
1234 AA Amsterdam
The first problem that pops up is that the websites allows users to ignore the addition when plugging in a postal code
1234 Amsterdam
And the second problem is that some of the cities have names that do not begin with a standard Capital letter:
5555 AD 's-Hertogenbosch
or
2244 's-Gravenzande
I'm essentially trying to split everything, regardless of what special conditions pop up
What i would like to get in the end is each entry separated into 3 parts The postal 4 digit code, the postal addition (if any), and the name of the city (including the weird symbols)
For example: Column 1:
5555
Column 2:
AD or null
Column 3:
's-Hertogenbosch
I'm fairly new to Regex, so the only thing i could come up with was
r"^([^(a-z)]*).*"g
which admittedly is terrible.
Any help on the answer or resources that might help me get to one would be greatly appreciated!
CodePudding user response:
Maybe this can help you, I have tested it a little bit with the list data. I split x on spaces and then check if the second part has 2 capital letters. with the regex pattern [A-Z]{2}. [A-Z] Means any capital letter, and {2} means that exactly 2 of those character must be found.
But in AAA also 2 of capital letters can be found.
The regex function search returns a Match object when a match has been found and otherwise None. If you turn it into a boolean with bool(), that's either True or False. This way you can add the check on the length with the len-function to it.
import re
data = ['1234 AA Amsterdam', '1234 Amsterdam',
'1234 Den Haag', '1234 AB Den Haag',
"1234 AA 's Gravenhage", "1234 AA 's-Gravenhage",
"1234 's Gravenhage", "1234 's-Gravenhage",
"1234 De Bilt", '1234 AA De Bilt']
for x in data:
y = x.split(' ')
col1 = y[0]
if bool(re.search('[A-Z]{2}',y[1])) and len(y[1])==2:
col2 = y[1]
col3 = ' '.join(y[2:])
else:
col2 = None
col3 = ' '.join(y[1:])
print(x)
print(col1)
print(col2)
print(col3)
print('----')
I tested the pattern with:
import re
tests = ['AX', 'Ax', 'ax', 'aX', 'AAA']
for test in tests:
print(bool(re.search('[A-Z]{2}',test)) and len(test)==2)
This is a nice summary for regex in Python: https://www.w3schools.com/python/python_regex.asp
CodePudding user response:
You could use 3 groups to match 4 digits, then optionally 2 uppercase chars and capture the rest in the last group.
\b(?P<digit>\d{4})(?: (?P<addition>[A-Z]{2}))? (?P<city>\S.*)
The pattern matches:
\b
A word boundary(?P<digit>\d{4})
Capture 4 digits(?: (?P<addition>[A-Z]{2}))?
Optionally match a space and capture 2 uppercase chars A-Z(?P<city>\S.*)
Capture a space, a non whitespace char and the rest of the line
For example
import re
pattern = r"\b(?P<digit>\d{4})(?: (?P<addition>[A-Z]{2}))? (?P<city>\S.*)"
s = ("1234 AA 5523 ZB\n"
"1234 AA Amsterdam\n"
"1234 Amsterdam\n"
"5555 AD 's-Hertogenbosch\n"
"2244 's-Gravenzande")
print([m.groupdict() for m in re.finditer(pattern, s)])
Output
[{'addition': 'AA', 'city': '5523 ZB', 'digit': '1234'},
{'addition': 'AA', 'city': 'Amsterdam', 'digit': '1234'},
{'addition': None, 'city': 'Amsterdam', 'digit': '1234'},
{'addition': 'AD', 'city': "'s-Hertogenbosch", 'digit': '5555'},
{'addition': None, 'city': "'s-Gravenzande", 'digit': '2244'}]