I'm trying to validate, if zip code for many different countries which stored in a table are in a correct format, as an example:
ZIP | COUNTRY_CODE |
---|---|
1033 SC | NL |
60593 | DE |
To do that I have a separate DF with country code and regular expressions as a pattern for every zip code.
REGEX | COUNTRY |
---|---|
\d{4}[ ]?[A-Z]{2} | NL |
\d{5} | DE |
I'm trying to merge this to tables based on a coutry code and then create which indicates as True or False if the zip code based on regex is correct.
Here is my currecnt code:
df_merged = pd.merge(regex_df, zip_df, left_on = 'CODE', right_on= 'COUNTRY_CODE')
df_merged['zip_correct'] = df_mergedf.CODE_y.str.contains(df_merged.REGEX.str, regex= True, na=False)
Hovewer I'm etting only false results since pandas is cheking the regex pattern in every row. How could I limit it to check it row by row?
Expected output:
ZIP | COUNTRY_CODE | ZIP_CORRECT |
---|---|---|
1033 SC | NL | TRUE |
60593 | DE | TRUE |
6059TT | DE | FALSE |
Could you please help?
CodePudding user response:
I assume you might have many countries. You can apply
re.match
with the matching country regex on each row:
import re
s = regex_df.set_index('COUNTRY')['REGEX']
df['ZIP_CORRECT'] = df.apply(lambda x: bool(re.match('%s$' % s[x['COUNTRY_CODE']], x['ZIP'])), axis=1)
output:
ZIP COUNTRY_CODE ZIP_CORRECT
0 1033 SC NL True
1 60593 DE True
2 6059TT DE False
CodePudding user response:
You can use
import pandas as pd
import re
zip_df = pd.DataFrame({"ZIP":["1033 SC", "60593"], "COUNTRY_CODE": ["NL","DE"]})
regex_df = pd.DataFrame({"REGEX":[r"\d{4}[ ]?[A-Z]{2}", "\d{5}"], "COUNTRY": ["NL","DE"]})
d = dict(regex_df[['COUNTRY', 'REGEX']].to_numpy())
zip_df['ZIP_CORRECT'] = zip_df.apply(lambda x: bool(re.fullmatch(d[x['COUNTRY_CODE']], x['ZIP'])), axis=1)
The d
dictionary will look like {'NL': '\\d{4}[ ]?[A-Z]{2}', 'DE': '\\d{5}'}
, and within the .apply()
, each ZIP
value from the zip_df
dataframe is checked against the appropriate regex pattern from the dictionary as it is obtained form the d
dictionary using the zip_df
COUNTRY_CODE
value as a key.
Note that re.fullmatch
only returns True if the regex pattern fully matches the string, i.e. matches the entire string.
Test in Pandas (copy/pasted):
>>> import pandas as pd
>>> import re
>>> zip_df = pd.DataFrame({"ZIP":["1033 SC", "60593", "6059TT"], "COUNTRY_CODE": ["NL","DE","DE"]})
>>> regex_df = pd.DataFrame({"REGEX":[r"\d{4}[ ]?[A-Z]{2}", "\d{5}"], "COUNTRY": ["NL","DE"]})
>>> d = dict(regex_df[['COUNTRY', 'REGEX']].to_numpy())
>>> zip_df['ZIP_CORRECT'] = zip_df.apply(lambda x: bool(re.fullmatch(d[x['COUNTRY_CODE']], x['ZIP'])), axis=1)
>>> zip_df
ZIP COUNTRY_CODE ZIP_CORRECT
0 1033 SC NL True
1 60593 DE True
2 6059TT DE False
>>>