Home > other >  validate multiple rows with multiple regex from two pandas df
validate multiple rows with multiple regex from two pandas df

Time:01-13

I'm trying to validate, if zip code for many different countries which stored in a table are in a correct format, as an example:

ZIP COUNTRY_CODE
1033 SC NL
60593 DE

To do that I have a separate DF with country code and regular expressions as a pattern for every zip code.

REGEX COUNTRY
\d{4}[ ]?[A-Z]{2} NL
\d{5} DE

I'm trying to merge this to tables based on a coutry code and then create which indicates as True or False if the zip code based on regex is correct.

Here is my currecnt code:

df_merged = pd.merge(regex_df, zip_df,  left_on = 'CODE',   right_on= 'COUNTRY_CODE')
df_merged['zip_correct'] = df_mergedf.CODE_y.str.contains(df_merged.REGEX.str, regex= True, na=False)

Hovewer I'm etting only false results since pandas is cheking the regex pattern in every row. How could I limit it to check it row by row?

Expected output:

ZIP COUNTRY_CODE ZIP_CORRECT
1033 SC NL TRUE
60593 DE TRUE
6059TT DE FALSE

Could you please help?

CodePudding user response:

I assume you might have many countries. You can apply re.match with the matching country regex on each row:

import re

s = regex_df.set_index('COUNTRY')['REGEX']

df['ZIP_CORRECT'] = df.apply(lambda x: bool(re.match('%s$' % s[x['COUNTRY_CODE']], x['ZIP'])), axis=1)

output:

       ZIP COUNTRY_CODE  ZIP_CORRECT
0  1033 SC           NL         True
1    60593           DE         True
2   6059TT           DE        False

CodePudding user response:

You can use

import pandas as pd
import re

zip_df = pd.DataFrame({"ZIP":["1033 SC", "60593"], "COUNTRY_CODE": ["NL","DE"]})
regex_df = pd.DataFrame({"REGEX":[r"\d{4}[ ]?[A-Z]{2}", "\d{5}"], "COUNTRY": ["NL","DE"]})

d = dict(regex_df[['COUNTRY', 'REGEX']].to_numpy())
zip_df['ZIP_CORRECT'] = zip_df.apply(lambda x: bool(re.fullmatch(d[x['COUNTRY_CODE']], x['ZIP'])), axis=1)

The d dictionary will look like {'NL': '\\d{4}[ ]?[A-Z]{2}', 'DE': '\\d{5}'}, and within the .apply(), each ZIP value from the zip_df dataframe is checked against the appropriate regex pattern from the dictionary as it is obtained form the d dictionary using the zip_df COUNTRY_CODE value as a key.

Note that re.fullmatch only returns True if the regex pattern fully matches the string, i.e. matches the entire string.

Test in Pandas (copy/pasted):

>>> import pandas as pd
>>> import re
>>> zip_df = pd.DataFrame({"ZIP":["1033 SC", "60593", "6059TT"], "COUNTRY_CODE": ["NL","DE","DE"]})
>>> regex_df = pd.DataFrame({"REGEX":[r"\d{4}[ ]?[A-Z]{2}", "\d{5}"], "COUNTRY": ["NL","DE"]})
>>> d = dict(regex_df[['COUNTRY', 'REGEX']].to_numpy())
>>> zip_df['ZIP_CORRECT'] = zip_df.apply(lambda x: bool(re.fullmatch(d[x['COUNTRY_CODE']], x['ZIP'])), axis=1)
>>> zip_df
       ZIP COUNTRY_CODE  ZIP_CORRECT
0  1033 SC           NL         True
1    60593           DE         True
2   6059TT           DE        False
>>> 
  •  Tags:  
  • Related