Remap values in a Pandas column based on dictionary key/value pairs using RegEx in replace() functio-CodePudding

I have the following Pandas dataframe:

foo = {
    "first_name" : ["John", "Sally", "Mark", "Jane", "Phil"],
    "last_name" : ["O'Connor", "Jones P.", "Williams", "Connors", "Lee"],
    "salary" : [101000, 50000, 56943, 330532, 92750],
}

df = pd.DataFrame(foo)

I'd like to be able to validate column data using a RegEx pattern, then replace with NaN if the validation fails.

To do this, I use the following hard-coded RegEx patterns in the .replace() method:

df[['first_name']] = df[['last_name']].replace('[^A-Za-z \/\-\.\']', np.NaN, regex=True)
df[['last_name']] = df[['last_name']].replace('[^A-Za-z \/\-\.\']', np.NaN, regex=True)
df[['salary']] = df[['salary']].replace('[^0-9 ]', np.NaN, regex=True)

This approach works. But, I have 15-20 columns. So, this approach is going to be difficult to maintain.

I'd like to set up a dictionary that looks as follows:

regex_patterns = {
    'last_name' : '[^A-Za-z \/\-\.\']',
    'first_name' : '[^A-Za-z \/\-\.\']',
    'salary' : '[^0-9 ]'
}

Then, I'd like to pass a value to the .replace() function based on the name of the column in the df. It would look as follows:

df[['first_name']] = df[['last_name']].replace('<reference_to_regex_patterns_dictionary>', np.NaN, regex=True)
df[['last_name']] = df[['last_name']].replace('<reference_to_regex_patterns_dictionary>', np.NaN, regex=True)
df[['salary']] = df[['salary']].replace('<reference_to_regex_patterns_dictionary>', np.NaN, regex=True)

How would I reference the name of the df column, then use that to look up the key in the dictionary and get its associated value?

For example, look up first_name, then access its dictionary value [^A-Za-z \/\-\.\'] and pass this value into .replace()?

Thanks!

P.S. if there is a more elegant approach, I'm all ears.

CodePudding user response：

One approach would be using columns attribute:

regex_patterns = {
    'last_name' : '[^A-Za-z \/\-\.\']',
    'first_name' : '[^A-Za-z \/\-\.\']',
    'salary' : '[^0-9 ]'
}
for column in df.columns:
  df[column] = df[[column]].replace(regex_pattern[column], np.NaN, regex=True)

CodePudding user response：

You can actually pass a nested dictionary of the form {'col': {'match': 'replacement'}} to replace

In your case:

d = {k:{v:np.nan} for k,v in regex_patterns.items()}

df.replace(d, regex=True)