I have the following Pandas dataframe:
foo = {
"first_name" : ["John", "Sally", "Mark", "Jane", "Phil"],
"last_name" : ["O'Connor", "Jones P.", "Williams", "Connors", "Lee"],
"salary" : [101000, 50000, 56943, 330532, 92750],
}
df = pd.DataFrame(foo)
I'd like to be able to validate column data using a RegEx pattern, then replace with NaN
if the validation fails.
To do this, I use the following hard-coded RegEx patterns in the .replace()
method:
df[['first_name']] = df[['last_name']].replace('[^A-Za-z \/\-\.\']', np.NaN, regex=True)
df[['last_name']] = df[['last_name']].replace('[^A-Za-z \/\-\.\']', np.NaN, regex=True)
df[['salary']] = df[['salary']].replace('[^0-9 ]', np.NaN, regex=True)
This approach works. But, I have 15-20 columns. So, this approach is going to be difficult to maintain.
I'd like to set up a dictionary that looks as follows:
regex_patterns = {
'last_name' : '[^A-Za-z \/\-\.\']',
'first_name' : '[^A-Za-z \/\-\.\']',
'salary' : '[^0-9 ]'
}
Then, I'd like to pass a value to the .replace()
function based on the name of the column in the df. It would look as follows:
df[['first_name']] = df[['last_name']].replace('<reference_to_regex_patterns_dictionary>', np.NaN, regex=True)
df[['last_name']] = df[['last_name']].replace('<reference_to_regex_patterns_dictionary>', np.NaN, regex=True)
df[['salary']] = df[['salary']].replace('<reference_to_regex_patterns_dictionary>', np.NaN, regex=True)
How would I reference the name of the df
column, then use that to look up the key in the dictionary and get its associated value?
For example, look up first_name
, then access its dictionary value [^A-Za-z \/\-\.\']
and pass this value into .replace()
?
Thanks!
P.S. if there is a more elegant approach, I'm all ears.
CodePudding user response:
One approach would be using columns
attribute:
regex_patterns = {
'last_name' : '[^A-Za-z \/\-\.\']',
'first_name' : '[^A-Za-z \/\-\.\']',
'salary' : '[^0-9 ]'
}
for column in df.columns:
df[column] = df[[column]].replace(regex_pattern[column], np.NaN, regex=True)
CodePudding user response:
You can actually pass a nested dictionary of the form {'col': {'match': 'replacement'}}
to replace
In your case:
d = {k:{v:np.nan} for k,v in regex_patterns.items()}
df.replace(d, regex=True)