Some string columns in my DataFrame need to be normalized so that they use the delimiter '|'. For example, a value 'a/b/c' in the 'name' column should become 'a|b|c', and 'M/F' in the 'sex' column should become 'M|F'.
import re

columns_to_be_normalized = ['name', 'sex']
delimiters = ['/', ';', ',']

def normalize(column_text):
    # Replace every known delimiter with '|'
    for delimiter in delimiters:
        column_text = re.sub(re.escape(delimiter), '|', column_text)
    return column_text

for column in columns_to_be_normalized:
    df[column] = df[column].apply(normalize)
My question is: how do I pass the variable delimiters into the normalize function so that I can use it in the regex? I have to pass it as an argument because the delimiters can change depending on some conditions.
CodePudding user response:
Define normalize with a keyword parameter:
def normalize(column_text, delimiters=None):
    if delimiters is None:
        delimiters = ['/']  # define the default here
    for delimiter in delimiters:
        # escape the delimiter so it is matched literally, then keep the result
        column_text = re.sub(re.escape(delimiter), '|', column_text)
    return column_text
Then use:
df[column] = df[column].apply(normalize, delimiters=['/', ';', ','])
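Series.apply forwards extra keyword arguments to the function it calls, so each cell ends up in a call like the ones below. This is only a minimal sketch that uses plain strings instead of a DataFrame; the sample text 'a/b;c' and the default of ['/'] are assumptions chosen for illustration.

```python
import re

def normalize(column_text, delimiters=None):
    """Replace each delimiter found in column_text with '|'."""
    if delimiters is None:
        delimiters = ['/']  # assumed default, as in the answer above
    for delimiter in delimiters:
        # re.escape keeps special characters such as '.' literal
        column_text = re.sub(re.escape(delimiter), '|', column_text)
    return column_text

# Same text, different delimiter sets chosen at call time
default_result = normalize('a/b;c')
custom_result = normalize('a/b;c', delimiters=['/', ';'])
print(default_result)  # a|b;c
print(custom_result)   # a|b|c
```

Because the delimiters are an ordinary argument, whatever condition decides them can build the list first and pass it at the call site.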
Note that you don't need apply per item, though. You can use pandas' str.replace directly, which takes care of the loop for you:
import re
delimiters = ['/', ';', ',']
regex = '|'.join(map(re.escape, delimiters))
df[columns_to_be_normalized] = (
    df[columns_to_be_normalized]
    .apply(lambda s: s.str.replace(regex, '|', regex=True))
)
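The re.escape step matters as soon as a delimiter is also a regex metacharacter. The sketch below shows the difference with plain re.sub; the extra '.' delimiter and the sample text are assumptions added purely for illustration.

```python
import re

delimiters = ['/', ';', ',', '.']  # '.' added here purely for illustration

# Unescaped, '.' acts as a wildcard and matches every character
unescaped = '|'.join(delimiters)
# Escaped, each delimiter only matches itself
escaped = '|'.join(map(re.escape, delimiters))

text = 'a/b.c'
unescaped_result = re.sub(unescaped, '|', text)
escaped_result = re.sub(escaped, '|', text)
print(unescaped_result)  # |||||
print(escaped_result)    # a|b|c
```

The same escaped pattern is what gets handed to str.replace with regex=True in the pandas version.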