I am trying to update DataFrame column names if a substring exists within the column name, however I only need to update the substring, retaining all information either side of it.
Example:
import pandas as pd
lookup = [('abc_param_one_new', 'param_one'),
('abc_param_two_new', 'param_two'),
('abc_param_three_new', 'param_three')]
d = {'p1_abc_param_ONE_new1': [1,2],
'p2_abc_param_one_new2': [45,3],
'p3_abc_Param_One_new3': [76,5],
'p4_abc_param_two_new1': [4321,6],
'p5_abc_param_Two_new2': [3,2],
'p6_abc_param_THREE_new1': [6,5]}
df = pd.DataFrame(d)
# make all lowercase
df.columns = [x.lower() for x in df.columns]
I am trying to use the lookup var (this could be a different format though) to search and replace the substring within each column name.
e.g.
'p1_abc_param_ONE_new1' -> 'p1_param_one1'
I can do this primitively looping over the lookup and the checking for each column header, but can't manage to replace the substring.
Additionally it feels that the nested loop approach is not very efficient. Is there a more pythonic way to do this perhaps using built in pandas methods and/or list comprehensions?
CodePudding user response:
You can use a regex and str.replace
:
dic = dict(lookup)
pat = '|'.join(dic)
df.columns = df.columns.str.replace(pat, lambda x: dic.get(x.group(0)), regex=True)
output (df.columns
):
Index(['p1_param_one1', 'p2_param_one2', 'p3_param_one3', 'p4_param_two1',
'p5_param_two2', 'p6_param_three1'],
dtype='object')
CodePudding user response:
You can use series.replace
with replacement dictionary
repl = {fr'(?i){k}': v for k, v in lookup}
df.columns = df.columns.to_series().replace(repl, regex=True)
p1_param_one1 p2_param_one2 p3_param_one3 p4_param_two1 p5_param_two2 p6_param_three1
0 1 45 76 4321 3 6
1 2 3 5 6 2 5