I am making a streamlit app where a user can upload a csv or excel file and have some analysis done.
Of course, different users will name their column names differently.
I want the program to be smart enough to deduce what column names are likely to represent what the analysis requires.
For example, I would want the program to be able to read both of these dataframes:
df1 = pd.DataFrame({'id_number':[1,2,3], 'reason_code':['TH7','JK9','PI2'], 'reason_code_description':['A','B','C'], 'name':['karen','pluto','imogen']})
df2 = pd.DataFrame({'Number (ID)':[1,2,3], 'Reason of Code':['TH7','JK9','PI2'], 'Description of Reason Code': ['A','B','C'], 'Name of User':['karen','pluto','imogen']})
Such that the program would understand that a column name that contains both the words "ID" and "Number" is the id_number column; a column name that contains the words "reason" and "code" but not "description" is the reason_code column, etc...
I think to do this, the best option is to use str.contains to identify certain substrings, and then rename those specific column names to be what the remainder of the program requires.
Here is an example of what I've tried (doesn't work, but no error is thrown):
df2.columns[(df2.columns.str.contains("reason", case=False)) & (df2.columns.str.contains("code", case=False)) & (~df2.columns.str.contains("description", case=False))].rename("reason_code",inplace=True)
Thanks in advance
CodePudding user response:
Here is one approach, create a function to rename column names.
Code
import pandas as pd
def rename_colname(name):
"""
Define rename logic here.
"""
cname = name.lower()
if 'reason' in cname and 'code' in cname and not 'description' in cname:
return 'reason_code'
elif 'number' in cname and 'id' in cname:
return 'id_number'
elif 'reason' in cname and 'code' in cname and 'description' in cname:
return 'reason_code_description'
elif 'name' in cname and 'user' in cname:
return 'name'
return name
df1 = pd.DataFrame({'id_number':[1,2,3], 'reason_code':['TH7','JK9','PI2'],
'reason_code_description':['A','B','C'],
'name':['karen','pluto','imogen']})
df2 = pd.DataFrame({'Number (ID)':[1,2,3], 'Reason of Code':['TH7','JK9','PI2'],
'Description of Reason Code': ['A','B','C'],
'Name of User':['karen','pluto','imogen']})
print(f'ideal colname frame:\n{df1}')
print(f'old colname frame:\n{df2}')
df2.columns = [rename_colname(name) for name in df2.columns]
print(f'new colname frame:\n{df2}')
Output
ideal colname frame:
id_number reason_code reason_code_description name
0 1 TH7 A karen
1 2 JK9 B pluto
2 3 PI2 C imogen
old colname frame:
Number (ID) Reason of Code Description of Reason Code Name of User
0 1 TH7 A karen
1 2 JK9 B pluto
2 3 PI2 C imogen
new colname frame:
id_number reason_code reason_code_description name
0 1 TH7 A karen
1 2 JK9 B pluto
2 3 PI2 C imogen