Changing column names based on several possibilities of what the original column names could be (pan-CodePudding

I am making a streamlit app where a user can upload a csv or excel file and have some analysis done.

Of course, different users will name their column names differently.

I want the program to be smart enough to deduce what column names are likely to represent what the analysis requires.

For example, I would want the program to be able to read both of these dataframes:

df1 = pd.DataFrame({'id_number':[1,2,3], 'reason_code':['TH7','JK9','PI2'], 'reason_code_description':['A','B','C'], 'name':['karen','pluto','imogen']})

df2 = pd.DataFrame({'Number (ID)':[1,2,3], 'Reason of Code':['TH7','JK9','PI2'], 'Description of Reason Code': ['A','B','C'], 'Name of User':['karen','pluto','imogen']})

Such that the program would understand that a column name that contains both the words "ID" and "Number" is the id_number column; a column name that contains the words "reason" and "code" but not "description" is the reason_code column, etc...

I think to do this, the best option is to use str.contains to identify certain substrings, and then rename those specific column names to be what the remainder of the program requires.

Here is an example of what I've tried (doesn't work, but no error is thrown):

df2.columns[(df2.columns.str.contains("reason", case=False)) & (df2.columns.str.contains("code", case=False)) & (~df2.columns.str.contains("description", case=False))].rename("reason_code",inplace=True)

Thanks in advance

CodePudding user response：

Here is one approach, create a function to rename column names.

Code

import pandas as pd


def rename_colname(name):
    """
    Define rename logic here.
    """
    cname = name.lower()
    if 'reason' in cname and 'code' in cname and not 'description' in cname:
        return 'reason_code'
    elif 'number' in cname and 'id' in cname:
        return 'id_number'
    elif 'reason' in cname and 'code' in cname and 'description' in cname:
        return 'reason_code_description'
    elif 'name' in cname and 'user' in cname:
        return 'name'

    return name


df1 = pd.DataFrame({'id_number':[1,2,3], 'reason_code':['TH7','JK9','PI2'],
                    'reason_code_description':['A','B','C'],
                    'name':['karen','pluto','imogen']})

df2 = pd.DataFrame({'Number (ID)':[1,2,3], 'Reason of Code':['TH7','JK9','PI2'],
                   'Description of Reason Code': ['A','B','C'],
                   'Name of User':['karen','pluto','imogen']})

print(f'ideal colname frame:\n{df1}')

print(f'old colname frame:\n{df2}')

df2.columns = [rename_colname(name) for name in df2.columns]
print(f'new colname frame:\n{df2}')

Output

ideal colname frame:
   id_number reason_code reason_code_description    name
0          1         TH7                       A   karen
1          2         JK9                       B   pluto
2          3         PI2                       C  imogen
old colname frame:
   Number (ID) Reason of Code Description of Reason Code Name of User
0            1            TH7                          A        karen
1            2            JK9                          B        pluto
2            3            PI2                          C       imogen
new colname frame:
   id_number reason_code reason_code_description    name
0          1         TH7                       A   karen
1          2         JK9                       B   pluto
2          3         PI2                       C  imogen