Home > Net >  group and sum columns that have the same name in different cases with pandas
group and sum columns that have the same name in different cases with pandas

Time:12-01

I have a pandas dataframe with lots of columns that have what would be the same name, except the casing is not consistent. Some columns are in all caps and need to be summed with the appropriate column. How can I combine all the these columns with the same name, keeping the appropriate casing for the column name in the end?

I only want to change the case if there is another column name that is exactly the same string in all ways except for the casing. In this example I wouldn't want to change the case of "JFK", but I would want to combine the values of "CARL" with "Carl".

edit: I realized my first example table didn't have the any cases where there was a name in all caps that did not have a matching name in a different case so I added "JFK".

Example df:

Carl CARL Carl Smith David John JFK
1 3 7 4 2 9

Desired output:

Carl Carl Smith David John JFK
4 7 4 2 9

CodePudding user response:

You can groupby the columns:

import string
reduced = [string.capwords(x) for x in df]
df.groupby(reduced, axis=1).sum()

In one line:

import string
df.groupby(df.columns.map(string.capwords), axis=1).sum()

Output:

   Carl  Carl Smith  David  John
0     4           7      4     2

CodePudding user response:

You could lower case the column names (or uppercase) and then group them together:

# lower all col names for coherence
df.columns = [x.lower() for x in df.columns]
# group columns with same name
df.groupby(level=0, axis=1).sum()

CodePudding user response:

Most of the other answers here don't take into account that there are some names in all-caps that I don't want to change the case of, such as "JFK". This clarification was a later edit so it must've slipped through the cracks.

I only want to change the column name if it is a duplicate after changing all column names to lowercase e.g. "CARL" would become a duplicate with "Carl" after both were lowercased but "David", "John", and "JFK" would not be duplicates even after lowercasing.

My Solution:

cols = pd.Series(df.columns)                                                     # "Carl", "CARL", "David", "John", "JFK"
cols_lower = cols.str.lower()                                                    # "carl", "carl", "david", "john", "jfk"
cols_duplicates = cols_lower.drop(cols_lower.drop_duplicates().index)            # "carl"

names_to_group = df.loc[:, cols_lower.isin(cols_duplicates).to_numpy()].columns  # "Carl", "CARL"
new_names = names_to_group.map(string.capwords)                                  # "Carl", "Carl"

name_dict = dict(zip(names_to_group, new_names))                                 # {"Carl": "Carl", "CARL": "Carl"}  
df = df.rename(columns = name_dict)                                              
df = df.groupby(df.columns, axis=1).sum(1)
df

# Output:
   Carl  Carl Smith  David  John  JFK
0     4           7      4     2    9

  • Related