Groupby several columns, summing them up based on the presence of a sub-string-CodePudding

Context: I'm trying to sum all values based in a list only if they start with or contain a string

So with a config file like this:

{
        'exclude_granularity':True,
        'granularity_suffix_list':['A','B']
}

And a dataframe like this:

tt = pd.DataFrame({'A_2':[1,2,3],'A_3':[3,4,2],'B_4':[5,2,1],'B_1':[8,2,1],'C_3':[2,4,2})

How can I group by if they all start by a given substring present on the granularity_suffix_list?

Desired output:

   A   B   C_3
0  4   13   2
1  6   4    4
2  5   2    2

Attempts: I was trying this:

if exclude_granularity == True:
    def correct_categories(cols):
        return [cat if col.startswith(cat) else col for col in cols for cat in granularity_suffix_list]
    df= df.groupby(correct_categories(df.columns),axis=1).sum()

But It doesn't work. Instead, the function returns a list like ['A_2','A','A_3','A',B_4','B'...]

Thank you

CodePudding user response：

Okay finally managed to solve what I wanted

Posting the solution if anyone finds it relevant

tt = pd.DataFrame({'A_2':[1,2,3],'A_3':[3,4,2],'B_4':[5,2,1],'B_1':[8,2,1],'C_3':[2,4,2]})

granularity_suffix_list = ['A','B']

def correct_categories(cols_to_aggregate):
    lst = []
    for _, column in enumerate(cols_to_aggregate):
        if not column.startswith(tuple(granularity_suffix_list)):
            lst.append(column)
        else:
            lst.append(granularity_suffix_list[
                [i for i, w in enumerate(granularity_suffix_list) if column.startswith(w)][0]
            ])
    return lst

df = tt.groupby(correct_categories(tt.columns),axis=1).sum()

CodePudding user response：

You could write that a bit more compact:

def grouper(c):
    for suffix in granularity_suffix_list:
        if c.startswith(suffix):
            return suffix
    return c

df = tt.groupby(grouper, axis=1).sum()

Or if you're not opposed to using re:

import re

re_suffix = re.compile("|".join(map(re.escape, granularity_suffix_list)))
def grouper(c):
    return m[0] if (m := re_suffix.match(c)) else c

df = tt.groupby(grouper, axis=1).sum()

Another option would be:

pat = f"^({'|'.join(granularity_suffix_list)})"
suffixes = tt.columns.str.extract(pat, expand=False)
df = tt.groupby(suffixes.where(suffixes.notna(), tt.columns), axis=1).sum()