How to sum same columns (differentiated by suffix) in pandas?

I have a dataframe that looks like this:

total_customers     total_customer_2021-03-31  total_purchases    total_purchases_2021-03-31
1                   10                          4                  6
3                   14                          3                  2

Now, I want to sum up, row-wise, the columns that are the same except for the suffix. I.e., the expected output is:

total_customers      total_purchases   
11                   10                          
17                   5

I cannot do this manually because I have 100 column pairs, so I need an efficient way to do this. Also, the order of the columns is not predictable. What do you recommend? Thanks!

CodePudding user response：

We need an Index in which both columns of each pair share the same name; then we can groupby that Index and sum on axis=1:

cols = pd.Index(['total_customers', 'total_customers',
                 'total_purchases', 'total_purchases'])

result_df = df.groupby(cols, axis=1).sum()

With the shown example, we can str.replace an optional s, followed by an underscore, followed by the date format (four digits-two digits-two digits), with a single s. This pattern may need to be modified depending on the actual column names:

cols = df.columns.str.replace(r's?_\d{4}-\d{2}-\d{2}$', 's', regex=True)
result_df = df.groupby(cols, axis=1).sum()

result_df:

   total_customers  total_purchases
0               11               10
1               17                5
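
Since the pattern may need adjusting for real column names, it can help to check the normalized names before grouping. A minimal sketch, assuming every base column should end up with exactly one dated partner (the names normalized, counts and unpaired are illustrative, not from the answer above):

```python
import pandas as pd

columns = pd.Index(['total_customers', 'total_customer_2021-03-31',
                    'total_purchases', 'total_purchases_2021-03-31'])

# Same normalization as above: strip the date suffix, keep a single trailing "s"
normalized = columns.str.replace(r's?_\d{4}-\d{2}-\d{2}$', 's', regex=True)

# Count how many original columns map to each normalized name
counts = normalized.value_counts()

# Anything that is not part of a pair hints that the regex needs tweaking
unpaired = counts[counts != 2]
print(unpaired)
```

If unpaired is not empty, the listed names are the ones the pattern did not match up as expected.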

Setup and imports:

import pandas as pd

df = pd.DataFrame({
    'total_customers': [1, 3],
    'total_customer_2021-03-31': [10, 14],
    'total_purchases': [4, 3],
    'total_purchases_2021-03-31': [6, 2]
})
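
Note that groupby with axis=1 is deprecated as of pandas 2.1. A sketch of an equivalent that avoids it, assuming the same sample data, is to transpose, group the rows, and transpose back:

```python
import pandas as pd

df = pd.DataFrame({
    'total_customers': [1, 3],
    'total_customer_2021-03-31': [10, 14],
    'total_purchases': [4, 3],
    'total_purchases_2021-03-31': [6, 2]
})

# Normalize the column names so both columns of a pair share one name
cols = df.columns.str.replace(r's?_\d{4}-\d{2}-\d{2}$', 's', regex=True)

# Transpose so the columns become rows, group those rows, then transpose back
result_df = df.T.groupby(cols).sum().T
print(result_df)
```

Transposing copies the data, so on very wide or mixed-dtype frames this may be slower than the axis=1 form it replaces.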

CodePudding user response：

Assuming your dataframe is called df, you can add each pair of columns directly:

sum_customers = df["total_customers"] + df["total_customer_2021-03-31"]
sum_purchases = df["total_purchases"] + df["total_purchases_2021-03-31"]
# Pass the Series themselves (not f-strings) so the values stay numeric
data = {"total_customers": sum_customers, "total_purchases": sum_purchases}
df_total = pd.DataFrame(data=data)

That will give you the output you want.

CodePudding user response：

import pandas as pd

data = {"total_customers": [1, 3], "total_customer_2021-03-31": [10, 14], "total_purchases": [4, 3], "total_purchases_2021-03-31": [6, 2]}

df = pd.DataFrame(data=data)
final_df = pd.DataFrame()

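# The trailing "s*" means "zero or more s", so the dated "total_customer_..." column is matched too
final_df["total_customers"] = df.filter(regex='total_customers*').sum(axis=1)
final_df["total_purchases"] = df.filter(regex='total_purchases*').sum(axis=1)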

Output:

final_df

    total_customers   total_purchases
0   11                10
1   17                5
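
With 100 pairs, writing one filter line per pair is impractical. A rough sketch of how the same filter idea could be looped, assuming the base columns are exactly those without a date suffix and that a dated partner may lack the trailing "s" (date_suffix, base_names and stem are illustrative names):

```python
import re
import pandas as pd

df = pd.DataFrame({
    "total_customers": [1, 3],
    "total_customer_2021-03-31": [10, 14],
    "total_purchases": [4, 3],
    "total_purchases_2021-03-31": [6, 2],
})

# Assumed suffix format: underscore followed by YYYY-MM-DD at the end of the name
date_suffix = re.compile(r'_\d{4}-\d{2}-\d{2}$')

# Base columns are the ones that do not carry the date suffix
base_names = [c for c in df.columns if not date_suffix.search(c)]

sums = {}
for base in base_names:
    # Assumption: dropping trailing "s" gives a prefix shared by both columns
    stem = base.rstrip('s')
    sums[base] = df.filter(regex='^' + re.escape(stem)).sum(axis=1)

final_df = pd.DataFrame(sums)
print(final_df)
```

This relies on no stem being a prefix of an unrelated column name; if that can happen, the regex would need to be stricter.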

CodePudding user response：

Using @HenryEcker's sample data, and building off of the example in the docs, you can create a function and groupby on the column axis:

def get_column(column):
    if column.startswith('total_customer'):
        return 'total_customers'
    return 'total_purchases'

df.groupby(get_column, axis=1).sum()

   total_customers  total_purchases
0               11               10
1               17                5
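
The function above hard-codes two names. For many pairs, the grouping function could derive the name instead. A sketch, assuming the date-suffix pattern from the sample data; get_base_name is a hypothetical helper, and the frame is transposed so the function is applied to the column labels without relying on axis=1:

```python
import re
import pandas as pd

df = pd.DataFrame({
    'total_customers': [1, 3],
    'total_customer_2021-03-31': [10, 14],
    'total_purchases': [4, 3],
    'total_purchases_2021-03-31': [6, 2]
})

def get_base_name(column):
    # Replace an optional "s" plus "_YYYY-MM-DD" at the end with a single "s";
    # names without the suffix are returned unchanged
    return re.sub(r's?_\d{4}-\d{2}-\d{2}$', 's', column)

# groupby calls the function on each index label of the transposed frame
result = df.T.groupby(get_base_name).sum().T
print(result)
```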

CodePudding user response：

I shortened the column names while coding, just FYI.

import pandas as pd

data = {"total_c": [1, 3], "total_c_2021": [10, 14],
        "total_p": [4, 3], "total_p_2021": [6, 2]}

df = pd.DataFrame(data)
df["total_customers"] = df["total_c"] + df["total_c_2021"]
df["total_purchases"] = df["total_p"] + df["total_p_2021"]

If you don't want to see the other columns, you can drop them:

df = df.loc[:, ['total_customers', 'total_purchases']]

NEW PART: I might have found a starting point for your solution. I don't know your column names, but the following code can be adapted if your column names follow a pattern (patterned dates, names, etc.). Can you change the column names with a loop?

df['total_customer'] = df[[col for col in df.columns if col.startswith('total_c')]].sum(axis=1)

With some alterations, this solution might be helpful for you.
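
One way the loop idea could look, as a sketch only: it uses the shortened column names from this answer and assumes the suffix is an underscore plus a four-digit year (groups and summed are illustrative names).

```python
import re
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({"total_c": [1, 3], "total_c_2021": [10, 14],
                   "total_p": [4, 3], "total_p_2021": [6, 2]})

# Map each base name to the list of columns that belong to it
groups = defaultdict(list)
for col in df.columns:
    base = re.sub(r'_\d{4}$', '', col)  # assumed suffix: "_" + 4-digit year
    groups[base].append(col)

# Sum each group of columns row-wise
summed = pd.DataFrame({base: df[cols].sum(axis=1) for base, cols in groups.items()})
print(summed)
```

Because the groups are built from the column names themselves, the order of the columns in df does not matter.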