Home > Back-end >  find the summation for multiple columns and keeping the column name
find the summation for multiple columns and keeping the column name

Time:12-29

Assume we have a dataframe df1 with 45 columns filled with float numbers. It has the following form(for simplicity I will use int numbers in this example)

df1

Jan 2023_a   Jan 2023_b  Jan 2023_c  Feb 2023_a Feb 2023_b Feb 2023_c  Mar 2023_a  Mar 2023_b Mar 2023_c  ...
1        2       1      3     2     5     1      2      0    ...
0        3       1      4     0     0     2      1      3

I want to create a new columns that add every previous 3 columns(in axis=1) and for a column name it keeps the Month but change the suffix. Like the following

df1

Jan 2023_a   Jan 2023_b  Jan 2023_c  Jan 2023_sum   Feb 2023_a Feb 2023_b Feb 2023_c Feb 2023_sum  Mar 2023_a  Mar 2023_b Mar 2023_c  Mar 2023_sum  ...
1        2       1      4          3     2     5     10       1      2      0     3      ...
0        3       6      9          4     0     0      4       2      1      3     6

So basically calculate the sum for each month and then place it after the corresponding columns. The tricky part is that the column names are dynamic. Meaning I cannot hardcode the column names since depending on the csv file I read it might start from January or it might start from June etc

EDIT: I updated the column names to also include the year

CodePudding user response:

totals = (df.groupby(lambda c: c.split("_")[0], axis="columns").sum()
            .add_suffix("_sum"))
new_df = df.join(totals)
  • group the frame over columns' values before "_"
  • sum the groups, and add the "_sum" suffix (e.g., to get "Feb_sum" etc.)
  • join it with the original frame

Example run:

In [135]: df
Out[135]:
   Jan_a  Jan_b  Jan_c  Feb_a  Feb_b  Feb_c  Mar_a  Mar_b  Mar_c
0      1      2      1      3      2      5      1      2      0
1      0      3      1      4      0      0      2      1      3

In [136]: totals
Out[136]:
   Feb_sum  Jan_sum  Mar_sum
0       10        4        3
1        4        4        6

In [137]: new_df
Out[137]:
   Jan_a  Jan_b  Jan_c  Feb_a  Feb_b  Feb_c  Mar_a  Mar_b  Mar_c  Feb_sum  Jan_sum  Mar_sum
0      1      2      1      3      2      5      1      2      0       10        4        3
1      0      3      1      4      0      0      2      1      3        4        4        6

We see that the *_sum columns place at the end; to bring them next to their root columns, we can sort the columns with a custom mapping:

months = new_df.columns.str.split("_").str[0]
mapper = {mon: idx for idx, mon in enumerate(months.unique())}
new_df = new_df.sort_index(axis="columns", key=lambda _: months.map(mapper), kind="stable")

to get

In [148]: months
Out[148]:
Index(['Jan', 'Jan', 'Jan', 'Feb', 'Feb', 'Feb', 'Mar', 'Mar', 'Mar', 'Feb',
       'Jan', 'Mar'],
      dtype='object')

In [149]: mapper
Out[149]: {'Jan': 0, 'Feb': 1, 'Mar': 2}

In [150]: new_df
Out[150]:
   Jan_a  Jan_b  Jan_c  Jan_sum  Feb_a  Feb_b  Feb_c  Feb_sum  Mar_a  Mar_b  Mar_c  Mar_sum
0      1      2      1        4      3      2      5       10      1      2      0        3
1      0      3      1        4      4      0      0        4      2      1      3        6

CodePudding user response:

If the months are always separated by an underscore, you can use the following code:

# Create an empty list to store the new columns
new_cols = []

# Iterate through the columns of the dataframe
for col in df1.columns:
  # Split the column name into month and suffix using the "_" character as the separator
  month, suffix = col.split("_")
  # If the suffix is "a", "b", or "c", add the column to the new_cols list
  if suffix in ["a", "b", "c"]:
    new_cols.append(col)
 
# Group the columns by month
df1_grouped = df1[new_cols].groupby(new_cols, axis=1)

# Apply the sum function to each group and assign the result to a new dataframe
df1_summed = df1_grouped.sum()

# Rename the columns of the new dataframe to include the "sum" suffix
df1_summed.columns = [f"{month}_sum" for month, _ in df1_grouped.groups.keys()]

# Concatenate the original dataframe with the new dataframe
df1 = pd.concat([df1, df1_summed], axis=1)
  • Related