Assume we have a dataframe df1 with 45 columns filled with float numbers. It has the following form(for simplicity I will use int numbers in this example)
df1
Jan 2023_a Jan 2023_b Jan 2023_c Feb 2023_a Feb 2023_b Feb 2023_c Mar 2023_a Mar 2023_b Mar 2023_c ...
1 2 1 3 2 5 1 2 0 ...
0 3 1 4 0 0 2 1 3
I want to create a new columns that add every previous 3 columns(in axis=1) and for a column name it keeps the Month but change the suffix. Like the following
df1
Jan 2023_a Jan 2023_b Jan 2023_c Jan 2023_sum Feb 2023_a Feb 2023_b Feb 2023_c Feb 2023_sum Mar 2023_a Mar 2023_b Mar 2023_c Mar 2023_sum ...
1 2 1 4 3 2 5 10 1 2 0 3 ...
0 3 6 9 4 0 0 4 2 1 3 6
So basically calculate the sum for each month and then place it after the corresponding columns. The tricky part is that the column names are dynamic. Meaning I cannot hardcode the column names since depending on the csv file I read it might start from January or it might start from June etc
EDIT: I updated the column names to also include the year
CodePudding user response:
totals = (df.groupby(lambda c: c.split("_")[0], axis="columns").sum()
.add_suffix("_sum"))
new_df = df.join(totals)
- group the frame over columns' values before "_"
- sum the groups, and add the "_sum" suffix (e.g., to get "Feb_sum" etc.)
- join it with the original frame
Example run:
In [135]: df
Out[135]:
Jan_a Jan_b Jan_c Feb_a Feb_b Feb_c Mar_a Mar_b Mar_c
0 1 2 1 3 2 5 1 2 0
1 0 3 1 4 0 0 2 1 3
In [136]: totals
Out[136]:
Feb_sum Jan_sum Mar_sum
0 10 4 3
1 4 4 6
In [137]: new_df
Out[137]:
Jan_a Jan_b Jan_c Feb_a Feb_b Feb_c Mar_a Mar_b Mar_c Feb_sum Jan_sum Mar_sum
0 1 2 1 3 2 5 1 2 0 10 4 3
1 0 3 1 4 0 0 2 1 3 4 4 6
We see that the *_sum columns place at the end; to bring them next to their root columns, we can sort the columns with a custom mapping:
months = new_df.columns.str.split("_").str[0]
mapper = {mon: idx for idx, mon in enumerate(months.unique())}
new_df = new_df.sort_index(axis="columns", key=lambda _: months.map(mapper), kind="stable")
to get
In [148]: months
Out[148]:
Index(['Jan', 'Jan', 'Jan', 'Feb', 'Feb', 'Feb', 'Mar', 'Mar', 'Mar', 'Feb',
'Jan', 'Mar'],
dtype='object')
In [149]: mapper
Out[149]: {'Jan': 0, 'Feb': 1, 'Mar': 2}
In [150]: new_df
Out[150]:
Jan_a Jan_b Jan_c Jan_sum Feb_a Feb_b Feb_c Feb_sum Mar_a Mar_b Mar_c Mar_sum
0 1 2 1 4 3 2 5 10 1 2 0 3
1 0 3 1 4 4 0 0 4 2 1 3 6
CodePudding user response:
If the months are always separated by an underscore, you can use the following code:
# Create an empty list to store the new columns
new_cols = []
# Iterate through the columns of the dataframe
for col in df1.columns:
# Split the column name into month and suffix using the "_" character as the separator
month, suffix = col.split("_")
# If the suffix is "a", "b", or "c", add the column to the new_cols list
if suffix in ["a", "b", "c"]:
new_cols.append(col)
# Group the columns by month
df1_grouped = df1[new_cols].groupby(new_cols, axis=1)
# Apply the sum function to each group and assign the result to a new dataframe
df1_summed = df1_grouped.sum()
# Rename the columns of the new dataframe to include the "sum" suffix
df1_summed.columns = [f"{month}_sum" for month, _ in df1_grouped.groups.keys()]
# Concatenate the original dataframe with the new dataframe
df1 = pd.concat([df1, df1_summed], axis=1)