am having pandas dataframe, it has 7 columns customer_id, user_id, year_month, values, 01,02,03 i have to multiply & add each row based on group by customer_id, user_id considering month from year_month column
Input Dataframe
df
###
customer_id user_id year_month values 01 02 03
1 2285 1 2020-01 1000 45 81 0
2 2285 1 2020-02 2000 18 18 05
3 2285 1 2020-03 6000 06 18 0
4 2285 2 2020-01 1800 45 81 0
5 2285 2 2020-02 2700 18 18 05
6 2285 2 2020-03 3600 06 18 0
7 2285 1 2019-01 6300 45 81 0
8 2285 1 2019-02 7200 18 18 05
9 2285 1 2019-03 8100 06 18 0
10 2285 1 2021-01 7272 45 81 0
11 2285 1 2021-02 6366 18 18 05
12 2285 1 2021-03 5544 06 18 0
Expected Output Dataframe
df
customer_id user_id date_month volume output_value
1 2285 1 2020-01 1000 207000
2 2285 1 2020-02 2000 84000
3 2285 1 2020-03 6000 42000
4 2285 2 2020-01 1800 207000
5 2285 2 2020-02 2700 84000
6 2285 2 2020-03 3600 42000
sample calculation should be done based on each month from date_month column for id -1 --> calculation will be
01 (1000 *45 2000*81 6000*0) =207000
02 (1000 *18 2000*18 6000*05) = 84000
03 (1000 *06 2000*18 6000*0) = 42000
i have tried below code to achieve the output_value column but am facing errors with
cols = pd.Series(["jan", "feb", "mar"])
# Pivot the "volume" column so it lines up with the "jan", "feb", "mar" columns
volumes = (
df.assign(month=df["date_month"].str[-2:])
.pivot(index="id", columns="month", values="volume")
.set_axis(cols, axis=1)
)
# Line up the 2 frames
tmp = pd.concat(
[df.set_index("id")[cols], volumes], axis=1, keys=["value", "volume"]
)
# Calculation
df["output"] = (tmp["value"] * tmp["volume"]).sum(axis=1).to_numpy()
# ValueError: cannot handle a non-unique multi-index!
CodePudding user response:
No so straightforward, there are 3 layers to achieve (pivoting, multiplying with MultiIndex, merging).
You can use:
df2 = df['year_month'].str.split('-', expand=True)
df['year'] = df2[0]
out = df.merge(
df.set_index(['customer_id', 'user_id', 'year', 'year_month'])
.mul(df.assign(month=df2[1])
.pivot(index=['customer_id', 'user_id', 'year'], columns='month', values='values')
)
.sum(1)
.rename('output_value')
.reset_index(),
how='left', on=['customer_id', 'user_id', 'year', 'year_month']
)
output:
customer_id user_id year_month values 01 02 03 year output_value
0 2285 1 2020-01 1000 45 81 0 2020 207000.0
1 2285 1 2020-02 2000 18 18 5 2020 84000.0
2 2285 1 2020-03 6000 6 18 0 2020 42000.0
3 2285 2 2020-01 1800 45 81 0 2020 299700.0
4 2285 2 2020-02 2700 18 18 5 2020 99000.0
5 2285 2 2020-03 3600 6 18 0 2020 59400.0
6 2285 1 2019-01 6300 45 81 0 2019 866700.0
7 2285 1 2019-02 7200 18 18 5 2019 283500.0
8 2285 1 2019-03 8100 6 18 0 2019 167400.0
9 2285 1 2021-01 7272 45 81 0 2021 842886.0
10 2285 1 2021-02 6366 18 18 5 2021 273204.0
11 2285 1 2021-03 5544 6 18 0 2021 158220.0
alternative format with jan/feb/mar
as column names
The logic is the same, you just need to ensure the mapping is correct
s = pd.to_datetime(df['year_month'])
df['year'] = s.dt.year
print(df.merge(
df.set_index(['customer_id', 'user_id', 'year', 'year_month'])
.mul(df.assign(month=s.dt.strftime('%b').str.lower())
.pivot(index=['customer_id', 'user_id', 'year'], columns='month', values='values')
)
.sum(1)
.rename('output_value')
.reset_index(),
how='left', on=['customer_id', 'user_id', 'year', 'year_month']
))
output:
customer_id user_id year_month values jan feb mar year output_value
0 2285 1 2020-01 1000 45 81 0 2020 207000.0
1 2285 1 2020-02 2000 18 18 5 2020 84000.0
2 2285 1 2020-03 6000 6 18 0 2020 42000.0
3 2285 2 2020-01 1800 45 81 0 2020 299700.0
4 2285 2 2020-02 2700 18 18 5 2020 99000.0
5 2285 2 2020-03 3600 6 18 0 2020 59400.0
6 2285 1 2019-01 6300 45 81 0 2019 866700.0
7 2285 1 2019-02 7200 18 18 5 2019 283500.0
8 2285 1 2019-03 8100 6 18 0 2019 167400.0
9 2285 1 2021-01 7272 45 81 0 2021 842886.0
10 2285 1 2021-02 6366 18 18 5 2021 273204.0
11 2285 1 2021-03 5544 6 18 0 2021 158220.0
CodePudding user response:
You are setting the customer_id as the index, but that one is not unique, pandas cannot handle non-unique multi-Indices. If you flatten the dataframe you will get rid of the multi-index and create a unique index at the same time. If you change the code as follows it should run:
tmp = pd.concat(
[df.set_index("id")[cols], volumes], axis=1, keys=["value", "volume"]
).reset_index(drop=True)