Suppose I have a dataframe like such:
import pandas as pd
import numpy as np
data = [[5123, '2021-01-01 00:00:00', 'cash','sales$', 105],
[5123, '2021-01-01 00:00:00', 'cash','items', 20],
[5123, '2021-01-01 00:00:00', 'card','sales$', 355],
[5123, '2021-01-01 00:00:00', 'card','items', 50],
[5123, '2021-01-02 00:00:00', 'cash','sales$', np.nan],
[5123, '2021-01-02 00:00:00', 'cash','items', np.nan],
[5123, '2021-01-02 00:00:00', 'card','sales$', 170],
[5123, '2021-01-02 00:00:00', 'card','items', 35]]
columns = ['Store', 'Date', 'Payment Method', 'Attribute', 'Value']
df = pd.DataFrame(data = data, columns = columns)
Store | Date | Payment Method | Attribute | Value |
---|---|---|---|---|
5123 | 2021-01-01 00:00:00 | cash | sales$ | 105 |
5123 | 2021-01-01 00:00:00 | cash | items | 20 |
5123 | 2021-01-01 00:00:00 | card | sales$ | 355 |
5123 | 2021-01-01 00:00:00 | card | items | 50 |
5123 | 2021-01-02 00:00:00 | cash | sales$ | NaN |
5123 | 2021-01-02 00:00:00 | cash | items | NaN |
5123 | 2021-01-02 00:00:00 | card | sales$ | 170 |
5123 | 2021-01-02 00:00:00 | card | items | 35 |
I would like to create a new attribute, called "average item price", which is generated by, for each Store/Date/Payment Method, dividing the sales$ by the items (e.g. for store 5123, 2021-01-01, cash, I would like to create a new row with an attribute called "average item price", with a value equal to 5.25).
I realize that I could pivot this data out, and have one column for sales, one column for items, and divide the two columns, then restack, but is there a better way to do this without having to pivot?
Store | Date | Payment Method | Attribute | Value |
---|---|---|---|---|
5123 | 2021-01-01 00:00:00 | cash | sales$ | 105 |
5123 | 2021-01-01 00:00:00 | cash | items | 20 |
5123 | 2021-01-01 00:00:00 | cash | average item price | 5.25 |
5123 | 2021-01-01 00:00:00 | card | sales$ | 355 |
5123 | 2021-01-01 00:00:00 | card | items | 50 |
5123 | 2021-01-01 00:00:00 | card | average item price | 7.10 |
5123 | 2021-01-02 00:00:00 | cash | sales$ | NaN |
5123 | 2021-01-02 00:00:00 | cash | items | NaN |
5123 | 2021-01-02 00:00:00 | cash | average item price | NaN |
5123 | 2021-01-02 00:00:00 | card | sales$ | 170 |
5123 | 2021-01-02 00:00:00 | card | items | 35 |
5123 | 2021-01-02 00:00:00 | card | average item price | 4.86 |
CodePudding user response:
You can use pivot_table
to get the sum of sales/items per group, then compute the average value and merge
with the original data:
s = (df.pivot_table(index=['Store', 'Date', 'Payment Method'],
columns='Attribute', values='Value', aggfunc='sum')
.assign(avg=lambda d: d['sales$']/d['items'])
['avg']
)
df.merge(s, left_on=['Store', 'Date', 'Payment Method'], right_index=True)
output:
Store Date Payment Method Attribute Value avg
0 5123 2021-01-01 00:00:00 cash sales$ 105.0 5.250000
1 5123 2021-01-01 00:00:00 cash items 20.0 5.250000
2 5123 2021-01-01 00:00:00 card sales$ 355.0 7.100000
3 5123 2021-01-01 00:00:00 card items 50.0 7.100000
4 5123 2021-01-02 00:00:00 cash sales$ NaN NaN
5 5123 2021-01-02 00:00:00 cash items NaN NaN
6 5123 2021-01-02 00:00:00 card sales$ 170.0 4.857143
7 5123 2021-01-02 00:00:00 card items 35.0 4.857143
concat
df2 = (df.pivot_table(index=['Store', 'Date', 'Payment Method'],
columns='Attribute', values='Value', aggfunc='sum')
.assign(Attribute='average item price',
Value=lambda d: d['sales$']/d['items'],
)
.reset_index()
)
(pd.concat([df,df2])
.sort_values(by=columns)
[columns]
)
Output:
Store Date Payment Method Attribute Value
0 5123 2021-01-01 00:00:00 card average item price 7.100000
3 5123 2021-01-01 00:00:00 card items 50.000000
2 5123 2021-01-01 00:00:00 card sales$ 355.000000
1 5123 2021-01-01 00:00:00 cash average item price 5.250000
1 5123 2021-01-01 00:00:00 cash items 20.000000
0 5123 2021-01-01 00:00:00 cash sales$
CodePudding user response:
pivot
and then append
after assigning the "Attribute" as needed:
pivoted = df.pivot(["Store", "Date", "Payment Method"], "Attribute", "Value")
output = (df.append(pivoted["sales$"].div(pivoted["items"])
.rename("Value").reset_index()
.assign(Attribute="average item price"), ignore_index=True)
.sort_values(["Store", "Date", "Payment Method"])
.reset_index(drop=True)
)
>>> output
Store Date Payment Method Attribute Value
0 5123 2021-01-01 00:00:00 card sales$ 355.000000
1 5123 2021-01-01 00:00:00 card items 50.000000
2 5123 2021-01-01 00:00:00 card average item price 7.100000
3 5123 2021-01-01 00:00:00 cash sales$ 105.000000
4 5123 2021-01-01 00:00:00 cash items 20.000000
5 5123 2021-01-01 00:00:00 cash average item price 5.250000
6 5123 2021-01-02 00:00:00 card sales$ 170.000000
7 5123 2021-01-02 00:00:00 card items 35.000000
8 5123 2021-01-02 00:00:00 card average item price 4.857143
9 5123 2021-01-02 00:00:00 cash sales$ NaN
10 5123 2021-01-02 00:00:00 cash items NaN
11 5123 2021-01-02 00:00:00 cash average item price NaN
CodePudding user response:
One option is to set the index, do the computation, and use categoricals to get a sorted output that matches yours:
cols = df.columns[:-1].tolist()
temp = df.set_index(cols)
# computation
summary = temp.xs('sales$', level='Attribute').div(temp.xs('items', level='Attribute'))
# add attribute to index, with a name:
summary = summary.set_index([['average item price'] * len(summary)],
append = True)
summary.index = summary.index.set_names('Attribute', level = -1)
output = pd.concat([temp, summary]).reset_index()
# create categoricals and sort:
dtype = pd.CategoricalDtype(['sales$', 'items', 'average item price'], ordered = True)
output.Attribute = output.Attribute.astype(dtype)
dtype = pd.CategoricalDtype(['cash', 'card'], ordered = True)
output['Payment Method'] = output['Payment Method'].astype(dtype)
output.sort_values(cols)
Store Date Payment Method Attribute Value
0 5123 2021-01-01 00:00:00 cash sales$ 105.000000
1 5123 2021-01-01 00:00:00 cash items 20.000000
8 5123 2021-01-01 00:00:00 cash average item price 5.250000
2 5123 2021-01-01 00:00:00 card sales$ 355.000000
3 5123 2021-01-01 00:00:00 card items 50.000000
9 5123 2021-01-01 00:00:00 card average item price 7.100000
4 5123 2021-01-02 00:00:00 cash sales$ NaN
5 5123 2021-01-02 00:00:00 cash items NaN
10 5123 2021-01-02 00:00:00 cash average item price NaN
6 5123 2021-01-02 00:00:00 card sales$ 170.000000
7 5123 2021-01-02 00:00:00 card items 35.000000
11 5123 2021-01-02 00:00:00 card average item price 4.857143