I am trying to create a column, that should do a calculation per product, based on multiple columns.
Logic for the calculation column:
- Calculations should be done per product
- Use quantity as default
- IF promo[Y/N] = 1, then take previous weeks quantity * season perc. change.
- Except when the promo is on the first week of the product. Then keep quantity as well.
In the example below, You see the calculation column (I placed comments for the logic).
week product promo[Y/N] quantity Season calculation
1 A 0 100 6 100 # no promo, so = quantity col
2 A 0 100 10 100 # no promo, so = quantity col
3 A 1 ? -10 90 # 100 (quantity last week)- 10% (season)
4 A 1 ? 20 108 # quantity last week, what we calculated, 90 20% (18) = 108.
5 A 0 80 4 80 # no promo, so = quantity col
1 B 1 100 6 100 # Promo, but first week of this product. So regular quantity.
2 B 0 100 10 100 # no promo, so = quantity col
3 B 1 ? -10 90 # 100 (quantity last week)- 10% (season)
4 B 1 ? 20 108 # quantity last week, what we calculated, 90 20% (18) = 108.
5 B 0 80 4 80 # no promo, so = quantity col
I tried to solve this in two ways:
Via a groupby(), and then the Product, but this was messing up my end file (I would like to have it in the format above, so with 1 additional column.
By looping over the dataframe with iterrows(). However, I messed up, because it doesn't distinct between products unfortunately.
Anyone an idea what a proper method is to solve this? Appreciated!
CodePudding user response:
groupby() is a good approach in my opinion.
Let's build our dataset first :
csvfile = StringIO(
"""week\tproduct\tpromo\tquantity\tseason
1\tA\t0\t100\t6
2\tA\t0\t100\t10
3\tA\t1\tnan\t-10
4\tA\t1\tnan\t20
5\tA\t0\t80\t4
1\tB\t1\t100\t6
2\tB\t0\t100\t10
3\tB\t1\tnan\t-10
4\tB\t1\tnan\t20
5\tB\t0\t80\t4""")
df = pd.read_csv(csvfile, sep='\t', engine='python')
I strongly encourage to not use '?' or other non numeric character to transcript the lack of data. It will be a mess in your future analysis. Instead use np.nan.
Then let's build a function hich will translate your expected behaviour for each product :
def computation(pd_input):
previous_calculation = np.nan
# output dataframe
pd_output = pd.DataFrame(index=pd_input.index, columns=["calculation"])
# we need to loop other all your row
for index, row in pd_input.iterrows():
# if promo = 0 OR for the first row, use quantity
if row.promo == 0 or np.isnan(previous_calculation):
pd_output.loc[index, "calculation"] = row.quantity
# else, use the previous value we computed and apply the percentage
else:
pd_output.loc[index, "calculation"] = (1 row.season/100) * previous_calculation
previous_calculation = pd_output.loc[index, "calculation"]
# return the pandas
return pd_output
Finally we just need to apply the previous function to our grouped dataframe :
calculation = df.groupby("product", group_keys=True).apply(lambda x: computation(x))
Let's check our ouput :
print(calculation )
calculation
product
A 0 100.0
1 100.0
2 90.0
3 108.0
4 80.0
B 5 100.0
6 100.0
7 90.0
8 108.0
9 80.0
CodePudding user response:
Using custom apply function to add column to dataframe
Code
def calc(xf):
'''
xf is dataframe from groupby
'''
calc = []
# Faster to loop over rows using zip than iterrows
for promo, quantity, season in zip(xf['promo[Y/N]'], xf['quantity'], xf['Season']):
if quantity.isnumeric():
calc.append(quantity)
elif promo and calc: # beyond first week if calc is not None
prev_quantity = float(calc[-1]) # previous quantity
estimate = round((1 season/100.)*prev_quantity) # estimate
calc.append(estimate)
else:
calc.append(quantity) # use current quantity
xf['calculated'] = calc # Add calculated column to dataframe
return xf
Test
tf = df.groupby('product').apply(calc)
print(tf)
Output
week product promo[Y/N] quantity Season calculated
0 1 A 0 100 6 100
1 2 A 0 100 10 100
2 3 A 1 ? -10 90
3 4 A 1 ? 20 108
4 5 A 0 80 4 80
5 1 B 1 100 6 100
6 2 B 0 100 10 100
7 3 B 1 ? -10 90
8 4 B 1 ? 20 108
9 5 B 0 80 4 80