How to make a calculation based on a value and multiple columns in pandas?-CodePudding

I am trying to create a column, that should do a calculation per product, based on multiple columns.

Logic for the calculation column:

Calculations should be done per product
Use quantity as default
IF promo[Y/N] = 1, then take previous weeks quantity * season perc. change.
Except when the promo is on the first week of the product. Then keep quantity as well.

In the example below, You see the calculation column (I placed comments for the logic).

week    product    promo[Y/N]   quantity    Season       calculation 
  1       A           0           100          6            100 # no promo, so = quantity col
  2       A           0           100          10           100 # no promo, so = quantity col
  3       A           1           ?            -10           90 # 100 (quantity last week)- 10% (season)
  4       A           1           ?            20           108 # quantity last week, what we calculated, 90   20% (18) = 108.  
  5       A           0           80            4            80 # no promo, so = quantity col

  1       B           1           100          6            100 # Promo, but first week of this product. So regular quantity. 
  2       B           0           100          10           100 # no promo, so = quantity col
  3       B           1           ?            -10           90 # 100 (quantity last week)- 10% (season)
  4       B           1           ?            20           108 # quantity last week, what we calculated, 90   20% (18) = 108.  
  5       B           0           80            4            80 # no promo, so = quantity col

I tried to solve this in two ways:

Via a groupby(), and then the Product, but this was messing up my end file (I would like to have it in the format above, so with 1 additional column.
By looping over the dataframe with iterrows(). However, I messed up, because it doesn't distinct between products unfortunately.

Anyone an idea what a proper method is to solve this? Appreciated!

CodePudding user response：

groupby() is a good approach in my opinion.

Let's build our dataset first :

csvfile = StringIO(
        """week\tproduct\tpromo\tquantity\tseason
        1\tA\t0\t100\t6 
        2\tA\t0\t100\t10
        3\tA\t1\tnan\t-10
        4\tA\t1\tnan\t20
        5\tA\t0\t80\t4 
        1\tB\t1\t100\t6
        2\tB\t0\t100\t10
        3\tB\t1\tnan\t-10
        4\tB\t1\tnan\t20
        5\tB\t0\t80\t4""")
df = pd.read_csv(csvfile, sep='\t', engine='python')

I strongly encourage to not use '?' or other non numeric character to transcript the lack of data. It will be a mess in your future analysis. Instead use np.nan.

Then let's build a function hich will translate your expected behaviour for each product :

def computation(pd_input):
    previous_calculation = np.nan
    # output dataframe
    pd_output = pd.DataFrame(index=pd_input.index, columns=["calculation"])

    # we need to loop other all your row
    for index, row in pd_input.iterrows():
        # if promo = 0 OR for the first row, use quantity
        if row.promo == 0 or np.isnan(previous_calculation):
            pd_output.loc[index, "calculation"] = row.quantity
        # else, use the previous value we computed and apply the percentage
        else:
            pd_output.loc[index, "calculation"] = (1   row.season/100) * previous_calculation
        previous_calculation = pd_output.loc[index, "calculation"]

    # return the pandas
    return pd_output

Finally we just need to apply the previous function to our grouped dataframe :

calculation = df.groupby("product", group_keys=True).apply(lambda x: computation(x))

Let's check our ouput :

print(calculation )

          calculation
product              
A       0       100.0
        1       100.0
        2        90.0
        3       108.0
        4        80.0
B       5       100.0
        6       100.0
        7        90.0
        8       108.0
        9        80.0

CodePudding user response：

Using custom apply function to add column to dataframe

Code

def calc(xf):
    '''
        xf is dataframe from groupby
    '''
    calc = []
    # Faster to loop over rows using zip than iterrows
    for promo, quantity, season in zip(xf['promo[Y/N]'], xf['quantity'], xf['Season']):
        if quantity.isnumeric():
            calc.append(quantity)
        elif promo and calc:                # beyond first week if calc is not None
            prev_quantity = float(calc[-1]) # previous quantity
            estimate = round((1   season/100.)*prev_quantity) # estimate
            calc.append(estimate)
        else:
            calc.append(quantity)           # use current quantity
            
    xf['calculated'] = calc                 # Add calculated column to dataframe
    return xf

Test

tf = df.groupby('product').apply(calc)
print(tf)

Output

   week product  promo[Y/N] quantity  Season calculated
0     1       A           0      100       6        100
1     2       A           0      100      10        100
2     3       A           1        ?     -10         90
3     4       A           1        ?      20        108
4     5       A           0       80       4         80
5     1       B           1      100       6        100
6     2       B           0      100      10        100
7     3       B           1        ?     -10         90
8     4       B           1        ?      20        108
9     5       B           0       80       4         80