Home > Enterprise >  How to calculate new column based on values from multiple columns
How to calculate new column based on values from multiple columns

Time:08-02

I'm trying to calculate the standard deviation of values from other columns, but also based on another column value, like this:

Consider this sample dataframe:

product january february march april sales_months
prod1 3 6 7 1 4
prod2 0 0 5 14 2
prod3 2 4 67 1 4

The sales_months column indicates how many months of sales the product has had. So, if a product has only 2 months of sales (because it's a new product, as is the case with the second row) we only use the data for the last 2 months to get the standard deviation. If it has a 4 in the sales_months column, it indicates it has been sold for the whole 4 month period, so we can use the data of all 4 columns to get the standar deviation. So, I would like to get something like this:

product january february march april sales_months std_dev
prod1 3 6 7 1 4 2.38
prod2 0 0 5 14 2 4.5
prod3 2 4 67 1 4 28.02

How can I do this in for all rows of the dataframe. I tried building a function and applying it the dataframe, with my limited pandas knowledge, but I just caused the kernel to die:

def get_std_dev(row):
    std_dev = 0.0
    months = row.SALES_MONTHS
    if months < 4:
        m_tmp = int(months)
        std_dev = df_co_info.iloc[:, 1:m_tmp].std(axis=1)
    else:
        std_dev = df_co_info.iloc[:, 1:4].std(axis=1)
    return std_dev

df_co_info['stdDev'] = df_co_info.apply(get_std_dev, axis = 1)

EDIT 1 - Strange behaviour with 0s in columns

If I follow the numpy masks answer, I'm getting a weird behaviour when there are 0s in the month columns (all other rows seem ok)

Consider this real row from the dataframe (with all 12 months) | product | january | february | march | april | may | june | july | august | september | october | november | december | sales_months | std_dev | |:---- |:---- |:------:| -----:| -----:| -----:| -----:| -----:| -----:| -----:| -----:| -----:| -----:| -----:| -----:| | prod1 | 0 | 0 | 0 | 0 | 1007 | 2 | 2 | 6 | 0 | 3 | 202 | 0 | 8 | 66.963218 |

I'm getting a 66.96, but I should be getting a much higher value (329.43). Since we should be only considering sales on and after may, because of the 8 months of sales...

Gah, not sure why the table is not showing... Here's the row in an uglier format:

product       prod1
january       0
february      0
march         0
april         0
may           1007
june          2
july          2
august        6
september     0
october       3
november      202
december      0
sales_months  8 
std_dev       66.96

CodePudding user response:

Using numpy and a mask:

df2 = df.drop(columns=['product', 'sales_months'])
a = df2.to_numpy()
mask = np.arange(a.shape[1]) >= a.shape[1]-df['sales_months'].to_numpy()[:,None]

df['std_dev'] = df2.where(mask).std(axis=1)

output:

  product  january  february  march  april  sales_months    std_dev
0   prod1        3         6      7      1             4   2.753785
1   prod2        0         0      5     14             2   6.363961
2   prod3        2         4     67      1             4  32.357379

masked df2:

   january  february  march  april
0      3.0       6.0      7      1
1      NaN       NaN      5     14
2      2.0       4.0     67      1

CodePudding user response:

You can replace the 0's with NaN and stddev should ignore them:

df.replace(0, np.NaN)

  • Related