I'm trying to calculate the standard deviation of values from other columns, but also based on another column value, like this:
Consider this sample dataframe:
product | january | february | march | april | sales_months |
---|---|---|---|---|---|
prod1 | 3 | 6 | 7 | 1 | 4 |
prod2 | 0 | 0 | 5 | 14 | 2 |
prod3 | 2 | 4 | 67 | 1 | 4 |
The sales_months column indicates how many months of sales the product has had. So, if a product has only 2 months of sales (because it's a new product, as is the case with the second row) we only use the data for the last 2 months to get the standard deviation. If it has a 4 in the sales_months column, it indicates it has been sold for the whole 4 month period, so we can use the data of all 4 columns to get the standar deviation. So, I would like to get something like this:
product | january | february | march | april | sales_months | std_dev |
---|---|---|---|---|---|---|
prod1 | 3 | 6 | 7 | 1 | 4 | 2.38 |
prod2 | 0 | 0 | 5 | 14 | 2 | 4.5 |
prod3 | 2 | 4 | 67 | 1 | 4 | 28.02 |
How can I do this in for all rows of the dataframe. I tried building a function and applying it the dataframe, with my limited pandas knowledge, but I just caused the kernel to die:
def get_std_dev(row):
std_dev = 0.0
months = row.SALES_MONTHS
if months < 4:
m_tmp = int(months)
std_dev = df_co_info.iloc[:, 1:m_tmp].std(axis=1)
else:
std_dev = df_co_info.iloc[:, 1:4].std(axis=1)
return std_dev
df_co_info['stdDev'] = df_co_info.apply(get_std_dev, axis = 1)
EDIT 1 - Strange behaviour with 0s in columns
If I follow the numpy masks answer, I'm getting a weird behaviour when there are 0s in the month columns (all other rows seem ok)
Consider this real row from the dataframe (with all 12 months) | product | january | february | march | april | may | june | july | august | september | october | november | december | sales_months | std_dev | |:---- |:---- |:------:| -----:| -----:| -----:| -----:| -----:| -----:| -----:| -----:| -----:| -----:| -----:| -----:| | prod1 | 0 | 0 | 0 | 0 | 1007 | 2 | 2 | 6 | 0 | 3 | 202 | 0 | 8 | 66.963218 |
I'm getting a 66.96, but I should be getting a much higher value (329.43). Since we should be only considering sales on and after may, because of the 8 months of sales...
Gah, not sure why the table is not showing... Here's the row in an uglier format:
product prod1
january 0
february 0
march 0
april 0
may 1007
june 2
july 2
august 6
september 0
october 3
november 202
december 0
sales_months 8
std_dev 66.96
CodePudding user response:
Using numpy
and a mask
:
df2 = df.drop(columns=['product', 'sales_months'])
a = df2.to_numpy()
mask = np.arange(a.shape[1]) >= a.shape[1]-df['sales_months'].to_numpy()[:,None]
df['std_dev'] = df2.where(mask).std(axis=1)
output:
product january february march april sales_months std_dev
0 prod1 3 6 7 1 4 2.753785
1 prod2 0 0 5 14 2 6.363961
2 prod3 2 4 67 1 4 32.357379
masked df2
:
january february march april
0 3.0 6.0 7 1
1 NaN NaN 5 14
2 2.0 4.0 67 1
CodePudding user response:
You can replace the 0's with NaN
and stddev should ignore them:
df.replace(0, np.NaN)