Home > Enterprise >  Create a new column based on a condition
Create a new column based on a condition

Time:12-18

I have a dataframe df, need to create a new column, which is a product of price with metric(int calculated before).

df['cost'] = df['price'] * metric if (df['status'] == 'online')
df['cost'] = 0 if df['status'] == 'offline'

CodePudding user response:

You can use np.where:

import numpy as np
df['cost'] = np.where(df['status']=='online', df['price']*metric, 0)

CodePudding user response:

We can leverage the point that True is 1 and False is 0 when used in multiplication.

3 * True  -> 3
3 * False -> 0

We have to check if values are equal to online in the status column.

df['cost'] = df['price'] * df['status'].eq('online') * metric

Wherever, status is offline cost value is 0.


The above solution relies on the fact you want to set offline values to 0. If you want to set offline to let's 999. Then we can use Series.where here.

df['cost'] = df['price'].mul(metric).where(df['status'].eq('online'), 999)

Now, every offline value to set to 999.

Useful links:

CodePudding user response:

If you want plain pandas you can use loc:

df['cost'] = 0
df.loc[df['status']== "online" , 'cost'] = df['price'] * metric

Or with a list comprehension:

df['cost'] = [df['price']*metric if status == 'online' else 0 for status in df['status']]

CodePudding user response:

Although, all of the answers here solve your problem but using np.where or np.select (already showed in above answers) is best to use because it can solve multiple conditions.

However, above solutions with np.where and np.select can give wrong output if data contains nan values.

Let me simplify and explain these two (np.select and pd.cut) for future cases also.

df = pd.DataFrame({'status':['online','offline','away','busy',np.nan]})
df['price'] = 100
df

status  price
0   online  100
1   offline     100
2   away    100
3   busy    100
4   NaN     100

Let's say you have 4 conditions:

status = online then  df['cost'] =df['price']*metric
status = offline then df['cost'] = 0
status = away then df['cost'] = 1
status = busy then df['cost'] = 2

then you can just put all these conditions simply in one segment like this:

col         = 'status'
conditions  = [ df[col] == 'online' , (df[col] == 'offline') , (df[col]== 'away'), df[col] == 'busy']
options     = [df['price']*metric, 0, 1, 2 ]

df['cost'] = np.select(conditions, options)

after running this.

    status  price   cost
0   online  100     500
1   offline     100     0
2   away    100     1
3   busy    100     2
4   NaN     100     0

you can observe the value is 0 in front of nan. To tackle this problem, use:

df['cost'] = np.select(conditions, options,default = np.nan)
  • Related