Home > Software engineering >  How to make a calculation in a pandas daframe depending on a value of a certain column
How to make a calculation in a pandas daframe depending on a value of a certain column

Time:01-11

I have this dataframe and I want to make a calculation depending on a condition, like below:

   count  prep  result
     0      10     100
    10      100    100

I wanto to create a new column evaluated that is:

if df['count']==0: 
    df['evaluated'] = df['result'] / df['prep']
else:
    df['evaluated'] = df['result'] / df['count']

expected result is:

   count  prep  result   evaluated
     0      10     100      10
    100     10     100      1

What's the best way to do it? My real dataframe has 30k rows.

CodePudding user response:

You can use where or mask:

df['evaluated'] = df['result'].div(df['prep'].where(df['count'].eq(0), df['count']))

Or:

df['evaluated'] = df['result'].div(df['count'].mask(df['count'].eq(0), df['prep']))

Output (assuming there was an error in the provided input):

   count  prep  result  evaluated
0      0    10     100       10.0
1    100    10     100        1.0

CodePudding user response:

You can also use np.where from numpy to do that:

df['evaluated'] = np.where(df['count'] == 0,
                           df['result'] / df['prep'],   # == 0
                           df['result'] / df['count'])  # != 0

Performance (not really significant) over 30k rows:

>>> %timeit df['result'].div(df['prep'].where(df['count'].eq(0), df['count']))
652 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> %timeit df['result'].div(df['count'].mask(df['count'].eq(0), df['prep']))
638 µs ± 1.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> %timeit np.where(df['count'] == 0, df['result'] / df['prep'], df['result'] / df['count'])
462 µs ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  • Related