Home > Blockchain >  Update Pandas column based on condition
Update Pandas column based on condition

Time:07-08

Following the title of the question, the case is this:

Creating dataframe:

import pandas as pd

df = pd.DataFrame({ 'a': ['one', 'one', 'three', 'two', 'eleven', 'two'],
              'b': [45, 34, 556, 32, 97, 33],
              'c': [234, 66, 12, 44, 99, 3],
              'd': [123, 45, 55, 98, 17, 22] })
df

Output:

        a   b   c   d
    0   one  45 234 123
    1   one  34 66  45
    2   three 556   12  55
    3   two 32  44  98
    4   eleven97    99  17
    5   two 33  3   22

Let's say I want to add a column 'e' which is the sum of the columns 'b', 'c' and 'd'. It's simple:

df['e'] = df.b   df.c   df.d
df

Output:

    a   b   c   d   e
0   one 45  234 123 402
1   one 34  66  45  145
2   three   556 12  55  623
3   two 32  44  98  174
4   eleven  97  99  17  213
5   two 33  3   22  58

Now I want one more column 'f' , but based on the following condition:

if df.a == 'one' and df.b < 50:
    df['f'] = 0
elif df.a == 'two' and df.d > 50:
    df['f'] = 1
else:
    df['f'] = 2

But of course this code does not work.

out:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), 
a.item(), a.any() or a.all().

How could those condition be correctly implemented?

CodePudding user response:

You can use np.select for this:

import pandas as pd
import numpy as np

df = pd.DataFrame({ 'a': ['one', 'one', 'three', 'two', 'eleven', 'two'],
              'b': [45, 34, 556, 32, 97, 33],
              'c': [234, 66, 12, 44, 99, 3],
              'd': [123, 45, 55, 98, 17, 22] })

df['e'] = df.b   df.c   df.d

# list with your conditions
conditions = [(df.a == 'one') & (df.b < 50),
              (df.a == 'two') & (df.d > 50)]

# list with accompanying choices
choices = [0,1]

df['f'] = np.select(conditions, choices, 2) 
# 2 being the default: i.e. the 'else' choice.

df

        a    b    c    d    e  f
0     one   45  234  123  402  0
1     one   34   66   45  145  0
2   three  556   12   55  623  2
3     two   32   44   98  174  1
4  eleven   97   99   17  213  2
5     two   33    3   22   58  2

CodePudding user response:

You can use nested np.where methods:

import pandas as pd
import numpy as np

df = pd.DataFrame({ 'a': ['one', 'one', 'three', 'two', 'eleven', 'two'],
              'b': [45, 34, 556, 32, 97, 33],
              'c': [234, 66, 12, 44, 99, 3],
              'd': [123, 45, 55, 98, 17, 22] })
df['e'] = df.b   df.c   df.d
df['f'] = np.where(
    (df.a == 'one') & (df.b < 50), 
    0, 
    np.where(
        (df.a == 'two') & (df.d > 50), 
        1, 
        2
    )
)

Output:

        a    b    c    d    e  f
0     one   45  234  123  402  0
1     one   34   66   45  145  0
2   three  556   12   55  623  2
3     two   32   44   98  174  1
4  eleven   97   99   17  213  2
5     two   33    3   22   58  2

CodePudding user response:

def setter(x):
    if x.a == 'one' and x.b < 50:
        return 0
    elif x.a == 'two' and x.d > 50:
        return 1
    else:
        return 2

df['f'] = df.apply(lambda x: setter(x), axis=1)

CodePudding user response:

One option is case_when from pyjanitor; it is a wrapper around pd.Series.mask and as much as possible passes all the hardwork of dtypes and the like to Pandas:

# pip install pyjanitor
import pandas as pd
import janitor

(df
.assign(e = df.b   df.c   df.d)
# case_when is an alternation 
# of conditions and expected values
.case_when(df.a.eq('one') & df.b.lt(50), 0, # condition, value
           df.a.eq('two') & df.d.gt(50), 1, 
           2, # default
           column_name = 'f')
)

        a    b    c    d    e  f
0     one   45  234  123  402  0
1     one   34   66   45  145  0
2   three  556   12   55  623  2
3     two   32   44   98  174  1
4  eleven   97   99   17  213  2
5     two   33    3   22   58  2
  • Related