How to evaluate conditions after each other in Pandas .loc?

I have a Pandas DataFrame where column B contains mixed types

    A   B   C
0   1   1   False
1   2   abc False
2   3   2   False
3   4   3   False
4   5   b   False

I want to set column C to True where the value in column B is of type int and is greater than or equal to 3. In this example, only df['B'][3] satisfies the condition.

I tried to do this:

df.loc[(df['B'].astype(str).str.isdigit()) & (df['B'] >= 3)] = True

However I get the following error because of the str values inside column B:

TypeError: '>' not supported between instances of 'str' and 'int'

I think it would solve my problem if the second condition were tested only on the rows that pass the first condition. How can I achieve this?

CodePudding user response：

One solution could be:

df["C"] = df["B"].apply(lambda x: str(x).isdigit() and int(x) >= 3)

Because "and" short-circuits, int(x) is never called when str(x) does not consist entirely of digits, so the ValueError that int() raises for an unparseable argument cannot occur.
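A minimal sketch of the short-circuit approach on the sample data from the question. The helper name is_int_at_least and its threshold parameter are hypothetical and only serve the illustration.

```python
import pandas as pd

# Sample data from the question: column B mixes ints and strings.
df = pd.DataFrame({"A": [1, 2, 3, 4, 5],
                   "B": [1, "abc", 2, 3, "b"],
                   "C": False})

def is_int_at_least(x, threshold=3):
    # `and` short-circuits: int(x) runs only when str(x) is all digits.
    return str(x).isdigit() and int(x) >= threshold

# apply() calls the helper once per value and returns a boolean Series.
df["C"] = df["B"].apply(is_int_at_least)
print(df)
```

Note that this check also accepts a digit string such as "3", because it looks at the text form of the value rather than its type. If only real ints should count, a type check is needed instead.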

CodePudding user response：

A good way that avoids apply is pd.to_numeric with errors='coerce'. It turns every value that cannot be parsed as a number into NaN in the result, without changing column B itself:

df['C'] = pd.to_numeric(df['B'], errors='coerce') >= 3

>>> print(df) 

   A    B      C
0  1    1  False
1  2  abc  False
2  3    2  False
3  4    3   True
4  5    b  False
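To see why this works, it may help to look at the intermediate result. The sketch below rebuilds the sample frame and keeps the coerced values in a separate variable (numeric_b, a name chosen here for illustration).

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5],
                   "B": [1, "abc", 2, 3, "b"],
                   "C": False})

# Non-numeric strings become NaN; column B itself is left untouched.
numeric_b = pd.to_numeric(df["B"], errors="coerce")
print(numeric_b)

# Any comparison with NaN is False, so the string rows drop out on their own.
df["C"] = numeric_b >= 3
print(df)
```

Because the comparison is numeric, a float such as 3.5 or a numeric string such as "4" in column B would also be marked True. That is fine for the sample data, but it is an assumption worth checking on real data.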

CodePudding user response：

This works (although nikeros' answer is more elegant).

def check_maybe_int(n):
    # str(n) is needed because the int values in column B have no .isdigit() method
    return int(n) >= 3 if str(n).isdigit() else False

df.B.apply(check_maybe_int)

But the real answer is, don't do this! Mixed columns prevent a lot of Pandas' optimisations. apply is not vectorised, so it is much slower than a vectorised integer comparison on a column with a single numeric dtype.
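One possible way to follow that advice is to clean the mixed column once into a proper numeric column and compare against that. This is only a sketch: the column name B_int is hypothetical, and the nullable "Int64" dtype is an assumption about what suits the data.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5],
                   "B": [1, "abc", 2, 3, "b"],
                   "C": False})

# Clean once: unparseable values become <NA> in a nullable integer column.
# "B_int" is a hypothetical column name, not part of the original post.
df["B_int"] = pd.to_numeric(df["B"], errors="coerce").astype("Int64")

# Vectorised comparison; rows holding <NA> are treated as not matching.
df["C"] = (df["B_int"] >= 3).fillna(False)
print(df)
```

After this, any further numeric filtering can use B_int directly, without apply.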

CodePudding user response：

There are many ways around this (e.g. use a custom (lambda) function with df.apply, use df.replace() first), but I think the easiest way might be just to use an intermediate column. First, create a new column that does the first check, then do the second check on this new column.
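The intermediate-column idea could look roughly like this. The column name B_is_int is made up for the example, and the first check reuses the isdigit test from the question.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5],
                   "B": [1, "abc", 2, 3, "b"],
                   "C": False})

# Step 1: intermediate column holding the result of the first check.
df["B_is_int"] = df["B"].astype(str).str.isdigit()

# Step 2: run the comparison only on the rows that passed step 1.
int_rows = df.loc[df["B_is_int"], "B"].astype(int)
df.loc[int_rows.index, "C"] = int_rows >= 3

# The helper column can be dropped once C is filled in.
df = df.drop(columns="B_is_int")
print(df)
```

Since the comparison only ever sees the rows that passed the first check, the TypeError from comparing a str with an int does not arise.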

CodePudding user response：

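You can use apply(type) to get the type of each value and compare against it:

import pandas as pd

d = {'col1': [1, 2, 1, 2], 'col2': [3, 4, 1, 2], 'col3': [1, 2, 1, 2], 'col4': [1, 'e', True, 2.345]}
df = pd.DataFrame(data=d)
a = df.col4.apply(type)        # Python type of each value in col4
b = [i == str for i in a]      # True where the value is a str
df['col5'] = b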
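Applied to the DataFrame from the question, a type-based check might be sketched as follows. It assumes the ints in column B are plain Python int objects, which is the case when the frame is built from a Python list as here. bool is excluded explicitly because it is a subclass of int.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5],
                   "B": [1, "abc", 2, 3, "b"],
                   "C": False})

# True where the stored value is an actual int; strings such as "3" do not count.
is_int = df["B"].apply(lambda v: isinstance(v, int) and not isinstance(v, bool))

# Compare only the int rows; every other row keeps C = False.
df.loc[is_int, "C"] = df.loc[is_int, "B"].astype(int) >= 3
print(df)
```

Unlike the isdigit-based checks, this matches the question's wording literally: the value has to be of type int, not merely look like one.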