I have a Pandas DataFrame where column B
contains mixed types
A B C
0 1 1 False
1 2 abc False
2 3 2 False
3 4 3 False
4 5 b False
I want to modify column C
to be True when the value in column B
is of type int
and also has a value greater than or equal to 3. So in this example df['B'][3]
should match this condition
I tried to do this:
df.loc[(df['B'].astype(str).str.isdigit()) & (df['B'] >= 3)] = True
However I get the following error because of the str
values inside column B
:
TypeError: '>' not supported between instances of 'str' and 'int'
If I'm able to only test the second condition on the subset provided after the first condition this would solve my problem I think. What can I do to achieve this?
CodePudding user response:
One solution could be:
df["B"].apply(lambda x: str(x).isdigit() and int(x) >= 3)
If x is not a digit, then the evaluation will stop and won't try to parse x
to int
- which throws a ValueError
if the argument is not parseable into an int
.
CodePudding user response:
A good way without the use of apply would be to use pd.to_numeric
with errors='coerce'
which will change the str
type to NaN
, without changing the type of column B:
df['C'] = pd.to_numeric(df.B, 'coerce') >= 3
>>> print(df)
A B C
0 1 1 False
1 2 abc False
2 3 2 False
3 4 3 True
4 5 b False
CodePudding user response:
This works (although nikeros' answer is more elegant).
def check_maybe_int(n):
return int(n) >= 3 if n.isdigit() else False
df.B.apply(check_maybe_int)
But the real answer is, don't do this! Mixed columns prevent a lot of Pandas' optimisations. apply
is not vectorised, so it's a lot slower than vector int comparison should be.
CodePudding user response:
There are many ways around this (e.g. use a custom (lambda) function with df.apply
, use df.replace()
first), but I think the easiest way might be just to use an intermediate column.
First, create a new column that does the first check, then do the second check on this new column.
CodePudding user response:
you can use apply(type)
as picture illustrate
d = {'col1': [1, 2,1, 2], 'col2': [3, 4,1, 2],'col3': [1, 2,1, 2],'col4': [1, 'e',True, 2.345]}
df = pd.DataFrame(data=d)
a = df.col4.apply(type)
b = [ i==str for i in a ]
df['col5'] = b