I want to build a new dataframe containing the rows of an original dataframe that have a non-numeric (i.e. string) value in a specific column.
import pandas as pd
import numpy as np
test = {'a':[1,2,3],
'b':[4,5,'x'],
'c':['f','g','h']}
df_test = pd.DataFrame(test)
print(df_test)
I want to get the third row, where the value in column 'b' is not numeric (it is 'x').
CodePudding user response:
The complication is that pandas stores a column holding both str and int values under a single dtype (object), so a simple dtype-based selection is not possible. Hence I think it is necessary to iterate over the column of interest to build a mask that selects the row(s), and then extract them.
mask = []
for j in df_test['b']:
    if isinstance(j, str):
        mask.append(True)
    else:
        mask.append(False)
print(df_test[mask])
which produces
   a  b  c
2  3  x  h
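To see why a plain dtype check is not enough, it can help to inspect the column's dtype and the Python type of each element. The following is a minimal sketch that reuses the question's data; the name `elem_types` is only for illustration.

```python
import pandas as pd

df_test = pd.DataFrame({'a': [1, 2, 3],
                        'b': [4, 5, 'x'],
                        'c': ['f', 'g', 'h']})

# A column mixing int and str is stored with the generic object dtype
print(df_test['b'].dtype)

# Each element keeps its own Python type, so the types can be inspected one by one
elem_types = df_test['b'].map(type)
print(elem_types)

# Comparing the types against str gives the same boolean mask as the explicit loop
mask = elem_types == str
print(df_test[mask])
```

The mask built this way should select the same single row as the loop above; it is just a more compact way of writing the per-element check.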
CodePudding user response:
You'll need to build a boolean mask for this type of problem, either with a list comprehension or an element-wise apply, or by coercing the column to numeric. You can use any of the following approaches (you should see similar performance for all).
isinstance .apply
mask = df_test['b'].apply(isinstance, args=(str, ))
print(df_test.loc[mask])
   a  b  c
2  3  x  h
isinstance list comprehension
mask = [isinstance(v, str) for v in df_test['b']]
print(df_test.loc[mask])
   a  b  c
2  3  x  h
coerce to numeric and find NaNs
mask = pd.to_numeric(df_test['b'], errors='coerce').isna()
print(df_test.loc[mask])
   a  b  c
2  3  x  h
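These approaches agree on the question's data, but they answer slightly different questions. As a sketch on a hypothetical column (`s` below is made up for illustration and is not from the question): the `isinstance` mask flags every string, including numeric-looking ones such as '5', while the `pd.to_numeric` mask flags anything that cannot be parsed as a number, which includes values that are already missing.

```python
import pandas as pd

# Hypothetical data, not from the question: adds a numeric string and a missing value
s = pd.Series([4, '5', 'x', None], dtype=object)

# True only where the element is a Python str
str_mask = s.apply(isinstance, args=(str,))

# True where the value cannot be converted to a number, or is already missing
nan_mask = pd.to_numeric(s, errors='coerce').isna()

print(str_mask.tolist())
print(nan_mask.tolist())
```

Here '5' is flagged only by the first mask and None only by the second, so which approach fits depends on whether "non-numeric" means "is a string" or "cannot be read as a number".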