In the dataset I'm working on, the Adult dataset, the missing values are indicated with the "?"
string, and I want to discard the rows containing missing values.
The documentation of df.dropna() lists no argument for passing a custom value to be interpreted as null/missing.
I know I can solve the problem with something like:
df_str = df.select_dtypes(['object'])  # get the columns containing strings
for col in df_str.columns:
    df = df[df[col] != '?']  # keep only the rows where this column is not '?'
but I was wondering whether there is a standard way of achieving this with the pandas API that offers more flexibility while also being faster.
CodePudding user response:
You can use any to check whether a row contains '?': eq('?') returns True for every matching cell, any reduces that to one flag per row, and ~ turns that flag to False so those rows are filtered out:
df = df[~df_str.eq('?').any(axis=1)]  # axis must be passed by keyword in pandas 2.0+
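To see the mask at work, here is a minimal self-contained sketch. The small DataFrame is made up for illustration (it only borrows a few Adult-style column names), and it compares the whole frame against '?' rather than just the string columns, which gives the same result because numeric cells never equal '?'.

```python
import pandas as pd

# Made-up sample data; the real Adult dataset has many more columns and rows.
df = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "?", "Private", "Private"],
    "occupation": ["Adm-clerical", "Exec-managerial", "?", "Handlers-cleaners"],
})

# True for every row that has '?' in at least one column.
has_missing = df.eq("?").any(axis=1)

# Keep only the rows where no column is '?'.
cleaned = df[~has_missing]
print(cleaned)
```

Restricting the comparison to df_str, as in the question, should only matter for speed on wide frames with many numeric columns.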
CodePudding user response:
You could replace it with NaN and then call dropna:
df = df.replace('?', float('nan')).dropna()
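A short sketch of this on made-up sample data (the column names are only illustrative). One thing to keep in mind: dropna() also removes rows that already held a genuine NaN, not only the ones that came from '?'.

```python
import numpy as np
import pandas as pd

# Made-up sample data; one row also contains a real NaN.
df = pd.DataFrame({
    "age": [39.0, 50.0, np.nan, 53.0],
    "workclass": ["State-gov", "?", "Private", "Private"],
})

# Turn the '?' placeholder into NaN, then drop every row with any NaN.
cleaned = df.replace("?", np.nan).dropna()
print(cleaned)
```

If only the '?' rows should go and existing NaN values should stay, a boolean mask on '?' is the safer choice.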
CodePudding user response:
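df.replace('?', np.nan, inplace=True)  # requires: import numpy as np
followed by df = df.dropna(), since dropna() returns a new DataFrame unless inplace=True is passed.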
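If the data is loaded with pd.read_csv, the replacement step can probably be skipped altogether by declaring '?' as a missing-value marker at read time via the na_values parameter. The sketch below assumes a comma-separated file where fields are followed by a space (hence skipinitialspace=True); the inline text stands in for the real file and is made up.

```python
import io
import pandas as pd

# Stand-in for the real file; the rows are invented for illustration.
raw = io.StringIO(
    "age,workclass,occupation\n"
    "39, State-gov, Adm-clerical\n"
    "50, ?, Exec-managerial\n"
    "38, Private, ?\n"
    "53, Private, Handlers-cleaners\n"
)

# na_values tells the parser to read '?' as NaN;
# skipinitialspace strips the blank after each comma so ' ?' matches '?'.
df = pd.read_csv(raw, na_values="?", skipinitialspace=True)

# With '?' already parsed as NaN, a plain dropna() discards those rows.
cleaned = df.dropna()
print(cleaned)
```

This keeps the custom marker in one place, and dropna() then works without any extra arguments.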