I am trying to remove all rows that contain NaN values from a dataframe. However, I have realized that using df.dropna(axis='rows')
and using df.drop(index=np.where(df.isnull().sum()!=0)[0], axis='index')
give different results: the former removes fewer rows than the latter.
For example, my initial dataframe has 80 cols and 91713 rows.
- If I use
dropna()
the resulting dataframe has 80 cols and 91639 rows (i.e. 74 rows were dropped). - If I instead use
drop()
the new shape is 80 cols and 56935 rows (i.e. 34778 rows were dropped).
Is there something wrong with how I am getting the indices to pass to df.drop()
? I do get 74 indices if I just count what that method selects. E.g. with df_nulls = df.iloc[np.where(df.isnull().sum()!=0)[0]]
, df_nulls.shape[0]
is 74.
Update:
I know there is definitely something wrong with the df.drop()
method, because when I try to run further processing on the data I get errors related to there still being NaNs. But why would np.where(df.isnull().sum()!=0)
not find all the NaN values?
Update 2: it is certainly just something wrong with my indexing (see below), but shouldn't iloc give the rows?
indices_rows_with_nulls = np.where(df.isnull().sum()!=0)[0]
df_nulls = df.iloc[indices_rows_with_nulls]
print('df.shape: ' + str(df.shape) + ' df_nulls.shape: ' + str(df_nulls.shape))
indices_rows_without_nulls = np.where(df.isnull().sum()==0)[0]
df_no_nulls = df.iloc[indices_rows_without_nulls]
print('df.shape: ' + str(df.shape) + ' df_no_nulls.shape: ' + str(df_no_nulls.shape))
gives
df.shape: (91713, 80) df_nulls.shape: (74, 80)
df.shape: (91713, 80) df_no_nulls.shape: (6, 80)
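For reference, the mismatch can be reproduced on a hypothetical small frame (3 columns, NaNs in one row only); the exact values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: NaNs appear only in row 0, in columns 'a' and 'c'.
df = pd.DataFrame({
    'a': [np.nan, 2.0, 3.0, 4.0, 5.0],
    'b': [1.0, 2.0, 3.0, 4.0, 5.0],
    'c': [np.nan, 2.0, 3.0, 4.0, 5.0],
})

print(df.dropna(axis='rows').shape[0])   # 4 -- only row 0 is dropped

# isnull().sum() defaults to axis=0, so this counts NaNs *per column*
# and np.where returns column positions ([0, 2]), not row positions.
idx = np.where(df.isnull().sum() != 0)[0]
print(idx)                               # [0 2]

print(df.drop(index=idx).shape[0])       # 3 -- rows 0 AND 2 dropped, row 2 wrongly
```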
CodePudding user response:
You need to sum along the columns (axis=1), so that isnull().sum() produces one NaN count per row instead of one per column:
df.isnull().sum(axis=1)!=0
# or
df.isnull().sum(axis='columns')!=0