Home > Software engineering >  How to remove entire rows if all columns except one is empty?
How to remove entire rows if all columns except one is empty?

Time:12-01

I want to remove entire rows if all columns except the one is empty. So, imagine that my DataFrame is

df = pd.DataFrame({"col1": ["s1", "s2", "s3", "s4", "s5", "s6"],
                   "col2": [41, np.nan, np.nan, np.nan, np.nan, 61],
                   "col3": [24, 51, np.nan, np.nan, np.nan, 84],
                   "col4": [53, 64, 81, np.nan, np.nan, np.nan],
                   "col5": [43, 83, 47, 12, np.nan, 19]})

which looks like this

    col1   col2   col3   col4   col5
0   s1     41     24     53     43
1   s2     NaN    51     64     83
2   s3     NaN    NaN    81     47
3   s4     NaN    NaN    NaN    12
4   s5     NaN    NaN    NaN    NaN
5   s6     61     84     NaN    19   

In this example, the desired result is

    col1   col2   col3   col4   col5
0   s1     41     24     53     43
1   s2     NaN    51     64     83
2   s3     NaN    NaN    81     47
3   s4     NaN    NaN    NaN    12
4   s6     61     84     NaN    19  

which means that I want to remove the last row. I initially tried with df.dropna(how="all") but it does not work since the last row is not entirely empty (s5 in the col1).

How can I solve this?

CodePudding user response:

Use the thresh parameter:

N = 1
df.dropna(thresh=N 1)

Or if you want to match exactly N NAs (no more no less):

N = 1
out = df[df.isna().sum(axis=1).ne(df.shape[1]-N)]

Output:

  col1  col2  col3  col4  col5
0   s1  41.0  24.0  53.0  43.0
1   s2   NaN  51.0  64.0  83.0
2   s3   NaN   NaN  81.0  47.0
3   s4   NaN   NaN   NaN  12.0

CodePudding user response:

df[df.iloc[:, 1:].notnull().any(axis=1)]

CodePudding user response:

You should use threshold in dropna.

df = df.dropna(axis=0, thresh=2)

CodePudding user response:

You can also try this one to check if the element is a NaN.

np.isnan()

Here is the official documentation for more info. https://numpy.org/doc/stable/user/misc.html

CodePudding user response:

As an alternative to given answers and if you want to use .dropna() you may set col1 as your index with:

df = df.set_index("col1")

This way df.dropna(how='all') will work like a charm. If you do not need it as index anymore you may get your column back by df['col1'] = df.index and reset index df.reset_index(drop=True)
col1 will appear after col5, you may rearrange it back with:

cols = df.columns.tolist()
cols = cols[-1:]   cols[:-1]
df[cols]

CodePudding user response:

import numpy as np
import pandas as pd 

df = pd.DataFrame({"x1": [np.nan, np.nan], "x2": [1, np.nan]})
print(df.head())

for idx, row in df.iterrows():
    if np.isnan(row).all():
        df = df.drop(idx)

print(df.head())

Added in edit:

Simply drop the irrelevant columns from the row count.

import numpy as np
import pandas as pd 

df = pd.DataFrame({"name": ["keep", "remove"], "x1": [np.nan, np.nan], "x2": [1, np.nan]})
print("ORG")
print(df.head())

for idx, row in df.iterrows():
    if np.isnan(row[1:].astype(float)).all():
        df = df.drop(idx)

print("OUT")
print(df.head())
  • Related