Drop rows if some columns have some value


I have this DataFrame:

import pandas as pd
import numpy as np
d = {'name': ['bob', 'jake', 'Sem'],
     'F1': [3, 4, np.nan],
     'F2': [14, 40, 7],
     'F3': [np.nan, 1, 55]}
df = pd.DataFrame(data=d)
print(df)

out>>>
   name   F1  F2    F3
0   bob  3.0  14   NaN
1  jake  4.0  40   1.0
2   Sem  NaN   7   NaN

I would like to delete all rows in which at least 2 of the columns F1, F2 and F3 are NaN. Like:

   name   F1  F2    F3
0   bob  3.0  14   NaN
1  jake  4.0  40   1.0

This is just an example; I may have many more columns (up to F100) and may want to use a different threshold than 2 out of 3 columns. What is the best way to do this?

CodePudding user response:

You can use the subset and thresh parameters of dropna. thresh is the minimum number of non-NaN values a row must have (here, among the subset columns) to be kept:

df.dropna(subset=['F1', 'F2', 'F3'], thresh=2)

Example:

import pandas as pd
import numpy as np
d = {'name': ['bob', 'jake', 'Sem', 'Mario'],
     'F1': [3, 4, np.nan, np.nan],
     'F2': [14, 40, 7, 42],
     'F3': [np.nan, 1, 55, np.nan]}
df = pd.DataFrame(data=d)
print(df)

    name   F1  F2    F3
0    bob  3.0  14   NaN
1   jake  4.0  40   1.0
2    Sem  NaN   7  55.0
3  Mario  NaN  42   NaN

df2 = df.dropna(subset=['F1', 'F2', 'F3'], thresh=2)
print(df2)

   name   F1  F2    F3
0   bob  3.0  14   NaN
1  jake  4.0  40   1.0
2   Sem  NaN   7  55.0

Selecting the columns automatically:

cols = list(df.filter(regex=r'F\d+'))
df.dropna(subset=cols, thresh=2)
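Since thresh counts the non-NaN values a row must keep, "drop rows where at least k of the F columns are NaN" translates to thresh = len(cols) - k + 1. A minimal sketch of that generalization (the F1–F4 data here is made up for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'F1': [1, np.nan, np.nan],
    'F2': [2, np.nan, 5],
    'F3': [3, np.nan, np.nan],
    'F4': [np.nan, 4, np.nan],
})

# all F-columns, however many there are
cols = list(df.filter(regex=r'F\d+'))

# drop rows with at least k NaNs among the F columns
k = 2
df2 = df.dropna(subset=cols, thresh=len(cols) - k + 1)
```

With k=2 over 4 columns, only row 'a' (a single NaN) survives; rows 'b' and 'c' have 2 or more NaNs and are dropped.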

Alternative without dropna

Using boolean indexing:

m = df.filter(regex=r'F\d+').isna().sum(axis=1)

df2 = df[m.lt(2)]
Values above a threshold

Drop rows where at least 2 values are greater than 4:

m = df.filter(regex=r'F\d+').gt(4).sum(axis=1)

df2 = df[m.lt(2)]
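For completeness, a runnable sketch of this boolean-indexing variant on the four-row example frame from above (note that gt is a strict comparison, so exactly 4 does not count):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'name': ['bob', 'jake', 'Sem', 'Mario'],
                   'F1': [3, 4, np.nan, np.nan],
                   'F2': [14, 40, 7, 42],
                   'F3': [np.nan, 1, 55, np.nan]})

# count per row how many F-column values are strictly greater than 4
m = df.filter(regex=r'F\d+').gt(4).sum(axis=1)

# keep rows where fewer than 2 values exceed 4
df2 = df[m.lt(2)]
```

Only 'Sem' has two values above 4 (7 and 55), so that row is dropped and the other three remain.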