I have this df:
import pandas as pd
import numpy as np
d = {'name': ['bob', 'jake','Sem'], 'F1': [3, 4, np.nan], 'F2': [14, 40, 7], 'F3':
[np.nan, 1, 55]}
df = pd.DataFrame(data=d)
print (df)
out>>>
name F1 F2 F3
0 bob 3.0 14 NaN
1 jake 4.0 40 1.0
2 Sem NaN 7 NaN
I would like to delete all the rows that under at least 2 columns (between F1 F2 and F3) are NaN. Like:
name F1 F2 F3
0 bob 3.0 14 NaN
1 jake 4.0 40 1.0
This is just an example, but I may have up to many columns (up to F100) and I may want to delete with other values instead of 2 out of 3 columns. What is the best way to do this?
CodePudding user response:
You can use the subset
and thresh
parameters of dropna
:
df.dropna(subset=['F1', 'F2', 'F3'], thresh=2)
Example:
import pandas as pd
import numpy as np
d = {'name': ['bob', 'jake', 'Sem', 'Mario'],
'F1': [3, 4, np.nan, np.nan],
'F2': [14, 40, 7, 42],
'F3': [np.nan, 1, 55, np.nan]}
df = pd.DataFrame(data=d)
print(df)
name F1 F2 F3
0 bob 3.0 14 NaN
1 jake 4.0 40 1.0
2 Sem NaN 7 55.0
3 Mario NaN 42 NaN
df2 = df.dropna(subset=['F1', 'F2', 'F3'], thresh=2)
print(df2)
name F1 F2 F3
0 bob 3.0 14 NaN
1 jake 4.0 40 1.0
2 Sem NaN 7 55.0
Selecting the columns automatically:
cols = list(df.filter(regex=r'F\d '))
df.dropna(subset=cols, thresh=2)
alternative without dropna
Using boolean indexing:
m = df.filter(regex=r'F\d ').isna().sum(1)
df2 = df[m.lt(2)]
values above threshold
drop if at least 2 values ≥ 4:
m = df.filter(regex=r'F\d ').gt(4).sum(1)
df2 = df[m.lt(2)]