I have a data frame that contains some NaN values. I want to delete the rows that have two or more NaN values and, in the remaining rows, replace each NaN with the mean of its row. Here is a simple example:
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['id'] = [1, 2, 3, 4, 5,6]
df['val1'] = [1, np.nan, 2, np.nan, 3, 5]
df['val2'] = [np.nan, np.nan, 2, np.nan, np.nan, 5]
df['val3'] = [4, np.nan, 2, np.nan, 7, 5]
df['val4'] = [3, np.nan, 2, np.nan, np.nan, 5]
   id  val1  val2  val3  val4
0   1   1.0   NaN   4.0   3.0
1   2   NaN   NaN   NaN   NaN
2   3   2.0   2.0   2.0   2.0
3   4   NaN   NaN   NaN   NaN
4   5   3.0   NaN   7.0   NaN
5   6   5.0   5.0   5.0   5.0
The output that I want is:
   id  val1  val2  val3  val4
0   1     1  2.67     4     3
1   3     2  2.00     2     2
2   6     5  5.00     5     5
CodePudding user response:
As @Lucas Teixeira mentioned:
df.dropna(thresh=4).T.fillna(df.drop(columns=["id"]).mean(axis=1)).T
or
df[df.apply(lambda x: sum(x.isna()) <= 1, axis=1)].T.fillna(df.drop(columns=["id"]).mean(axis=1)).T
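The one-liner can be unpacked into steps to see why the transpose is there: DataFrame.fillna with a Series matches the Series index against column labels, so the frame is transposed to turn the row labels into columns. A sketch assuming the df from the question; kept, row_means and filled are illustrative names only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id":   [1, 2, 3, 4, 5, 6],
    "val1": [1, np.nan, 2, np.nan, 3, 5],
    "val2": [np.nan, np.nan, 2, np.nan, np.nan, 5],
    "val3": [4, np.nan, 2, np.nan, 7, 5],
    "val4": [3, np.nan, 2, np.nan, np.nan, 5],
})

# Keep rows with at least 4 non-NaN entries out of the 5 columns (id included).
kept = df.dropna(thresh=4)

# Row means over the value columns only, indexed by the original row labels.
row_means = kept.drop(columns=["id"]).mean(axis=1)

# fillna with a Series fills column by column, so transpose, fill, transpose back.
filled = kept.T.fillna(row_means).T
```

After the round trip through .T the id column comes back as float; filled["id"].astype(int) restores it if that matters.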
CodePudding user response:
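To get rid of rows with 2 or more NaN values you can use
df.dropna(thresh=4)
where thresh is the number of non-NaN values a row needs in order not to be dropped (here 4 out of 5 columns, counting id). Then, to fill the NaN value in index 0 with the mean of that row, you can use
df.dropna(thresh=4).fillna(df.drop(columns=["id"]).iloc[0].mean())
This fills every remaining NaN with the mean of row 0, which is fine here because row 0 is the only kept row that still has a NaN.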
output:
   id  val1      val2  val3  val4
0   1   1.0  2.666667   4.0   3.0
2   3   2.0  2.000000   2.0   2.0
5   6   5.0  5.000000   5.0   5.0
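To see which rows thresh=4 keeps, it can help to count the non-NaN entries per row directly. A small sketch, assuming the df from the question; non_nan and kept_by_count are illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id":   [1, 2, 3, 4, 5, 6],
    "val1": [1, np.nan, 2, np.nan, 3, 5],
    "val2": [np.nan, np.nan, 2, np.nan, np.nan, 5],
    "val3": [4, np.nan, 2, np.nan, 7, 5],
    "val4": [3, np.nan, 2, np.nan, np.nan, 5],
})

# Number of non-NaN entries in each row; the id column counts too.
non_nan = df.notna().sum(axis=1)

# Selecting rows with at least 4 non-NaN entries mirrors dropna(thresh=4).
kept_by_count = df[non_nan >= 4]
```

Rows 0, 2 and 5 have 4, 5 and 5 non-NaN entries, so they survive; row 4 has only 3 and is dropped along with the all-NaN rows.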
You can then use df.reset_index(drop=True) if you really need the index values to be ordered "0, 1, 2" like in your output and not "0, 2, 5" like in mine. Without drop=True the old index is kept as an extra column.
If you have more data than just the dataframe you posted, you can loop over the rows, but you don't need to fill the NaN values row by row: check @MoRe's answer for an approach that handles every row at once.
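When several of the kept rows each contain a NaN, a single scalar taken from row 0 would give every row the same fill value. A sketch of a per-row fill, using a hypothetical four-row frame (not the one from the question) and illustrative names value_cols, kept and result:

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 0-2 each miss one value, row 3 misses two.
df = pd.DataFrame({
    "id":   [1, 2, 3, 4],
    "val1": [1.0, np.nan, 2.0, np.nan],
    "val2": [np.nan, 6.0, 2.0, np.nan],
    "val3": [4.0, 8.0, 2.0, 1.0],
    "val4": [3.0, 10.0, np.nan, 1.0],
})
value_cols = ["val1", "val2", "val3", "val4"]

# Drop rows with two or more NaN values, then work on a copy.
kept = df.dropna(thresh=4).copy()

# Fill each row's NaN with that row's own mean over the value columns.
kept[value_cols] = kept[value_cols].apply(
    lambda row: row.fillna(row.mean()), axis=1
)

# Renumber the index 0, 1, 2, ... without keeping the old labels.
result = kept.reset_index(drop=True)
```

Row-wise apply is slower than a vectorised fill on large frames, but it keeps id untouched and makes the per-row logic explicit.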
CodePudding user response:
Using a mask, we can fill the NaN values with the mean of the row, computed over the value columns only so that id does not skew it.
df.dropna(thresh=4).mask(df.isna(), df.drop(columns="id").mean(axis=1), axis=0)
Output:
   id  val1      val2  val3  val4
0   1   1.0  2.666667   4.0   3.0
2   3   2.0  2.000000   2.0   2.0
5   6   5.0  5.000000   5.0   5.0
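Whether id is included in the row mean changes the fill value for row 0, so it is worth checking before masking. A minimal sketch, assuming the df from the question; mean_with_id, row_means, kept and result are illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id":   [1, 2, 3, 4, 5, 6],
    "val1": [1, np.nan, 2, np.nan, 3, 5],
    "val2": [np.nan, np.nan, 2, np.nan, np.nan, 5],
    "val3": [4, np.nan, 2, np.nan, 7, 5],
    "val4": [3, np.nan, 2, np.nan, np.nan, 5],
})

# Row means with and without the id column.
mean_with_id = df.mean(axis=1)
row_means = df.drop(columns="id").mean(axis=1)

# Drop rows with two or more NaN values, then replace each remaining NaN
# with its row's mean; axis=0 aligns the Series with the row labels.
kept = df.dropna(thresh=4)
result = kept.mask(kept.isna(), row_means, axis=0)
```

For row 0, including id gives (1 + 1 + 4 + 3) / 4 = 2.25, while the value columns alone give (1 + 4 + 3) / 3 ≈ 2.67, the figure the question asks for.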