Delete rows that have a certain number of NaN values

I have a data frame that contains some NaN values. I want to delete the rows that have two or more NaN values, and replace the NaN values in the remaining rows with the mean of that row. Here is a simple example:

import numpy as np
import pandas as pd
df = pd.DataFrame()
df['id'] = [1, 2, 3, 4, 5, 6]
df['val1'] = [1, np.nan, 2, np.nan, 3, 5]
df['val2'] = [np.nan, np.nan, 2, np.nan, np.nan, 5]
df['val3'] = [4, np.nan, 2, np.nan, 7, 5]
df['val4'] = [3, np.nan, 2, np.nan, np.nan, 5]

     id  val1  val2  val3  val4
 0   1   1.0   NaN   4.0   3.0
 1   2   NaN   NaN   NaN   NaN
 2   3   2.0   2.0   2.0   2.0
 3   4   NaN   NaN   NaN   NaN
 4   5   3.0   NaN   7.0   NaN
 5   6   5.0   5.0   5.0   5.0

The output that I want is:

      id    val1  val2    val3  val4
0     1     1     2.67     4     3
1     3     2     2.00     2     2
2     6     5     5.00     5     5

CodePudding user response：

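As @Lucas Teixeira mentioned, you can drop the rows with dropna(thresh=4) and then fill along the rows by transposing, filling, and transposing back:

df.dropna(thresh=4).T.fillna(df.drop(columns=["id"]).mean(axis=1)).T

Alternatively, count the NaN values per row explicitly and keep the rows with at most one:

df[df.apply(lambda x: sum(x.isna()) <= 1, axis=1)].T.fillna(df.drop(columns=["id"]).mean(axis=1)).T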
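
The double transpose is the least obvious part, so here is a minimal runnable sketch of the first one-liner split into steps. The names row_means, kept and result are only illustrative. It relies on fillna matching a Series against column labels, which is why the row labels have to become columns first.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "val1": [1, np.nan, 2, np.nan, 3, 5],
    "val2": [np.nan, np.nan, 2, np.nan, np.nan, 5],
    "val3": [4, np.nan, 2, np.nan, 7, 5],
    "val4": [3, np.nan, 2, np.nan, np.nan, 5],
})

# Row means over the value columns only (id excluded); NaN cells are skipped.
row_means = df.drop(columns=["id"]).mean(axis=1)

# Keep rows with at least 4 non-NaN cells out of 5 (the id cell counts as one).
kept = df.dropna(thresh=4)

# After .T the original row labels (0, 2, 5) are column labels, so
# fillna(row_means) pairs each row mean with the matching column.
# The second .T restores the original orientation.
result = kept.T.fillna(row_means).T
print(result)
```

One side effect worth knowing: because the transposed frame holds a single dtype, id comes back as a float column (1.0, 3.0, 6.0) rather than integers.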

CodePudding user response：

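To get rid of rows with 2 or more NaN values you can use df.dropna(thresh=4), where thresh is the minimum number of non-NaN values a row needs in order to be kept (the id column counts towards it). Then, to fill the remaining NaN value in the row at index 0 with that row's mean, you can chain
df.dropna(thresh=4).fillna(df.drop(columns=["id"]).iloc[0].mean()).
Note that this fills every remaining NaN with the mean of row 0, which is enough here only because row 0 is the only kept row that still has a NaN.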

output:

    id  val1    val2          val3      val4
0   1   1.0     2.666667      4.0       3.0
2   3   2.0     2.000000      2.0       2.0
5   6   5.0     5.000000      5.0       5.0

You can then use df.reset_index(drop=True) if you really need the index values to be ordered "0, 1, 2" like in your output and not "0, 2, 5" like in mine. Without drop=True, the old index is kept as an extra column.
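
A small sketch of how the index reset behaves, using a hypothetical stand-in frame named filtered in place of the real filtered result:

```python
import pandas as pd

# Stand-in for the filtered frame: row labels 0, 2, 5 survive the dropna step.
filtered = pd.DataFrame(
    {"id": [1, 3, 6], "val1": [1.0, 2.0, 5.0]},
    index=[0, 2, 5],
)

# Default behaviour: the old labels are moved into a new column named "index".
with_old_index = filtered.reset_index()

# drop=True discards the old labels and renumbers the rows 0, 1, 2.
renumbered = filtered.reset_index(drop=True)

print(with_old_index.columns.tolist())
print(renumbered.index.tolist())
```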

If you have more data than just the dataframe you posted, you don't need to fill the NaN values row by row: a loop, or the approach in @MoRe's answer, can fill every NaN with the mean of its own row.
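
For larger frames, one possible generalisation loops over the value columns instead of the rows. This is only a sketch: the function name drop_and_fill and its parameters value_cols and max_nan are made up for illustration, and it assumes the NaN count should ignore the id column.

```python
import numpy as np
import pandas as pd

def drop_and_fill(frame, value_cols, max_nan=1):
    """Drop rows with more than max_nan NaNs in value_cols,
    then fill the remaining NaNs with the mean of their own row."""
    values = frame[value_cols]
    keep = values.isna().sum(axis=1) <= max_nan   # boolean mask, one entry per row
    out = frame.loc[keep].copy()
    row_means = values.loc[keep].mean(axis=1)     # NaN cells are skipped
    # Loop over columns, not rows: Series.fillna aligns row_means on the row index.
    for col in value_cols:
        out[col] = out[col].fillna(row_means)
    return out.reset_index(drop=True)

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "val1": [1, np.nan, 2, np.nan, 3, 5],
    "val2": [np.nan, np.nan, 2, np.nan, np.nan, 5],
    "val3": [4, np.nan, 2, np.nan, 7, 5],
    "val4": [3, np.nan, 2, np.nan, np.nan, 5],
})

cleaned = drop_and_fill(df, ["val1", "val2", "val3", "val4"])
print(cleaned)
```

Because nothing is transposed, the id column keeps its integer dtype.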

CodePudding user response：

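Using a mask we can fill NaN values with the mean of the row. The id column has to be dropped before taking the mean; otherwise it is averaged in with the values and row 0 is filled with 2.25 instead of 2.67.

df.dropna(thresh=4).mask(df.isna(), df.drop(columns=["id"]).mean(axis=1), axis=0)

Output:

   id  val1      val2  val3  val4
0   1   1.0  2.666667   4.0   3.0
2   3   2.0  2.000000   2.0   2.0
5   6   5.0  5.000000   5.0   5.0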
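
Whether id takes part in the row mean changes the filled value, so it is worth checking explicitly. The sketch below compares the two means for row 0 and then applies mask with the id-free mean. The names kept, mean_with_id, mean_without_id and filled are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "val1": [1, np.nan, 2, np.nan, 3, 5],
    "val2": [np.nan, np.nan, 2, np.nan, np.nan, 5],
    "val3": [4, np.nan, 2, np.nan, 7, 5],
    "val4": [3, np.nan, 2, np.nan, np.nan, 5],
})

mean_with_id = df.mean(axis=1)                            # id is averaged in
mean_without_id = df.drop(columns=["id"]).mean(axis=1)    # value columns only

print(mean_with_id.loc[0])      # (1 + 1 + 4 + 3) / 4 = 2.25
print(mean_without_id.loc[0])   # (1 + 4 + 3) / 3

# Replace each NaN cell with the id-free mean of its own row (axis=0 aligns on rows).
kept = df.dropna(thresh=4)
filled = kept.mask(kept.isna(), mean_without_id.loc[kept.index], axis=0)
print(filled)
```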