Pandas interpolate modified to exclude fill outside limit-CodePudding

I am looking for a way to do a modified pandas interpolate so that consecutive NaN values outside the limit aren't filled into the dataframe.

If this is the dataframe that I am starting with:

df = pd.DataFrame({'col1': [0, np.nan, np.nan, np.nan, 3, 4],
                   'col2': [np.nan, 1, 2, np.nan, 4, np.nan],
                   'col3': [4, np.nan, np.nan, 7, 10, 11]})

df
   col1  col2  col3
0   0.0   NaN   4.0
1   NaN   1.0   NaN
2   NaN   2.0   NaN
3   NaN   NaN   7.0
4   3.0   4.0  10.0
5   4.0   NaN  11.0

and I specify that I want to interpolate with a limit of two, with an inside limit area, as seen below: df.interpolate(method="linear", limit=2, limit_area="inside")
This is the result:

   col1  col2  col3
0   0.00   NaN   4.0
1   0.75   1.0   5.0
2   1.50   2.0   6.0
3    NaN   3.0   7.0
4   3.00   4.0  10.0
5   4.00   NaN  11.0

However, I'm looking for an alternate solution so that the interpolate fill only occurs if there equal to or less than the limit NaNs in a row for a specific column. So that, my desired result would look like this:

   col1  col2  col3
0   0.00   NaN   4.0
1    NaN   1.0   5.0
2    NaN   2.0   6.0
3    NaN   3.0   7.0
4   3.00   4.0  10.0
5   4.00   NaN  11.0

The first column is not filled because there are more than the limit (2) NaNs in a row.

CodePudding user response：

We can just work on individual columns, and apply:

def interpolate(series, thresh=2):
    # where the nan values are
    nans = series.isna()

    # calculate the size of consecutive `nan`
    mask = nans.groupby([(~nans).cumsum(),nans]).transform('size') > thresh
    return series.interpolate(method='linear', limit_area='inside').mask(mask)

df.apply(interpolate)

Note: If you do, e.g. interpolate(df['col1']), then mask would be:

0    False
1     True
2     True
3     True
4    False
5    False
Name: col1, dtype: bool

Output:

   col1  col2  col3
0   0.0   NaN   4.0
1   NaN   1.0   5.0
2   NaN   2.0   6.0
3   NaN   3.0   7.0
4   3.0   4.0  10.0
5   4.0   NaN  11.0

CodePudding user response：

Find where there is a group of three nan and don't put interpolation values there:

mask = df.rolling(3, 0, center=True).count().eq(0).replace(False, np.nan).bfill(limit=1).ffill(limit=1)
df.where(mask, df.interpolate(limit_area='inside'), inplace=True)

Output:

   col1  col2  col3
0   0.0   NaN   4.0
1   NaN   1.0   5.0
2   NaN   2.0   6.0
3   NaN   3.0   7.0
4   3.0   4.0  10.0
5   4.0   NaN  11.0

CodePudding user response：

for i in range(len(df)):
    t=0
    s=df.iloc[i,].isnull().sum()
    if(s>1):
        for col in df.columns:
            print(type(df[col].iloc[i]))
            if pd.isnull(df[col].iloc[i]):
                t =1
                if t>1:
                    df[col][i]=df[col][i-1] 1