Pandas interpolate with condition

I have a DataFrame like this:

import pandas as pd
import numpy as np

df = pd.DataFrame({'date': pd.date_range(start='2021-11-13', periods=11),
                   'a': [1, np.nan, np.nan, np.nan, np.nan, 2, np.nan, np.nan, np.nan, np.nan, 3],
                   'b': [4, np.nan, np.nan, np.nan, np.nan, 6, np.nan, np.nan, np.nan, np.nan, 7],
                   }).set_index('date')

              a    b
date                
2021-11-13  1.0  4.0
2021-11-14  NaN  NaN
2021-11-15  NaN  NaN
2021-11-16  NaN  NaN
2021-11-17  NaN  NaN
2021-11-18  2.0  6.0
2021-11-19  NaN  NaN
2021-11-20  NaN  NaN
2021-11-21  NaN  NaN
2021-11-22  NaN  NaN
2021-11-23  3.0  7.0

How do I linearly interpolate it up to a fraction n of the interval between two non-NaN values, then fill the rest with the upper bound?
The interval between two non-NaN values is constant throughout the DataFrame.
For example, with n = 0.5:

                   a         b
date                          
2021-11-13  1.000000  4.000000  << # original value -----------------
2021-11-14  1.333333  4.666667                                       |
2021-11-15  1.666667  5.333333                                       |  50% linearly interpolated,
2021-11-16  2.000000  6.000000  <- linear interpolation up to here   |  rest filled with the
2021-11-17  2.000000  6.000000                                       |  upper bound
2021-11-18  2.000000  6.000000  << # original value (upper bound) ---
2021-11-19  2.333333  6.333333
2021-11-20  2.666667  6.666667
2021-11-21  3.000000  7.000000  <- linear interpolation up to here
2021-11-22  3.000000  7.000000
2021-11-23  3.000000  7.000000  << # original value (upper bound)

CodePudding user response：

I don't think pandas has a built-in function for this, so you'll have to write your own:

def interpol_segment(df, r):
    # index labels of rows with a NaN in any column (assumes a daily DatetimeIndex)
    nan_idx = df[df.isna().any(axis=1)].index
    if nan_idx.empty:           # nothing to fill
        return df.copy()
    consecutive_nan = []        # list of runs, each a list of consecutive dates
    date_inter = [nan_idx[0]]   # the run currently being built
    for i, date in enumerate(nan_idx[1:], start=1):
        if date - nan_idx[i-1] == pd.Timedelta(1, unit="day"):
            date_inter.append(date)
        else:
            consecutive_nan.append(date_inter)
            date_inter = [date]

    consecutive_nan.append(date_inter)  # append the last run
    # dates to be filled with the upper bound: the tail of each run
    capped_date = [d for x in consecutive_nan for i, d in enumerate(x) if i >= int(r*len(x))]
    df_capped = df.loc[capped_date]
    # interpolate without the capped rows
    df_inter = df.drop(capped_date).interpolate()
    # reassemble and back-fill the capped rows
    return pd.concat([df_inter, df_capped]).sort_index().bfill()


print(interpol_segment(df, 0.5))

Output:

                   a         b
date                          
2021-11-13  1.000000  4.000000
2021-11-14  1.333333  4.666667
2021-11-15  1.666667  5.333333
2021-11-16  2.000000  6.000000
2021-11-17  2.000000  6.000000
2021-11-18  2.000000  6.000000
2021-11-19  2.333333  6.333333
2021-11-20  2.666667  6.666667
2021-11-21  3.000000  7.000000
2021-11-22  3.000000  7.000000
2021-11-23  3.000000  7.000000
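
If the index is not daily, or the columns have other names, the same masking idea can be written per column. What follows is only a minimal sketch: interpolate_then_cap is a hypothetical helper name, and it assumes a unique index and that the first int(r * gap length) NaNs of each gap are interpolated while the rest take the upper bound, as in the function above.

```python
import numpy as np
import pandas as pd


def interpolate_then_cap(s, r):
    """Hypothetical helper: interpolate the head of each NaN gap, back-fill the tail."""
    is_nan = s.isna().astype(int)
    # NaNs inherit the running count of valid values seen so far, so every
    # gap (together with the valid value before it) gets its own id
    gap_id = s.notna().cumsum()
    pos = is_nan.groupby(gap_id).cumsum() - 1          # 0-based position inside the gap
    gap_len = is_nan.groupby(gap_id).transform("sum")  # number of NaNs in the gap
    # tail of each gap: these cells take the upper bound instead of a slope
    capped = (is_nan == 1) & (pos >= np.floor(r * gap_len))
    # interpolate with the capped cells removed, then restore them and back-fill
    return s[~capped].interpolate().reindex(s.index).bfill()


df = pd.DataFrame(
    {"a": [1, np.nan, np.nan, np.nan, np.nan, 2, np.nan, np.nan, np.nan, np.nan, 3],
     "b": [4, np.nan, np.nan, np.nan, np.nan, 6, np.nan, np.nan, np.nan, np.nan, 7]},
    index=pd.date_range("2021-11-13", periods=11),
)
result = df.apply(interpolate_then_cap, r=0.5)
print(result)
```

Because the gap length is computed per gap, this version should also cope with gaps of different sizes, which the question does not require.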

CodePudding user response：

Adjusting the limit argument gets the job done.
It is a slightly tricky but efficient solution.

# Here's the tricky part: calculate the number of NaNs in each interval
t = df.iloc[:, 0]
x = len(t)              # total length
y = t.isna().sum()      # number of NaNs
z = x - y               # number of non-NaNs
w = y/(z-1)             # number of NaNs in each interval
n = 0.5

# use limit to back-fill the last (1 - n) share of the NaNs in each interval
# (for n = 0.5 this is the same count as the n share)
df = df.bfill(limit=int(np.ceil(w*(1-n))))

# linearly interpolate the rest
print(df.interpolate())

                   a         b
date                          
2021-11-13  1.000000  4.000000
2021-11-14  1.333333  4.666667
2021-11-15  1.666667  5.333333
2021-11-16  2.000000  6.000000
2021-11-17  2.000000  6.000000
2021-11-18  2.000000  6.000000
2021-11-19  2.333333  6.333333
2021-11-20  2.666667  6.666667
2021-11-21  3.000000  7.000000
2021-11-22  3.000000  7.000000
2021-11-23  3.000000  7.000000
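
The limit approach can also be wrapped in a function so that n becomes a parameter. This is a sketch under stated assumptions, not a definitive version: interpolate_with_limit is a made-up name, every gap is assumed to have the same length (as the question states), at least two non-NaN values are assumed to exist, and the number of back-filled cells is taken to be the (1 - n) share of each gap, which coincides with the n share only at n = 0.5.

```python
import numpy as np
import pandas as pd


def interpolate_with_limit(df, n):
    """Hypothetical wrapper: back-fill the tail of each gap, then interpolate the head."""
    first = df.iloc[:, 0]
    n_nan = first.isna().sum()
    n_valid = first.notna().sum()
    gap = n_nan / (n_valid - 1)            # NaNs per gap, assuming equal gaps
    # cells per gap that take the upper bound; rounding guards against
    # floating-point noise such as 3.0000000000000004 before the ceil
    n_fill = int(np.ceil(round(gap * (1 - n), 9)))
    if n_fill > 0:                         # bfill rejects a limit of 0
        df = df.bfill(limit=n_fill)
    return df.interpolate()


df = pd.DataFrame(
    {"a": [1, np.nan, np.nan, np.nan, np.nan, 2, np.nan, np.nan, np.nan, np.nan, 3],
     "b": [4, np.nan, np.nan, np.nan, np.nan, 6, np.nan, np.nan, np.nan, np.nan, 7]},
    index=pd.date_range("2021-11-13", periods=11),
)
half = interpolate_with_limit(df, 0.5)
quarter = interpolate_with_limit(df, 0.25)
print(half)
print(quarter)
```

With n = 0.25 only the first NaN of each gap is interpolated and the remaining three are back-filled, which is where the choice between the n share and the (1 - n) share starts to matter.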