I have a dataframe like this
import pandas as pd
import numpy as np
df = pd.DataFrame({'date':pd.date_range(start='13/11/2021', periods=11),
'a': [1, np.nan, np.nan, np.nan, np.nan, 2, np.nan, np.nan, np.nan, np.nan, 3],
'b': [4, np.nan, np.nan, np.nan, np.nan, 6, np.nan, np.nan, np.nan, np.nan, 7],
}).set_index('date')
a b
date
2021-11-13 1.0 4.0
2021-11-14 NaN NaN
2021-11-15 NaN NaN
2021-11-16 NaN NaN
2021-11-17 NaN NaN
2021-11-18 2.0 6.0
2021-11-19 NaN NaN
2021-11-20 NaN NaN
2021-11-21 NaN NaN
2021-11-22 NaN NaN
2021-11-23 3.0 7.0
How do I linearly interpolate it upto n% interval between two non-nan values then fill the rest with the upper bound.
The interval between two non-nan values will remain constant throughout the dataframe.
Now for example, with n = 0.5
a b
date
2021-11-13 1.000000 4.000000 << # original value --------------
2021-11-14 1.333333 4.666667 |
2021-11-15 1.666667 5.333333 | 50% linearly
2021-11-16 2.000000 6.000000 <- linear interpolation upto here | interpolated rest are
2021-11-17 2.000000 6.000000 | filled.
2021-11-18 2.000000 6.000000 << # original value (upper bound)---
2021-11-19 2.333333 6.333333
2021-11-20 2.666667 6.666667
2021-11-21 3.000000 7.000000 <- linear interpolation upto here
2021-11-22 3.000000 7.000000
2021-11-23 3.000000 7.000000 << # original value (upper bound)
CodePudding user response:
I don't think there is a pandas function for this purpose. You'll have to create yours:
def interpol_segment(df, r):
nan_idx = df[pd.isna(df.a) | pd.isna(df.b)].index # nan rows
consecutive_nan = [] # a list of range (list)
date_inter = [nan_idx[0]] # range (list) of consecutive dates
for i, date in enumerate(nan_idx[1:], start=1):
if date - nan_idx[i-1] == pd.Timedelta(1, unit="day"):
date_inter.append(date)
else:
consecutive_nan.append(date_inter)
date_inter = [date]
consecutive_nan.append(date_inter) # appending last interval
# dates to be filled with upper bound:
capped_date = [x[i] for x in consecutive_nan for i,_ in enumerate(x) if i >= int(r*len(x))]
df_capped = df.loc[capped_date]
# interpoling without the capped rows:
df_inter = df.drop(capped_date).interpolate()
# reassembling and filling with bfill:
return pd.concat([df_inter,df_capped]).sort_index().bfill()
print(interpol_segment(df, 0.5))
Output:
a b
date
2021-11-13 1.000000 4.000000
2021-11-14 1.333333 4.666667
2021-11-15 1.666667 5.333333
2021-11-16 2.000000 6.000000
2021-11-17 2.000000 6.000000
2021-11-18 2.000000 6.000000
2021-11-19 2.333333 6.333333
2021-11-20 2.666667 6.666667
2021-11-21 3.000000 7.000000
2021-11-22 3.000000 7.000000
2021-11-23 3.000000 7.000000
CodePudding user response:
Tampering with the limit
argument seems to get the job done.
A little tricky but efficient solution.
# Heres the tricky part, calculate the no of nans in each interval
t = df.iloc[:, 0]
x = len(t) # total length
y = t.isna().sum() # no of nans
z = x - y # no of non-nans
w = y/(z-1) # no of nans in each interval
n = 0.5
# use the limit to backfill n% nans
df = df.interpolate(method='bfill', limit=int(np.ceil(w*n)))
# Linear inerpolate the rest
print(df.interpolate())
a b
date
2021-11-13 1.000000 4.000000
2021-11-14 1.333333 4.666667
2021-11-15 1.666667 5.333333
2021-11-16 2.000000 6.000000
2021-11-17 2.000000 6.000000
2021-11-18 2.000000 6.000000
2021-11-19 2.333333 6.333333
2021-11-20 2.666667 6.666667
2021-11-21 3.000000 7.000000
2021-11-22 3.000000 7.000000
2021-11-23 3.000000 7.000000