Interpolated values for specific missing indices of DataFrame or Series

Time:09-20

With a dataframe like this:

import pandas as pd

df = pd.DataFrame([
    {'key':  1, 'value': 0.4},
    {'key':  4, 'value': 0.5},
    {'key':  6, 'value': 0.7},
    {'key': 10, 'value': 1.3},
    {'key': 11, 'value': 1.4},
    {'key': 13, 'value': 1.1},
])
df.set_index('key', inplace=True)

I'd like to extract values that are either in the dataframe, or should be interpolated from existing values.

I'm aware of DataFrame.interpolate() and it's perfect for quickly computing interpolated values for indices with NaN values. So, an approach could be to add all the indices that aren't already in the index, sort the dataframe by index, interpolate and then extract the values again. Something like:

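import numpy as np

indices = [3, 6, 9, 12, 15]

# One NaN row for every requested index that is not already in df
new_rows = pd.DataFrame([
    {'key': index, 'value': np.nan} for index in indices if index not in df.index
])
new_rows.set_index('key', inplace=True)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
# (method='spline' requires SciPy to be installed)
result = pd.concat([df, new_rows]).sort_index().interpolate(method='spline', order=2)

print(result.loc[indices, 'value'])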

Result:

key
3     0.529559
6     0.700000
9     1.073190
12    1.252086
15    1.369036
Name: value, dtype: float64

However, creating an additional dataframe, appending it to the original, sorting by index, calling .interpolate() on the whole result and then extracting the required values seems a lot more complicated than I expected.

I was hoping for something like:

# fictional, doesn't exist:
result = df.interpolated(indices)  # a DataFrame with only the rows for given indices, interpolated as needed
print(result['value'])

Or:

# fictional, doesn't exist:
result = df['value'].interpolated(indices)  # perhaps only on a Series
print(result)

Am I missing something obvious, and is similar functionality already available? Or is my approach above close to the best way to do it?

After posting, I found a somewhat nicer approach myself, but would still like to hear if someone knows of a more efficient, pythonic or simpler approach:

indices = [3, 6, 9, 12, 15]


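def interpolated(df, indices, *args, **kwargs):
    # DataFrame.append was removed in pandas 2.0, so reindex on the union
    # of the existing and requested indices instead; the union is sorted
    # and the new rows are filled with NaN.
    full_index = df.index.union(indices)
    # Interpolate the NaN rows, then keep only the requested indices
    return df.reindex(full_index).interpolate(*args, **kwargs).loc[indices]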


print(interpolated(df, indices, 'spline', order=2))
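
One thing to watch with this pattern is the choice of method. The default method='linear' treats rows as equally spaced and ignores the index values, whereas method='index' uses the numeric index as the x-axis. A minimal sketch of the difference, using the same data as above (the names s, wanted, expanded, by_position and by_index are only illustrative):

```python
import pandas as pd

# Same data as the question, as a Series indexed by key
s = pd.Series(
    [0.4, 0.5, 0.7, 1.3, 1.4, 1.1],
    index=pd.Index([1, 4, 6, 10, 11, 13], name="key"),
    name="value",
)

wanted = [3, 9, 12]

# Insert NaN rows for the requested keys; union() returns a sorted index
expanded = s.reindex(s.index.union(wanted))

# 'linear' ignores the index and assumes equally spaced rows
by_position = expanded.interpolate(method="linear").loc[wanted]

# 'index' interpolates against the numeric index values
by_index = expanded.interpolate(method="index").loc[wanted]

print(by_position)
print(by_index)
```

For key 3, which sits between keys 1 and 4, 'linear' gives the midpoint 0.45 while 'index' gives about 0.4667, so 'index' is probably what is wanted when the keys are unevenly spaced.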

Answer:

You can use scipy's interp1d:

from scipy.interpolate import interp1d

interp = interp1d(df.index, df, axis=0)

interp([3,6,9])

Output (I duplicated the value column):

array([[0.46666667, 0.46666667],
       [0.7       , 0.7       ],
       [1.15      , 1.15      ]])
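
If only linear interpolation is needed, a similar result can be sketched with NumPy's np.interp instead of interp1d, wrapped so that it returns a labelled Series rather than a bare array. This is a swapped-in alternative rather than the approach from the answer, and the helper name interp_at is hypothetical. It assumes the index is numeric and sorted in increasing order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"value": [0.4, 0.5, 0.7, 1.3, 1.4, 1.1]},
    index=pd.Index([1, 4, 6, 10, 11, 13], name="key"),
)


def interp_at(series, indices):
    """Linearly interpolate `series` at `indices` (hypothetical helper).

    Assumes a numeric, increasing index. np.interp does not extrapolate:
    points outside the index range get the nearest end value.
    """
    values = np.interp(
        indices,
        series.index.to_numpy(dtype=float),
        series.to_numpy(dtype=float),
    )
    # Re-attach the requested indices and the original names as labels
    return pd.Series(
        values,
        index=pd.Index(indices, name=series.index.name),
        name=series.name,
    )


result = interp_at(df["value"], [3, 6, 9])
print(result)
```

The values match the interp1d output above (about 0.4667, 0.7 and 1.15), but the result keeps the key index and the value name.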