Row-wise linear imputation-CodePudding

I have a data frame with time-series features. I want to impute missing values with row-wise linear imputation.

As a reproducible example:

import pandas as pd
import numpy as np
df = pd.DataFrame({'id': range(2), 
                   'F1_Date_1': [1,2], 
                   'F1_Date_2': [np.nan,4], 
                   'F1_Date_3': [3, 6],
                   'F1_Date_4': [4,8], 
                   'F2_Date_1': [2,11], 
                   'F2_Date_2': [6, np.nan], 
                   'F2_Date_3': [10, np.nan],
                   'F2_Date_4': [14, 17]})
df

   id  F1_Date_1  F1_Date_2  ...  F2_Date_2  F2_Date_3  F2_Date_4
0   0          1        NaN  ...        6.0       10.0         14
1   1          2        4.0  ...        NaN        NaN         17

For F1, I want to linearly impute (interpolate) F1_Date_2 using F1_Date_1 and F1_Date_3. For F2, I want to impute F2_Date_2 and F2_Date_3 using F2_Date_1 and F2_Date_4

The desired output is

final_df = pd.DataFrame({'id': range(2), 
                   'F1_Date_1': [1,2], 
                   'F1_Date_2': [2,4], 
                   'F1_Date_3': [3, 6],
                   'F1_Date_4': [4,8], 
                   'F2_Date_1': [2,11], 
                   'F2_Date_2': [6, 13], 
                   'F2_Date_3': [10, 15],
                   'F2_Date_4': [14, 17]})

   id  F1_Date_1  F1_Date_2  ...  F2_Date_2  F2_Date_3  F2_Date_4
0   0          1          2  ...          6         10         14
1   1          2          4  ...         13         15         17

How can I do that on a very large dataset (potentially 10 million rows, 8 dates, and 15 features) efficiently in Python?

CodePudding user response：

Use groupby per columns - values before _ like F1, F2 with custom function for interpolate:

df = (df.set_index('id')
        .groupby(lambda x: x.split('_')[0], axis=1)
        .apply(lambda x: x.interpolate(axis=1)))

print (df)
    F1_Date_1  F1_Date_2  F1_Date_3  F1_Date_4  F2_Date_1  F2_Date_2  \
id                                                                     
0         1.0        2.0        3.0        4.0        2.0        6.0   
1         2.0        4.0        6.0        8.0       11.0       13.0   

    F2_Date_3  F2_Date_4  
id                        
0        10.0       14.0  
1        15.0       17.0

CodePudding user response：

I managed to do it by melting, interpolating and 'unmelting' the dataframe.

df = pd.melt(df, id_vars=["id"])
df = df.groupby(df['variable'].str[:2]).apply(lambda g: g.interpolate()).pivot_table(index='id', columns='variable', values='value').reset_index().rename_axis(None, axis=1)
df

CodePudding user response：

IIUC, you want to do row-wise interpolation:

final_df = df.interpolate(axis=1)

There is no need to explicitly distinguish between the F1 and the F2 parts if the outmost columns in each part, i.e. F1_Date_1, F1_Date_4, F2_Date_1, and F2_Date_4 do not contain any missing values.