I have a data frame with time-series features. I want to impute missing values with row-wise linear imputation.
As a reproducible example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': range(2),
'F1_Date_1': [1,2],
'F1_Date_2': [np.nan,4],
'F1_Date_3': [3, 6],
'F1_Date_4': [4,8],
'F2_Date_1': [2,11],
'F2_Date_2': [6, np.nan],
'F2_Date_3': [10, np.nan],
'F2_Date_4': [14, 17]})
df
id F1_Date_1 F1_Date_2 ... F2_Date_2 F2_Date_3 F2_Date_4
0 0 1 NaN ... 6.0 10.0 14
1 1 2 4.0 ... NaN NaN 17
For F1, I want to linearly impute (interpolate) F1_Date_2
using F1_Date_1
and F1_Date_3
. For F2, I want to impute F2_Date_2
and F2_Date_3
using F2_Date_1
and F2_Date_4
The desired output is
final_df = pd.DataFrame({'id': range(2),
'F1_Date_1': [1,2],
'F1_Date_2': [2,4],
'F1_Date_3': [3, 6],
'F1_Date_4': [4,8],
'F2_Date_1': [2,11],
'F2_Date_2': [6, 13],
'F2_Date_3': [10, 15],
'F2_Date_4': [14, 17]})
id F1_Date_1 F1_Date_2 ... F2_Date_2 F2_Date_3 F2_Date_4
0 0 1 2 ... 6 10 14
1 1 2 4 ... 13 15 17
How can I do that on a very large dataset (potentially 10 million rows, 8 dates, and 15 features) efficiently in Python?
CodePudding user response:
Use groupby
per columns - values before _
like F1, F2
with custom function for interpolate:
df = (df.set_index('id')
.groupby(lambda x: x.split('_')[0], axis=1)
.apply(lambda x: x.interpolate(axis=1)))
print (df)
F1_Date_1 F1_Date_2 F1_Date_3 F1_Date_4 F2_Date_1 F2_Date_2 \
id
0 1.0 2.0 3.0 4.0 2.0 6.0
1 2.0 4.0 6.0 8.0 11.0 13.0
F2_Date_3 F2_Date_4
id
0 10.0 14.0
1 15.0 17.0
CodePudding user response:
I managed to do it by melting, interpolating and 'unmelting' the dataframe.
df = pd.melt(df, id_vars=["id"])
df = df.groupby(df['variable'].str[:2]).apply(lambda g: g.interpolate()).pivot_table(index='id', columns='variable', values='value').reset_index().rename_axis(None, axis=1)
df
CodePudding user response:
IIUC, you want to do row-wise interpolation:
final_df = df.interpolate(axis=1)
There is no need to explicitly distinguish between the F1 and the F2 parts if the outmost columns in each part, i.e. F1_Date_1
, F1_Date_4
, F2_Date_1
, and F2_Date_4
do not contain any missing values.