Compare two DataFrames and find missing timestamps-CodePudding

I have the following two dataframes:

df1=

   date                col1
0  2023-01-01 16:00:00 100
1  2023-01-01 16:15:00 120
2  2023-01-01 16:30:00 140
3  2023-01-01 16:45:00 160
4  2023-01-01 17:00:00 200
5  2023-01-01 17:15:00 430
6  2023-01-01 17:30:00 890

df2 =

   date                col2 col3 
0  2023-01-01 16:00:00 100  200
1  2023-01-01 16:15:00 120  400
2  2023-01-01 17:00:00 200  500

and in df2 I have some missing timestamps compared to df1. I am able to find those timestamps using the following code:

df1[~df1['date'].isin(df2['date'])]

I want to populate those missing timestamps in df2 and fill in the values of the columns with the average value of the two previous rows.

So the new df2 should look like this:

df2 =

   date                col2    col3 
0  2023-01-01 16:00:00 100     200
1  2023-01-01 16:15:00 120     400
2  2023-01-01 16:30:00 110     300
3  2023-01-01 16:45:00 115     350
4  2023-01-01 17:00:00 200     500
5  2023-01-01 17:15:00 257.5   425
6  2023-01-01 17:30:00 228.75  462.5

CodePudding user response：

Not ideal solution via iteration:

df1 = [
    ['2023-01-01 16:00:00', 100],
    ['2023-01-01 16:15:00', 120],
    ['2023-01-01 16:30:00', 140],
    ['2023-01-01 16:45:00', 160],
    ['2023-01-01 17:00:00', 200],
    ['2023-01-01 17:15:00', 430],
    ['2023-01-01 17:30:00', 890],
]

df2 = [
    ['2023-01-01 16:00:00', 100,  200],
    ['2023-01-01 16:15:00', 120,  400],
    ['2023-01-01 17:00:00', 200,  500],
]

df1= pd.DataFrame(df1, columns = ['date', 'col1'])

df2= pd.DataFrame(df2, columns = ['date', 'col2', 'col3'])

missing = df1[~df1['date'].isin(df2['date'])]
missing = missing.drop(['col1'], axis=1)

merged = pd.concat([df2, missing])
merged.sort_values('date', inplace=True, ignore_index=True)

for index, row in merged.iterrows():
    if np.isnan(row['col2']):
        merged['col2'].at[index] = merged['col2'].iloc[[index-1, index-2]].mean()
    if np.isnan(row['col3']):
        merged['col3'].at[index] = merged['col3'].iloc[[index-1, index-2]].mean()

print(merged)

Output:

date	col2	col3
2023-01-01 16:00:00	100.00	200.0
2023-01-01 16:15:00	120.00	400.0
2023-01-01 16:30:00	110.00	300.0
2023-01-01 16:45:00	115.00	350.0
2023-01-01 17:00:00	200.00	500.0
2023-01-01 17:15:00	157.50	425.0
2023-01-01 17:30:00	178.75	462.5