I have a dataframe with a column named DateTime whose values should occur every 5 seconds. A few rows are missing, which can be seen from the time difference between the previous and the current row. I want to insert the missing rows and fill the other column with the previous row's value.
My sample dataframe looks like this:
DateTime Price
2022-03-04 09:15:00 34526.00
2022-03-04 09:15:05 34487.00
2022-03-04 09:15:10 34470.00
2022-03-04 09:15:20 34466.00
2022-03-04 09:15:45 34448.00
The result dataframe should look like this:
DateTime Price
2022-03-04 09:15:00 34526.00
2022-03-04 09:15:05 34487.00
2022-03-04 09:15:10 34470.00
2022-03-04 09:15:15 34470.00 <----Insert Row and keep Price same as previous row
2022-03-04 09:15:20 34466.00
2022-03-04 09:15:25 34466.00 <----Insert Row and keep Price same as previous row
2022-03-04 09:15:30 34466.00 <----Insert Row and keep Price same as previous row
2022-03-04 09:15:35 34466.00 <----Insert Row and keep Price same as previous row
2022-03-04 09:15:40 34466.00 <----Insert Row and keep Price same as previous row
2022-03-04 09:15:45 34448.00
CodePudding user response:
Try resample, then ffill:
df['DateTime'] = pd.to_datetime(df['DateTime']) # change to datetime dtype
df = df.set_index('DateTime') # move DateTime into index
df_out = df.resample('5s').ffill() # resample to 5-second bins and forward fill (the uppercase 'S' alias is deprecated since pandas 2.2)
Output:
Price
DateTime
2022-03-04 09:15:00 34526.0
2022-03-04 09:15:05 34487.0
2022-03-04 09:15:10 34470.0
2022-03-04 09:15:15 34470.0
2022-03-04 09:15:20 34466.0
2022-03-04 09:15:25 34466.0
2022-03-04 09:15:30 34466.0
2022-03-04 09:15:35 34466.0
2022-03-04 09:15:40 34466.0
2022-03-04 09:15:45 34448.0
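If it helps, here is a self-contained version you can paste and run. The sample frame is rebuilt from the question; the names step, gaps and no_gaps_left are my own and are only there to show where the gaps were and that none remain afterwards. Treat it as a sketch for this sample data, not a general solution.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "DateTime": [
        "2022-03-04 09:15:00",
        "2022-03-04 09:15:05",
        "2022-03-04 09:15:10",
        "2022-03-04 09:15:20",
        "2022-03-04 09:15:45",
    ],
    "Price": [34526.0, 34487.0, 34470.0, 34466.0, 34448.0],
})
df["DateTime"] = pd.to_datetime(df["DateTime"])

# Rows that are more than 5 seconds after the previous row follow a gap.
step = pd.Timedelta(seconds=5)
gaps = df[df["DateTime"].diff() > step]

# Resample onto a 5-second grid and carry the last known price forward.
df_out = df.set_index("DateTime").resample("5s").ffill()

# After filling, consecutive timestamps should be exactly 5 seconds apart.
no_gaps_left = (df_out.index.to_series().diff().dropna() == step).all()

print(gaps)
print(df_out)
print(no_gaps_left)
```

Note that resample builds the grid from the first to the last timestamp, so it only fills gaps inside that range.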
CodePudding user response:
An alternative, using an outer join:
t = pd.date_range(df.DateTime.min(), df.DateTime.max(), freq="5s", name="DateTime")
pd.merge(pd.DataFrame(t), df, how="outer").ffill()
Output:
DateTime Price
0 2022-03-04 09:15:00 34526.0
1 2022-03-04 09:15:05 34487.0
2 2022-03-04 09:15:10 34470.0
3 2022-03-04 09:15:15 34470.0
4 2022-03-04 09:15:20 34466.0
5 2022-03-04 09:15:25 34466.0
6 2022-03-04 09:15:30 34466.0
7 2022-03-04 09:15:35 34466.0
8 2022-03-04 09:15:40 34466.0
9 2022-03-04 09:15:45 34448.0
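A possible extension, in case you also want to know which rows were inserted (like the arrows in the question): merge accepts indicator=True, which adds a _merge column telling you where each row came from. The Inserted column name and the sample frame are my own assumptions for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2022-03-04 09:15:00",
        "2022-03-04 09:15:05",
        "2022-03-04 09:15:10",
        "2022-03-04 09:15:20",
        "2022-03-04 09:15:45",
    ]),
    "Price": [34526.0, 34487.0, 34470.0, 34466.0, 34448.0],
})

# Full 5-second grid between the first and the last timestamp.
t = pd.date_range(df["DateTime"].min(), df["DateTime"].max(), freq="5s")
grid = pd.DataFrame({"DateTime": t})

# indicator=True adds a "_merge" column: "both" for timestamps that exist
# in df, "left_only" for grid timestamps that had no matching row.
merged = pd.merge(grid, df, how="outer", on="DateTime", indicator=True)
merged = merged.sort_values("DateTime", ignore_index=True)

# Flag the inserted rows, then forward fill only the Price column.
merged["Inserted"] = merged["_merge"] == "left_only"
merged["Price"] = merged["Price"].ffill()
merged = merged.drop(columns="_merge")

print(merged)
```

Filling only Price keeps the flag column untouched; calling ffill() on the whole frame would work here too, since nothing else contains missing values.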
CodePudding user response:
Another option:
Create a new dataframe with the range of dates you want
df_2 = pd.DataFrame({ "DateTime": pd.date_range(start=df.loc[0, "DateTime"], end=df.loc[len(df.index)-1, "DateTime"], freq="5s") })
Merge the new and the original dataframe using outer join
df = pd.merge(df, df_2, how="outer").sort_values("DateTime")
Fill empty values using .ffill() (the older spelling .fillna(method="ffill") is deprecated since pandas 2.1)
df.ffill()
Output:
DateTime Price
0 2022-03-04 09:15:00 34526.0
1 2022-03-04 09:15:05 34487.0
2 2022-03-04 09:15:10 34470.0
5 2022-03-04 09:15:15 34470.0
3 2022-03-04 09:15:20 34466.0
6 2022-03-04 09:15:25 34466.0
7 2022-03-04 09:15:30 34466.0
8 2022-03-04 09:15:35 34466.0
9 2022-03-04 09:15:40 34466.0
4 2022-03-04 09:15:45 34448.0
Resulting code:
df_2 = pd.DataFrame({
"DateTime": pd.date_range(start=df.loc[0, "DateTime"], end=df.loc[len(df.index)-1, "DateTime"], freq="5s")
})
df = pd.merge(df, df_2, how="outer").sort_values("DateTime")
df = df.ffill()  # forward fill; replaces the deprecated fillna(method="ffill")
print(df)
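One small thing about the output above: after sort_values the index labels are out of order (0, 1, 2, 5, 3, ...). If that bothers you, a sketch like the following renumbers them. The sample frame is rebuilt from the question, and using .iloc for the first and last timestamp is my own choice so the code does not depend on the index labels.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2022-03-04 09:15:00",
        "2022-03-04 09:15:05",
        "2022-03-04 09:15:10",
        "2022-03-04 09:15:20",
        "2022-03-04 09:15:45",
    ]),
    "Price": [34526.0, 34487.0, 34470.0, 34466.0, 34448.0],
})

# Grid from the first to the last timestamp (assumes df is already sorted).
df_2 = pd.DataFrame({
    "DateTime": pd.date_range(
        start=df["DateTime"].iloc[0],
        end=df["DateTime"].iloc[-1],
        freq="5s",
    )
})

# ignore_index=True renumbers the rows 0..n-1 after sorting.
df = pd.merge(df, df_2, how="outer").sort_values("DateTime", ignore_index=True)
df = df.ffill()

print(df)
```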