Insert missing rows in a DataFrame and fill the other columns with the previous row's values

Time:03-06

I have a dataframe with a column named DateTime whose values are populated every 5 seconds. A few rows are missing, which can be identified from the time difference between the previous and the current row. I want to insert the missing rows and populate the other columns with the previous row's values.

My sample dataframe looks like this:

           DateTime       Price
2022-03-04 09:15:00    34526.00
2022-03-04 09:15:05    34487.00
2022-03-04 09:15:10    34470.00
2022-03-04 09:15:20    34466.00
2022-03-04 09:15:45    34448.00

The desired result dataframe:

           DateTime       Price
2022-03-04 09:15:00    34526.00
2022-03-04 09:15:05    34487.00
2022-03-04 09:15:10    34470.00
2022-03-04 09:15:15    34470.00 <----Insert Row and keep Price same as previous row
2022-03-04 09:15:20    34466.00
2022-03-04 09:15:25    34466.00 <----Insert Row and keep Price same as previous row
2022-03-04 09:15:30    34466.00 <----Insert Row and keep Price same as previous row
2022-03-04 09:15:35    34466.00 <----Insert Row and keep Price same as previous row
2022-03-04 09:15:40    34466.00 <----Insert Row and keep Price same as previous row
2022-03-04 09:15:45    34448.00
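
The gaps described in the question can be located with `Series.diff()` on the DateTime column. The following is a minimal sketch, assuming the sample data above and a fixed 5-second interval; the variable names (`expected_step`, `gap`, `after_gap`, `missing_counts`) are illustrative and not part of the original post.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2022-03-04 09:15:00",
        "2022-03-04 09:15:05",
        "2022-03-04 09:15:10",
        "2022-03-04 09:15:20",
        "2022-03-04 09:15:45",
    ]),
    "Price": [34526.0, 34487.0, 34470.0, 34466.0, 34448.0],
})

expected_step = pd.Timedelta(seconds=5)  # assumed sampling interval

# Time elapsed since the previous row; the first row has no predecessor (NaT).
gap = df["DateTime"].diff()

# Rows that arrive later than expected, i.e. rows preceded by missing samples.
after_gap = df[gap > expected_step]

# Number of rows missing immediately before each of those rows.
missing_counts = gap[gap > expected_step] // expected_step - 1

print(after_gap)
print(missing_counts)
```

This only reports where rows are missing; it does not insert them.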

CodePudding user response:

Try resample then ffill:

import pandas as pd

df['DateTime'] = pd.to_datetime(df['DateTime']) # change to datetime dtype
df = df.set_index('DateTime')                   # move DateTime into index

df_out = df.resample('5s').ffill()              # resample to 5 seconds and forward fill ('5S' is deprecated in pandas 2.2+)

Output:

                       Price
DateTime                    
2022-03-04 09:15:00  34526.0
2022-03-04 09:15:05  34487.0
2022-03-04 09:15:10  34470.0
2022-03-04 09:15:15  34470.0
2022-03-04 09:15:20  34466.0
2022-03-04 09:15:25  34466.0
2022-03-04 09:15:30  34466.0
2022-03-04 09:15:35  34466.0
2022-03-04 09:15:40  34466.0
2022-03-04 09:15:45  34448.0
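
For anyone who wants to run this end to end, here is a self-contained sketch of the resample approach. It assumes the sample data from the question and that every timestamp already sits on the 5-second grid; the final `reset_index()` is an addition to get DateTime back as a regular column, as in the question's desired layout.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2022-03-04 09:15:00",
        "2022-03-04 09:15:05",
        "2022-03-04 09:15:10",
        "2022-03-04 09:15:20",
        "2022-03-04 09:15:45",
    ]),
    "Price": [34526.0, 34487.0, 34470.0, 34466.0, 34448.0],
})

# Upsample onto a 5-second grid; new rows take the previous row's values.
df_out = df.set_index("DateTime").resample("5s").ffill()

# Optional: turn the DateTime index back into an ordinary column.
result = df_out.reset_index()

print(result)
```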

CodePudding user response:

An alternative, using an outer join:

t = pd.date_range(df.DateTime.min(), df.DateTime.max(), freq="5s", name="DateTime")
pd.merge(pd.DataFrame(t), df, how="outer").ffill()

Output:

             DateTime    Price
0 2022-03-04 09:15:00  34526.0
1 2022-03-04 09:15:05  34487.0
2 2022-03-04 09:15:10  34470.0
3 2022-03-04 09:15:15  34470.0
4 2022-03-04 09:15:20  34466.0
5 2022-03-04 09:15:25  34466.0
6 2022-03-04 09:15:30  34466.0
7 2022-03-04 09:15:35  34466.0
8 2022-03-04 09:15:40  34466.0
9 2022-03-04 09:15:45  34448.0
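
A runnable version of the outer-join idea, as a sketch under a few assumptions: the sample data from the question, DateTime kept as a regular datetime column, and an explicit sort and index reset added so the result does not depend on how a given pandas version orders outer-merge output. The names `grid` and `filled` are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2022-03-04 09:15:00",
        "2022-03-04 09:15:05",
        "2022-03-04 09:15:10",
        "2022-03-04 09:15:20",
        "2022-03-04 09:15:45",
    ]),
    "Price": [34526.0, 34487.0, 34470.0, 34466.0, 34448.0],
})

# Complete 5-second grid between the first and last timestamps.
t = pd.date_range(df["DateTime"].min(), df["DateTime"].max(), freq="5s", name="DateTime")
grid = t.to_frame(index=False)  # one-column DataFrame named "DateTime"

# The outer join keeps every grid timestamp; Price is NaN where a row was missing.
merged = pd.merge(grid, df, on="DateTime", how="outer")

# Sort by time before forward filling so each gap takes the preceding price.
filled = merged.sort_values("DateTime").reset_index(drop=True).ffill()

print(filled)
```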

CodePudding user response:

Another option:

  1. Create a new dataframe with the range of dates you want

    df_2 = pd.DataFrame({
        "DateTime": pd.date_range(start=df.loc[0, "DateTime"], end=df.loc[len(df.index)-1, "DateTime"], freq="5s")
    })
    
  2. Merge the new and the original dataframe using outer join

    df = pd.merge(df, df_2, how="outer").sort_values("DateTime")
    
  3. Forward-fill the empty values using .ffill() (fillna(method="ffill") is deprecated since pandas 2.1)

    df = df.ffill()
    

Output:

             DateTime    Price
0 2022-03-04 09:15:00  34526.0
1 2022-03-04 09:15:05  34487.0
2 2022-03-04 09:15:10  34470.0
5 2022-03-04 09:15:15  34470.0
3 2022-03-04 09:15:20  34466.0
6 2022-03-04 09:15:25  34466.0
7 2022-03-04 09:15:30  34466.0
8 2022-03-04 09:15:35  34466.0
9 2022-03-04 09:15:40  34466.0
4 2022-03-04 09:15:45  34448.0

Resulting code:

df_2 = pd.DataFrame({
    "DateTime": pd.date_range(start=df.loc[0, "DateTime"], end=df.loc[len(df.index)-1, "DateTime"], freq="5s")
})
df = pd.merge(df, df_2, how="outer").sort_values("DateTime")
df = df.ffill()  # fillna(method="ffill") is deprecated since pandas 2.1

print(df)
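
The three steps can also be wrapped in a small function. This is only a sketch: `fill_missing_rows` and its parameters are hypothetical names, not from the answer. It takes the range from `min()`/`max()` rather than `df.loc[0, ...]`, so it does not assume a default 0..n-1 index, and it resets the index at the end so the row labels are sequential instead of the mixed order shown in the output above.

```python
import pandas as pd


def fill_missing_rows(df: pd.DataFrame, column: str = "DateTime", freq: str = "5s") -> pd.DataFrame:
    """Hypothetical helper: add rows on a regular time grid, forward-filling other columns."""
    # Step 1: the full range of timestamps between the earliest and latest row.
    grid = pd.DataFrame({column: pd.date_range(df[column].min(), df[column].max(), freq=freq)})
    # Step 2: the outer join adds the timestamps missing from the original data.
    merged = pd.merge(df, grid, on=column, how="outer").sort_values(column)
    # Step 3: forward fill, then renumber the rows.
    return merged.ffill().reset_index(drop=True)


# Sample data from the question, given a non-default index on purpose.
sample = pd.DataFrame(
    {
        "DateTime": pd.to_datetime([
            "2022-03-04 09:15:00",
            "2022-03-04 09:15:05",
            "2022-03-04 09:15:10",
            "2022-03-04 09:15:20",
            "2022-03-04 09:15:45",
        ]),
        "Price": [34526.0, 34487.0, 34470.0, 34466.0, 34448.0],
    },
    index=[10, 11, 12, 13, 14],
)

result = fill_missing_rows(sample)
print(result)
```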