Fill dataframe with consecutive datetimes

I have a DataFrame:

init                | end                 | temp
2022-02-02 10:34:00 | 2022-02-02 11:34:00 | 34
2022-02-02 11:34:00 | 2022-02-02 12:34:00 | 12
2022-02-02 13:34:00 | 2022-02-02 14:34:00 | 23
2022-02-02 14:34:00 | 2022-02-02 15:34:00 | 22
2022-02-02 17:34:00 | 2022-02-02 18:34:00 | 18

I need to fill in the missing time slots between a given start and end (the end of one row is the init of the next). For example, with start=2022-02-02 09:34:00 and end=2022-02-02 18:34:00, I need the DataFrame filled as follows:

init                | end                 | temp
**2022-02-02 09:34:00 | 2022-02-02 11:34:00 | 0**
2022-02-02 10:34:00 | 2022-02-02 11:34:00 | 34
2022-02-02 11:34:00 | 2022-02-02 12:34:00 | 12
**2022-02-02 12:34:00 | 2022-02-02 11:34:00 | 0**
2022-02-02 13:34:00 | 2022-02-02 14:34:00 | 23
2022-02-02 14:34:00 | 2022-02-02 15:34:00 | 22
**2022-02-02 15:34:00 | 2022-02-02 11:34:00 | 0**
**2022-02-02 16:34:00 | 2022-02-02 11:34:00 | 0**
2022-02-02 17:34:00 | 2022-02-02 18:34:00 | 18
**2022-02-02 18:34:00 | 2022-02-02 11:34:00 | 0**

CodePudding user response:

You can build a temporary DataFrame that holds the full hourly range of init values, then outer-join it with the original (using pd.merge()), as follows:

import pandas as pd
from datetime import datetime, timedelta

df = pd.DataFrame({
    'init': ['2022-02-02 10:34:00', '2022-02-02 11:34:00', '2022-02-02 13:34:00', '2022-02-02 14:34:00', '2022-02-02 17:34:00'],
    'end': ['2022-02-02 11:34:00', '2022-02-02 12:34:00', '2022-02-02 14:34:00', '2022-02-02 15:34:00', '2022-02-02 18:34:00'],
    'temp': [34, 12, 23, 22, 18],
})

# to convert str to datetime type for init and end columns
df['init'] = pd.to_datetime(df['init'])
df['end'] = pd.to_datetime(df['end'])

# to create a temporary dataframe with one init per hour, starting one hour before the first row
tmp_df = pd.DataFrame()
tmp_df['init'] = pd.date_range(start=df.iloc[0]['init'] - timedelta(hours=1), end=df.iloc[-1]['end'], freq=timedelta(hours=1))

# to create final result
result = pd.merge(df, tmp_df, on='init', how='outer')
result = result.sort_values(by=['init']).reset_index(drop=True)
#result['end'] = result['init'] + timedelta(hours=1)  # use this instead if end should be init + 1 hour
result['end'] = result['end'].apply(lambda x: datetime(2020, 2, 2, 11, 34, 0) if x is pd.NaT else x)
result['temp'] = result['temp'].fillna(0) # convert NaN to 0

print(result)

This prints the following; the added rows get temp 0 and the hard-coded end value from the code:

>>> result
                 init                 end  temp
0 2022-02-02 09:34:00 2020-02-02 11:34:00   0.0
1 2022-02-02 10:34:00 2022-02-02 11:34:00  34.0
2 2022-02-02 11:34:00 2022-02-02 12:34:00  12.0
3 2022-02-02 12:34:00 2020-02-02 11:34:00   0.0
4 2022-02-02 13:34:00 2022-02-02 14:34:00  23.0
5 2022-02-02 14:34:00 2022-02-02 15:34:00  22.0
6 2022-02-02 15:34:00 2020-02-02 11:34:00   0.0
7 2022-02-02 16:34:00 2020-02-02 11:34:00   0.0
8 2022-02-02 17:34:00 2022-02-02 18:34:00  18.0
9 2022-02-02 18:34:00 2020-02-02 11:34:00   0.0

If you want the "end" column to be "init + 1 hour" instead, use the line that is commented out in the code, result['end'] = result['init'] + timedelta(hours=1), in place of result['end'] = result['end'].apply(lambda x: datetime(2020, 2, 2, 11, 34, 0) if x is pd.NaT else x).

This prints the following:

                 init                 end  temp
0 2022-02-02 09:34:00 2022-02-02 10:34:00   0.0
1 2022-02-02 10:34:00 2022-02-02 11:34:00  34.0
2 2022-02-02 11:34:00 2022-02-02 12:34:00  12.0
3 2022-02-02 12:34:00 2022-02-02 13:34:00   0.0
4 2022-02-02 13:34:00 2022-02-02 14:34:00  23.0
5 2022-02-02 14:34:00 2022-02-02 15:34:00  22.0
6 2022-02-02 15:34:00 2022-02-02 16:34:00   0.0
7 2022-02-02 16:34:00 2022-02-02 17:34:00   0.0
8 2022-02-02 17:34:00 2022-02-02 18:34:00  18.0
9 2022-02-02 18:34:00 2022-02-02 19:34:00   0.0
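
The answer above derives the range from the data itself (first init minus one hour). As a minimal sketch of a variant that takes the start and end from the question explicitly, the same fill can be done with set_index and reindex. The names step and full_range are illustrative, and the sketch assumes every slot is exactly one hour long and that init values are unique:

```python
import pandas as pd

df = pd.DataFrame({
    "init": pd.to_datetime(["2022-02-02 10:34:00", "2022-02-02 11:34:00",
                            "2022-02-02 13:34:00", "2022-02-02 14:34:00",
                            "2022-02-02 17:34:00"]),
    "end": pd.to_datetime(["2022-02-02 11:34:00", "2022-02-02 12:34:00",
                           "2022-02-02 14:34:00", "2022-02-02 15:34:00",
                           "2022-02-02 18:34:00"]),
    "temp": [34, 12, 23, 22, 18],
})

# Bounds taken from the question; step is an assumption (one-hour slots)
start = pd.Timestamp("2022-02-02 09:34:00")
end = pd.Timestamp("2022-02-02 18:34:00")
step = pd.Timedelta(hours=1)

# Every init value that should exist between start and end (inclusive)
full_range = pd.date_range(start, end, freq=step)

# Reindex on init: slots missing from df appear as rows of NaT/NaN
result = (
    df.set_index("init")
      .reindex(full_range)
      .rename_axis("init")
      .reset_index()
)

# Keep existing end values, fill the missing ones with init + 1 hour
result["end"] = result["end"].fillna(result["init"] + step)
# Missing temperatures become 0; cast back to int after the NaN round trip
result["temp"] = result["temp"].fillna(0).astype(int)

print(result)
```

Unlike the merge version, this keeps temp as an integer column, at the cost of requiring unique init values (reindex raises on a duplicated index).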

CodePudding user response:

You can use a combination of pd.date_range and pd.Timedelta:

import pandas as pd

# Create the sample dataframe
df = pd.DataFrame({'init': ['2022-02-02 10:34:00', '2022-02-02 11:34:00', '2022-02-02 13:34:00', '2022-02-02 14:34:00', '2022-02-02 17:34:00'], 'end': ['2022-02-02 11:34:00', '2022-02-02 12:34:00', '2022-02-02 14:34:00', '2022-02-02 15:34:00', '2022-02-02 18:34:00'], 'temp': [34, 12, 23, 22, 18]})

# Convert init and end columns into a datetime type
df['init'] = pd.to_datetime(df['init'])
df['end'] = pd.to_datetime(df['end'])

# Build the full hourly range and left-join the original rows onto it
start, end = '2022-02-02 09:34:00', '2022-02-02 18:34:00'
step = pd.Timedelta(hours=1)
hr = pd.date_range(start, end, freq=step)
df_hr = pd.DataFrame(zip(hr, hr + step), columns=['init', 'end'])
df = df_hr.merge(df, how='left', on=['init', 'end']).fillna(0)  # slots without a match get temp 0
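
One thing the fill hides is which rows were original and which were added, since a genuine temp of 0 looks the same as a filler. A possible extension of the left merge above, sketched here with pandas' indicator option, flags the added rows; the filled column name is an assumption for illustration, and the final check only confirms that each end matches the next init:

```python
import pandas as pd

df = pd.DataFrame({
    "init": pd.to_datetime(["2022-02-02 10:34:00", "2022-02-02 11:34:00",
                            "2022-02-02 13:34:00", "2022-02-02 14:34:00",
                            "2022-02-02 17:34:00"]),
    "end": pd.to_datetime(["2022-02-02 11:34:00", "2022-02-02 12:34:00",
                           "2022-02-02 14:34:00", "2022-02-02 15:34:00",
                           "2022-02-02 18:34:00"]),
    "temp": [34, 12, 23, 22, 18],
})

start, end = "2022-02-02 09:34:00", "2022-02-02 18:34:00"
step = pd.Timedelta(hours=1)
hr = pd.date_range(start, end, freq=step)
df_hr = pd.DataFrame({"init": hr, "end": hr + step})

# indicator=True adds a "_merge" column: "both" for rows present in df,
# "left_only" for slots that exist only in the generated range
merged = df_hr.merge(df, how="left", on=["init", "end"], indicator=True)
merged["filled"] = merged["_merge"] == "left_only"  # hypothetical flag column
merged["temp"] = merged["temp"].fillna(0)
merged = merged.drop(columns="_merge")

# Sanity check: every row's end is the next row's init
contiguous = (merged["end"].iloc[:-1].to_numpy()
              == merged["init"].iloc[1:].to_numpy()).all()

print(merged)
print("contiguous:", contiguous)
```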