I have a DataFrame:
init | end | temp
2022-02-02 10:34:00 | 2022-02-02 11:34:00 | 34
2022-02-02 11:34:00 | 2022-02-02 12:34:00 | 12
2022-02-02 13:34:00 | 2022-02-02 14:34:00 | 23
2022-02-02 14:34:00 | 2022-02-02 15:34:00 | 22
2022-02-02 17:34:00 | 2022-02-02 18:34:00 | 18
I need to fill in the missing time slots (the end of one row is the start of the next) between a given start and end date. For example, with start=2022-02-02 09:34:00 and end=2022-02-02 18:34:00,
I need to fill the DataFrame as follows:
init | end | temp
**2022-02-02 09:34:00 | 2022-02-02 11:34:00 | 0**
2022-02-02 10:34:00 | 2022-02-02 11:34:00 | 34
2022-02-02 11:34:00 | 2022-02-02 12:34:00 | 12
**2022-02-02 12:34:00 | 2022-02-02 11:34:00 | 0**
2022-02-02 13:34:00 | 2022-02-02 14:34:00 | 23
2022-02-02 14:34:00 | 2022-02-02 15:34:00 | 22
**2022-02-02 15:34:00 | 2022-02-02 11:34:00 | 0**
**2022-02-02 16:34:00 | 2022-02-02 11:34:00 | 0**
2022-02-02 17:34:00 | 2022-02-02 18:34:00 | 18
**2022-02-02 18:34:00 | 2022-02-02 11:34:00 | 0**
CodePudding user response:
You can build a temporary DataFrame that covers the whole datetime range in hourly steps, then OUTER JOIN it with the original (using pd.merge()), as follows:
import pandas as pd
from datetime import datetime, timedelta
df = pd.DataFrame({
'init': ['2022-02-02 10:34:00', '2022-02-02 11:34:00', '2022-02-02 13:34:00', '2022-02-02 14:34:00', '2022-02-02 17:34:00'],
'end': ['2022-02-02 11:34:00', '2022-02-02 12:34:00', '2022-02-02 14:34:00', '2022-02-02 15:34:00', '2022-02-02 18:34:00'],
'temp': [34, 12, 23, 22, 18],
})
# to convert str to datetime type for init and end columns
df['init'] = pd.to_datetime(df['init'])
df['end'] = pd.to_datetime(df['end'])
# to create a temporary dataframe holding every hourly init value (the additional rows)
tmp_df = pd.DataFrame()
tmp_df['init'] = pd.date_range(start=df.iloc[0]['init'] - timedelta(hours=1), end=df.iloc[-1]['end'], freq="H")
# to create final result
result = pd.merge(df, tmp_df, on='init', how='outer')
result = result.sort_values(by=['init']).reset_index(drop=True)
#result['end'] = result['init'] + timedelta(hours=1) # use this if you want end to be init + 1 hour
result['end'] = result['end'].apply(lambda x: datetime(2022, 2, 2, 11, 34, 0) if x is pd.NaT else x) # fixed end value for the added rows
result['temp'] = result['temp'].fillna(0) # convert NaN to 0
print(result)
This will print what you expected:
>>> result
init end temp
0 2022-02-02 09:34:00 2022-02-02 11:34:00 0.0
1 2022-02-02 10:34:00 2022-02-02 11:34:00 34.0
2 2022-02-02 11:34:00 2022-02-02 12:34:00 12.0
3 2022-02-02 12:34:00 2022-02-02 11:34:00 0.0
4 2022-02-02 13:34:00 2022-02-02 14:34:00 23.0
5 2022-02-02 14:34:00 2022-02-02 15:34:00 22.0
6 2022-02-02 15:34:00 2022-02-02 11:34:00 0.0
7 2022-02-02 16:34:00 2022-02-02 11:34:00 0.0
8 2022-02-02 17:34:00 2022-02-02 18:34:00 18.0
9 2022-02-02 18:34:00 2022-02-02 11:34:00 0.0
If you instead want the "end" column to be "init + 1 hour", use the line that is commented out in the code, result['end'] = result['init'] + timedelta(hours=1), in place of the result['end'] = result['end'].apply(...) line.
This will print the following:
init end temp
0 2022-02-02 09:34:00 2022-02-02 10:34:00 0.0
1 2022-02-02 10:34:00 2022-02-02 11:34:00 34.0
2 2022-02-02 11:34:00 2022-02-02 12:34:00 12.0
3 2022-02-02 12:34:00 2022-02-02 13:34:00 0.0
4 2022-02-02 13:34:00 2022-02-02 14:34:00 23.0
5 2022-02-02 14:34:00 2022-02-02 15:34:00 22.0
6 2022-02-02 15:34:00 2022-02-02 16:34:00 0.0
7 2022-02-02 16:34:00 2022-02-02 17:34:00 0.0
8 2022-02-02 17:34:00 2022-02-02 18:34:00 18.0
9 2022-02-02 18:34:00 2022-02-02 19:34:00 0.0
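The code above derives the range from the first and last rows of the data, while the question asks for an explicit start and end. A minimal sketch of the same outer-join idea with those two values passed in could look like this. The helper name fill_missing_hours and its step parameter are hypothetical, not part of pandas, and it assumes each added row should end one step after its init:

```python
import pandas as pd

def fill_missing_hours(df, start, end, step=pd.Timedelta(hours=1)):
    """Hypothetical helper: add a zero-temp row for every missing 'init' slot."""
    # Every slot start between `start` and `end` (inclusive), one `step` apart
    slots = pd.DataFrame({'init': pd.date_range(start=start, end=end, freq=step)})
    # The outer join keeps the original rows and adds the missing slots
    result = pd.merge(df, slots, on='init', how='outer')
    result = result.sort_values(by='init').reset_index(drop=True)
    # Added rows have no 'end' yet: assume it is 'init' plus one step
    result['end'] = result['end'].fillna(result['init'] + step)
    result['temp'] = result['temp'].fillna(0)  # convert NaN to 0
    return result

df = pd.DataFrame({
    'init': ['2022-02-02 10:34:00', '2022-02-02 11:34:00', '2022-02-02 13:34:00',
             '2022-02-02 14:34:00', '2022-02-02 17:34:00'],
    'end': ['2022-02-02 11:34:00', '2022-02-02 12:34:00', '2022-02-02 14:34:00',
            '2022-02-02 15:34:00', '2022-02-02 18:34:00'],
    'temp': [34, 12, 23, 22, 18],
})
df['init'] = pd.to_datetime(df['init'])
df['end'] = pd.to_datetime(df['end'])

filled = fill_missing_hours(df, '2022-02-02 09:34:00', '2022-02-02 18:34:00')
print(filled)
```

Passing the step as a pd.Timedelta rather than a frequency string keeps the slot spacing and the computed end values in sync.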
CodePudding user response:
You can use a combination of pd.date_range and pd.Timedelta:
import pandas as pd
# Create the sample dataframe
df = pd.DataFrame({'init': ['2022-02-02 10:34:00', '2022-02-02 11:34:00', '2022-02-02 13:34:00', '2022-02-02 14:34:00', '2022-02-02 17:34:00'], 'end': ['2022-02-02 11:34:00', '2022-02-02 12:34:00', '2022-02-02 14:34:00', '2022-02-02 15:34:00', '2022-02-02 18:34:00'], 'temp': [34, 12, 23, 22, 18]})
# Convert init and end columns into a datetime type
df['init'] = pd.to_datetime(df['init'])
df['end'] = pd.to_datetime(df['end'])
# Fill the missing values
start, end = '2022-02-02 09:34:00', '2022-02-02 18:34:00'
hr = pd.date_range(start, end, freq='H')
df_hr = pd.DataFrame(zip(hr, hr + pd.Timedelta(hours=1)), columns=['init', 'end'])
df = df_hr.merge(df, how='left', on=['init', 'end']).fillna(0)
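One side effect of fillna(0) is that an added row can no longer be told apart from a real reading of 0. If that matters, a possible variation is to ask merge for its indicator column and keep a flag. This is only a sketch: the column name filled is an assumption for illustration, and the hourly step is passed as a pd.Timedelta instead of a frequency string:

```python
import pandas as pd

df = pd.DataFrame({
    'init': ['2022-02-02 10:34:00', '2022-02-02 11:34:00', '2022-02-02 13:34:00',
             '2022-02-02 14:34:00', '2022-02-02 17:34:00'],
    'end': ['2022-02-02 11:34:00', '2022-02-02 12:34:00', '2022-02-02 14:34:00',
            '2022-02-02 15:34:00', '2022-02-02 18:34:00'],
    'temp': [34, 12, 23, 22, 18],
})
df['init'] = pd.to_datetime(df['init'])
df['end'] = pd.to_datetime(df['end'])

step = pd.Timedelta(hours=1)
hr = pd.date_range('2022-02-02 09:34:00', '2022-02-02 18:34:00', freq=step)
df_hr = pd.DataFrame({'init': hr, 'end': hr + step})

# indicator=True adds a '_merge' column: 'both' for original rows,
# 'left_only' for slots that exist only in the hourly grid
merged = df_hr.merge(df, how='left', on=['init', 'end'], indicator=True)
merged['filled'] = merged['_merge'] == 'left_only'  # hypothetical flag column
merged = merged.drop(columns='_merge')
merged['temp'] = merged['temp'].fillna(0)
print(merged)
```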