I have a dataframe with 2 timestamp columns ('start' and 'end'). I want to explode the rows of the dataframe such that I can split the timestamp by hours.
Here is an example:
import pandas as pd
df = pd.DataFrame({'id_kanban': [244],'component': ['A'],'start': ['2021-02-02 11:03:18'], 'end': ['2021-02-02 13:33:28']})
print(df)
The output I´m getting is:
df1 = pd.DataFrame({
'id_kanban': [244, 244, 244, 244],
'component': ['A', 'A', 'A', 'A'],
'start': ['2021-02-02 11:03:18', '2021-02-02 01:00:00',
'2021-02-02 01:00:00', '2021-02-02 01:00:00'],
'end': ['2021-02-02 01:00:00', '2021-02-02 01:00:00',
'2021-02-02 01:00:00','2021-02-02 13:33:28']})
However, the output I want to get is:
df2 = pd.DataFrame({
'id_kanban': [244, 244, 244],
'component': ['A', 'A', 'A'],
'start': ['2021-02-02 11:03:18', '2021-02-02 12:00:00',
'2021-02-02 13:00:00'],
'end': ['2021-02-02 12:00:00', '2021-02-02 13:00:00',
'2021-02-02 13:33:28']})
The code I'm using is similar to the one found in a similar question: How can I split the difference between two timestamps that contain more than one month in a Pandas DataFrame
This is the code I´m trying:
def find_interval(sr):
dti = pd.date_range(sr['start'], sr['end'], freq='H').normalize() \
pd.Timedelta(hours=1)
return list(zip([sr['start']] dti.tolist(), dti.tolist() [sr['end']]))
df1 = df.apply(find_interval, axis=1).explode().apply(pd.Series)
df1 = df.drop(columns=['start', 'end']) \
.join(df1).rename(columns={0: 'start', 1: 'end'})
Any ideas on how to output the correct dataframe? Thanks
CodePudding user response:
Try as follows:
# floor start time, ceil end time, and set `inclusive` to "neither"
def find_interval(sr):
dti = pd.date_range(sr['start'].floor('H'), sr['end'].ceil('H'), freq='H',
inclusive='neither')
return list(zip([sr['start']] dti.tolist(), dti.tolist() [sr['end']]))
# turn your strings into timestamps
df[['start','end']] = df[['start','end']].apply(pd.to_datetime)
df1 = df.apply(find_interval, axis=1).explode().apply(pd.Series)
# add reset_index here
df1 = df.drop(columns=['start', 'end']) \
.join(df1).rename(columns={0: 'start', 1: 'end'}).reset_index(drop=True)
print(df1)
id_kanban component start end
0 244 A 2021-02-02 11:03:18 2021-02-02 12:00:00
1 244 A 2021-02-02 12:00:00 2021-02-02 13:00:00
2 244 A 2021-02-02 13:00:00 2021-02-02 13:33:28
CodePudding user response:
No reason not to use @ouroboros1 answer but as an alternative approach, which may be a bit more intuitive, here is a solution using piso
- a library for working with pandas intervals which has a function to split intervals at given time points.
setup
import pandas as pd
import piso
df = pd.DataFrame({'id_kanban': [244],'component': ['A'],'start': ['2021-02-02 11:03:18'], 'end': ['2021-02-02 13:33:28']})
df[["start", "end"]] =df[["start", "end"]].apply(pd.to_datetime)
solution
Given the addition of id_kanban
and component
columns I'm assuming you have a larger dataset and may be wanting to group on these columns and split on the hour, so this is the approach given below.
def break_hourly(df):
# create the hourly breakpoints
hour_timestamps = pd.date_range(
df["start"].min().ceil("H"),
df["end"].max().floor("H"),
freq="H",
)
# create pandas.IntervalIndex from time columns and split at breakpoints, resulting in another interval index
ii = piso.split(
pd.IntervalIndex.from_arrays(df["start"], df["end"]),
hour_timestamps
)
# create dataframe from new interval index
return pd.DataFrame({
"start":ii.left,
"end":ii.right,
})
result = (
df.groupby(["id_kanban", "component"])
.apply(break_hourly)
.droplevel(-1) # remove redundant index level introduced by groupby
.reset_index() # index -> columns
)
Disclaimer: I am the author of piso.