I tried to use date_range
with a day
frequency on this dataframe:
df = pd.DataFrame({'Start':['2022-06-07 06:24:48','2022-06-07 14:37:16','2022-06-07 08:00:59'],
'End':['2022-06-07 14:07:00','2022-06-08 02:51:21','2022-06-09 13:18:34'],
'Process':['PROD','VORG','STO'],
'Duration_Min':[462.20,734.08,3197.58]})
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])
df['difference']=df['End'].dt.date-df['Start'].dt.date
def find_interval(sr):
dti = pd.date_range(sr['Start'], sr['End'], freq='1D').normalize() pd.Timedelta(days=1)
return list(zip([sr['Start']] dti.tolist(), dti.tolist() [sr['End']]))
df1 = df.apply(find_interval, axis=1).explode().apply(pd.Series)
df1 = df.drop(columns=['Start', 'End']).join(df1).rename(columns={0: 'Start', 1: 'End'})
df1['Duration_Min']=(df1["End"]-df1["Start"]).dt.total_seconds().div(60)
What I get is:
Process Duration_Min difference Start End
0 PROD 1055.200000 0 days 2022-06-07 06:24:48 2022-06-08 00:00:00
0 PROD -593.000000 0 days 2022-06-08 00:00:00 2022-06-07 14:07:00
1 VORG 562.733333 1 days 2022-06-07 14:37:16 2022-06-08 00:00:00
1 VORG 171.350000 1 days 2022-06-08 00:00:00 2022-06-08 02:51:21
2 STO 959.016667 2 days 2022-06-07 08:00:59 2022-06-08 00:00:00
2 STO 1440.000000 2 days 2022-06-08 00:00:00 2022-06-09 00:00:00
2 STO 1440.000000 2 days 2022-06-09 00:00:00 2022-06-10 00:00:00
2 STO -641.433333 2 days 2022-06-10 00:00:00 2022-06-09 13:18:34
I would like to cut the events so that new timestamps with new intervals are created when the day changes between Start
and End
. If the difference between the dates is 0 days
I don't need to create new timestamps and with the Timedelta(days=1)
the End
timestamp is mismatched. The column Days
should be corosponding with weekday()
What I want:
Start End Process Duration_Min Days
0 2022-06-07 06:24:48 2022-06-07 14:07:00 PROD 462.200000 1
1 2022-06-07 14:37:16 2022-06-07 23:59:59 VORG 562.716667 1
2 2022-06-08 00:00:00 2022-06-08 02:51:21 VORG 171.350000 2
3 2022-06-07 08:00:59 2022-06-07 23:59:59 STO 959.000000 1
4 2022-06-08 00:00:00 2022-06-08 23:59:59 STO 1439.983333 2
5 2022-06-09 00:00:00 2022-06-09 13:18:34 STO 798.566667 3
How could I achieve this?
CodePudding user response:
You could try:
def find_interval(row):
start, end = row.at["Start"], row.at["End"]
days = pd.date_range(start, end, freq="D", normalize=True).to_list()
if len(days) == 1 or days[-1] != end:
days.append(end)
days[0] = start
return list(zip(days, days[1:]))
result = (
df
.assign(Days=df.apply(find_interval, axis=1))
.explode("Days")
.assign(
Start=lambda df: df["Days"].str[0],
End=lambda df: df["Days"].str[1],
Duration_Min=lambda df:
(df["End"] - df["Start"]).dt.total_seconds().div(60),
Days=lambda df: df.groupby("Process").transform("cumcount") 1
)
)
Result for your df
:
Start End Process Duration_Min Days
0 2022-06-07 06:24:48 2022-06-07 14:07:00 PROD 462.200000 1
1 2022-06-07 14:37:16 2022-06-08 00:00:00 VORG 562.733333 1
1 2022-06-08 00:00:00 2022-06-08 02:51:21 VORG 171.350000 2
2 2022-06-07 08:00:59 2022-06-08 00:00:00 STO 959.016667 1
2 2022-06-08 00:00:00 2022-06-09 00:00:00 STO 1440.000000 2
2 2022-06-09 00:00:00 2022-06-09 13:18:34 STO 798.566667 3
If, as indicated in the comments, the substantial part of df
doesn't need the day-separation, then the following might be better:
m = df["Start"].dt.date < df["End"].dt.date
result = (
df[m]
.assign(Days=df.apply(find_interval, axis=1))
... <see above> ...
)
result = pd.concat([df[~m].assign(Days=1), result]).sort_index()
The .sort_index()
-part is to make sure that the Process
-order is the same as in df
. Remove it, if that's not important.
CodePudding user response:
Alright so the first thing you want to do is have a date column:
df["date"]=df[["Start","End"]].min(axis=1).dt.date
Once you have that, you will now want to groupby according to your relevant columns
df = df.groupby(["date",
"Process"]).agg({"Start":"min","End":"min","Duration_Min":"sum", "Days":"any"}).reset_index()
And you should end up with the relevant dataframe