I am new in python and pandas and also in stackoverflow so I apologize for any mistakes I make in advance.
I have this dataframe
df = pd.DataFrame(
data=[['Donald Trump', 'German', '2021-9-23 14:28:00','2021-9-23 14:58:00', 1800 ],
['Donald Trump', 'German', '2021-9-23 14:58:01','2021-9-23 15:00:05', 124 ],
['Donald Trump', 'German', '2021-9-24 10:05:00','2021-9-24 10:15:30', 630 ],
['Monica Lewinsky', 'German', '2021-9-24 10:05:00','2021-9-24 10:05:30', 30 ]],
columns=['specialist', 'language', 'interval_start', 'interval_end', 'status_duration']
)
df['interval_start'] = pd.to_datetime(df['interval_start'])
df['interval_end'] = pd.to_datetime(df['interval_end'])
output is
specialist language interval_start interval_end status_duration
0 Donald Trump German 2021-09-23 14:28:00 2021-09-23 14:58:00 1800
1 Donald Trump German 2021-09-23 14:58:01 2021-09-23 15:00:05 125
2 Donald Trump German 2021-09-24 10:05:00 2021-09-24 10:15:30 630
3 Monica Lewinsky German 2021-09-24 10:05:00 2021-09-24 10:15:30 630
and my desired outcome is to have something like in below
specialist language interval status_duration
0 Donald Trump German 2021-9-23 14:15:00 120
1 Donald Trump German 2021-9-23 14:30:00 900
2 Donald Trump German 2021-9-23 14:45:00 899
3 Donald Trump German 2021-9-23 15:00:00 5
4 Donald Trump German 2021-9-24 10:00:00 600
5 Donald Trump German 2021-9-24 10:15:00 30
6 Monica Lewinsky German 2021-9-24 10:15:00 30
I have this code from another topic link
ref = (df.groupby(["specialist", "Language", pd.Grouper(key="Interval Start", freq="D")], as_index=False)
.agg(status_duration=("status_duration", lambda d: [*([900]*(d.iat[0]//900)), d.iat[0]%900]),
Interval=("Interval Start", "first"))
.explode("status_duration"))
ref["Interval"] = ref["Interval"].dt.floor("15min") pd.to_timedelta(ref.groupby(ref.index).cumcount()*900, unit="sec")
But it does not take "interval_start" into consideration, I need to check first if the status_duration will remain on same 15 mins interval or not. Hope somebody can help as it is a very advanced problem for me and i am working on it for more than 10 days.
CodePudding user response:
After learning a bit more, I came up with another (better) solution using groupby()
and explode()
. I add this as a second answer since my first one, while maybe a bit complicated, still works and I am also referencing a part of it in this answer.
I first added a few new columns to split up the status_duration
into the first slice and the rest and replaced the original value of status_duration
with an according 2-element list:
df['first'] = ((df['interval_start'] pd.Timedelta('1sec')).dt.ceil('15min') - df['interval_start']).dt.total_seconds().astype(int)
df['rest'] = df['status_duration'] - df['first']
df['status_duration'] = df[['first','rest']].values.tolist()
df['status_duration'] = df['status_duration'].apply(lambda x: x if x[1] > 0 else [sum(x),0])
This gives you the following prepared dataframe:
specialist language interval_start ... status_duration first rest
0 Donald Trump German 2021-09-23 14:28:00 ... [120, 1680] 120 1680
1 Donald Trump German 2021-09-23 14:58:01 ... [119, 5] 119 5
2 Donald Trump German 2021-09-24 10:05:00 ... [600, 30] 600 30
3 Monica Lewinsky German 2021-09-24 10:05:00 ... [30, 0] 600 -570
On this, you can now perform a groupby()
and explode()
similar to the code in your question. Afterwards you round the intervals and group again to merge the intervals that have multiple entries now because of the explode()
. To clean up, I dropped the rows with duration 0
and reset the index:
ref = df.groupby(['specialist', 'language', pd.Grouper(key='interval_start', freq='T')], as_index=False)
.agg(status_duration=('status_duration', lambda d: [d.iat[0][0],*([900]*(d.iat[0][1]//900)), d.iat[0][1]%900]),interval_start=('interval_start', 'first'))
.explode('status_duration')
ref['interval_start'] = ref['interval_start'].dt.floor('15min') pd.to_timedelta(ref.groupby(ref.index).cumcount()*900, unit='sec')
ref = ref.groupby(['specialist', 'language', 'interval_start']).sum()
ref = ref[ref.status_duration != 0].reset_index()
This gives you your desired output:
specialist language interval_start status_duration
0 Donald Trump German 2021-09-23 14:15:00 120
1 Donald Trump German 2021-09-23 14:30:00 900
2 Donald Trump German 2021-09-23 14:45:00 899
3 Donald Trump German 2021-09-23 15:00:00 5
4 Donald Trump German 2021-09-24 10:00:00 600
5 Donald Trump German 2021-09-24 10:15:00 30
6 Monica Lewinsky German 2021-09-24 10:00:00 30
Note: The problem I described in the other answer, that the final grouping step could result in a status_duration
> 900 should not be possible with real data, since a specialist shouldn't be able to start a second interval before the first one ends. So this is a case you do not need to handle after all.
CodePudding user response:
Not sure whether this isn't unnecessarily convoluted, but it does get the job done. There are probably nicer, more pythonic approaches though...
I first added a few new columns to the df with the resulting number of intervals that the status_duration
suggests, the number of minutes that fit in the first interval and the remainder of the duration:
df['len'] = 1 (df['status_duration']-1)//900
df['first'] = ((df['interval_start'] timedelta(seconds=1)).dt.ceil('15min') - df['interval_start']).dt.total_seconds().astype(int)
df['rest'] = df['status_duration'] - df['first']
Then, we add one additional interval for each row with a positive rest and a first slice < 900:
df['len'] = np.where((df['rest'] > 0) & (df['first'] < 900), df['len'] 1, df['len'])
Now, I create the new dataframe by using np.repeat()
to duplicate the rows so that I have the right number according to the number of intervals and list comprehensions to build the interval_start
and status_duration
columns using df.iterrows()
:
new_df = pd.DataFrame({'specialist': np.repeat(df['specialist'], df['len']),
'language': np.repeat(df['language'], df['len']),
'interval_start': [el for sublist in [[x['interval_start'] timedelta(minutes=15*y) for y in range(0, x['len'])] if (x['len'] > 1) else [x['interval_start']] for i, x in df.iterrows()] for el in sublist],
'status_duration': [el for sublist in [([x['first']] [900]*(x['len']-2) [x['rest']%900]) if x['len'] > 1 else [x['status_duration']] for i, x in df.iterrows()] for el in sublist]
})
Then we round the interval start time
new_df['interval_start'] = new_df['interval_start'].dt.floor('15min')
All that's left to do now is grouping and resetting the index:
new_df = new_df.groupby(['specialist', 'language', 'interval_start']).sum().reset_index()
Result:
specialist language interval_start status_duration
0 Donald Trump German 2021-09-23 14:15:00 120
1 Donald Trump German 2021-09-23 14:30:00 900
2 Donald Trump German 2021-09-23 14:45:00 899
3 Donald Trump German 2021-09-23 15:00:00 5
4 Donald Trump German 2021-09-24 10:00:00 600
5 Donald Trump German 2021-09-24 10:15:00 30
6 Monica Lewinsky German 2021-09-24 10:00:00 30
One problem remains: The last grouping step could result in 15-minute intervals that through the grouping again get a status_duration
> 900.
Imagine your second row of your input data had an interval_start
that was 2 seconds earlier:
specialist language interval_start interval_end status_duration
0 Donald Trump German 2021-09-23 14:28:00 2021-09-23 14:58:00 1800
1 Donald Trump German 2021-09-23 14:57:59 2021-09-23 15:00:03 124
2 Donald Trump German 2021-09-24 10:05:00 2021-09-24 10:15:30 630
3 Monica Lewinsky German 2021-09-24 10:05:00 2021-09-24 10:05:30 30
Then you'd wind up with a status_duration
of 901
after grouping:
specialist language interval_start status_duration
0 Donald Trump German 2021-09-23 14:15:00 120
1 Donald Trump German 2021-09-23 14:30:00 900
2 Donald Trump German 2021-09-23 14:45:00 901
3 Donald Trump German 2021-09-23 15:00:00 3
4 Donald Trump German 2021-09-24 10:00:00 600
5 Donald Trump German 2021-09-24 10:15:00 30
6 Monica Lewinsky German 2021-09-24 10:00:00 30
This is complicated by the fact that this "splilling over" can happen multiple times. One approach would be to repeat the above steps until no new_df
rows with status_duration
> 900 remain. This will carry over the overflow.
Full example:
import pandas as pd
import numpy as np
from datetime import timedelta
input_df = pd.DataFrame(
data=[['Donald Trump', 'German', '2021-9-23 14:28:00','2021-9-23 14:58:00', 1800 ],
['Donald Trump', 'German', '2021-9-23 14:57:59','2021-9-23 15:00:03', 124 ],
['Donald Trump', 'German', '2021-9-24 10:05:00','2021-9-24 10:15:30', 630 ],
['Monica Lewinsky', 'German', '2021-9-24 10:05:00','2021-9-24 10:05:30', 30 ]],
columns=['specialist', 'language', 'interval_start', 'interval_end', 'status_duration']
)
input_df['interval_start'] = pd.to_datetime(input_df['interval_start'])
input_df['interval_end'] = pd.to_datetime(input_df['interval_end'])
def build_df(df):
while df['status_duration'].gt(900).any():
df['len'] = 1 (df['status_duration']-1)//900
df['first'] = ((df['interval_start'] timedelta(seconds=1)).dt.ceil('15min') - df['interval_start']).dt.total_seconds().astype(int)
df['rest'] = df['status_duration'] - df['first']
df['len'] = np.where((df['rest'] > 0) & (df['first'] < 900), df['len'] 1, df['len'])
new_df = pd.DataFrame({'specialist': np.repeat(df['specialist'], df['len']),
'language': np.repeat(df['language'], df['len']),
'interval_start': [el for sublist in [[x['interval_start'] timedelta(minutes=15*y) for y in range(0, x['len'])] if (x['len'] > 1) else [x['interval_start']] for i, x in df.iterrows()] for el in sublist],
'status_duration': [el for sublist in [([x['first']] [900]*(x['len']-2) [x['rest']%900]) if x['len'] > 1 else [x['status_duration']] for i, x in df.iterrows()] for el in sublist]
})
new_df['interval_start'] = new_df['interval_start'].dt.floor('15min')
new_df = new_df[new_df.status_duration != 0]
new_df = new_df.groupby(['specialist', 'language', 'interval_start']).sum().reset_index()
df = new_df.copy()
return df
output_df = build_df(input_df)
Result:
specialist language interval_start status_duration
0 Donald Trump German 2021-09-23 14:15:00 120
1 Donald Trump German 2021-09-23 14:30:00 900
2 Donald Trump German 2021-09-23 14:45:00 900
3 Donald Trump German 2021-09-23 15:00:00 4
4 Donald Trump German 2021-09-24 10:00:00 600
5 Donald Trump German 2021-09-24 10:15:00 30
6 Monica Lewinsky German 2021-09-24 10:00:00 30
Looking at it now, I would guess that there should probably be an easier way, but this is all I got...