I have dataframe:
data = {'startTime':['01-06-2010 09:00:00', '13-02-2016 09:00:00', '18-03-2018 09:00:00', '23-05-2011 09:00:00'], 'endTime':['02-06-2010 17:00:00', '14-02-2016 17:00:00', '19-03-2018 17:00:00', '24-05-2011 17:00:00'], 'durationInMinutes': [1440, 1440, 1440, 1440]}
df = pd.DataFrame(data)
I want to make additional rows by dividing 1440 minutes into equal intervals of 8 hours per day. So, 1440 minutes would be equal to 3 days (3 additional rows) (9am - 5pm). Minutes can be more than 1440. New rows in startTime and endTime would be:
startTime endTime
01-06-2010 09:00:00 01-06-2010 17:00:00
02-06-2010 09:00:00 02-06-2010 17:00:00
03-06-2010 09:00:00 03-06-2010 17:00:00
13-02-2016 09:00:00 13-02-2016 17:00:00
14-02-2016 09:00:00 14-02-2016 17:00:00
15-02-2016 09:00:00 15-02-2016 17:00:00
Can anyone please help me with this? Thanks.
CodePudding user response:
IIUC, you can use:
# ensure datetime
df[['startTime', 'endTime']] = df[['startTime', 'endTime']].apply(pd.to_datetime, dayfirst=True)
# compute number of rows in days
extra = np.ceil(df['durationInMinutes'].div(60*8)).astype(int)
# compute a shift ( 0, 1, 2days etc.)
shift = extra.repeat(extra).groupby(level=0).cumcount().mul(pd.Timedelta('1day'))
# duplicate the rows
df2 = df.loc[df.index.repeat(extra)].reset_index(drop=True)
# add the shift
df2[['startTime', 'endTime']] = df2[['startTime', 'endTime']].add(shift.values, axis=0)
print(df2)
output:
startTime endTime durationInMinutes
0 2010-06-01 09:00:00 2010-06-02 17:00:00 1440
1 2010-06-02 09:00:00 2010-06-03 17:00:00 1440
2 2010-06-03 09:00:00 2010-06-04 17:00:00 1440
3 2016-02-13 09:00:00 2016-02-14 17:00:00 1440
4 2016-02-14 09:00:00 2016-02-15 17:00:00 1440
5 2016-02-15 09:00:00 2016-02-16 17:00:00 1440
6 2018-03-18 09:00:00 2018-03-19 17:00:00 1440
7 2018-03-19 09:00:00 2018-03-20 17:00:00 1440
8 2018-03-20 09:00:00 2018-03-21 17:00:00 1440
9 2011-05-23 09:00:00 2011-05-24 17:00:00 1440
10 2011-05-24 09:00:00 2011-05-25 17:00:00 1440
11 2011-05-25 09:00:00 2011-05-26 17:00:00 1440
CodePudding user response:
UPDATED: Here is a way to do what you have asked in your question and clarified in your comments:
df['days'] = df.durationInMinutes // (8 * 60)
(df.durationInMinutes % (8 * 60) > 0).astype(int)
df['durationInMinutes'] = df.apply(lambda x: [8 * 60] * (x.days - 1)
[x.durationInMinutes % (8 * 60)
(x.durationInMinutes % (8 * 60) == 0) * 8 * 60], axis=1)
df['daysToAdd'] = df.days.apply(lambda x: range(x))
df = df.explode(['durationInMinutes', 'daysToAdd'])
df.startTime = (pd.to_datetime(df.startTime, dayfirst=True).astype('int64')
df.daysToAdd * 24*60*60*1_000_000_000).astype('datetime64[ns]')
df.endTime = (df.startTime.astype('int64')
df.durationInMinutes * 60*1_000_000_000).astype('datetime64[ns]')
df = df.drop(columns=['days', 'daysToAdd']).reset_index(drop=True)
Explanation:
- add column
days
with number of result rows corresponding to each input row. - update
durationInMinutes
column to contain a list for each row of the minutes in each result row originating from the input row. - add column
daysToAdd
with a list for each row of the number of days to add to the startTime for each result row originating from the input row. - use
explode()
to create result rows with one value each from the lists indurationInMinutes
anddaysToAdd
. - add the nanosecond equivalent of
daysToAdd
tostartTime
. - update
endTime
to bestartTime
plus the nanosecond equivalent ofdurationInMinutes
. - use
drop()
to eliminate unneeded columns, and usereset_index()
to get an integer range index starting at 0 that advances by 1 for each row.
Input:
startTime endTime durationInMinutes
0 01-06-2010 09:00:00 02-06-2010 17:00:00 1445
1 13-02-2016 09:00:00 14-02-2016 17:00:00 1435
2 18-03-2018 09:00:00 19-03-2018 17:00:00 1440
3 23-05-2011 09:00:00 24-05-2011 17:00:00 1440
Output:
startTime endTime durationInMinutes
0 2010-06-01 09:00:00 2010-06-01 17:00:00 480
1 2010-06-02 09:00:00 2010-06-02 17:00:00 480
2 2010-06-03 09:00:00 2010-06-03 17:00:00 480
3 2010-06-04 09:00:00 2010-06-04 09:05:00 5
4 2016-02-13 09:00:00 2016-02-13 17:00:00 480
5 2016-02-14 09:00:00 2016-02-14 17:00:00 480
6 2016-02-15 09:00:00 2016-02-15 16:55:00 475
7 2018-03-18 09:00:00 2018-03-18 17:00:00 480
8 2018-03-19 09:00:00 2018-03-19 17:00:00 480
9 2018-03-20 09:00:00 2018-03-20 17:00:00 480
10 2011-05-23 09:00:00 2011-05-23 17:00:00 480
11 2011-05-24 09:00:00 2011-05-24 17:00:00 480
12 2011-05-25 09:00:00 2011-05-25 17:00:00 480
ORIGINAL ANSWER:
Here's a way to do what your question asks:
df = pd.concat([df.assign(durationInMinutes=df.durationInMinutes/3,
orig_row=i).reset_index() for i in range(3)])
for col in ['startTime', 'endTime']:
df[col] = (pd.to_datetime(df[col], dayfirst=True).astype('int64')
df.orig_row * 24*60*60*1_000_000_000).astype('datetime64[ns]')
df = df.sort_values('index').drop(columns=['index', 'orig_row'])
Explanation:
- overwrite the
durationInMinutes
column by dividing its contents by 3. - create 3 copies of
df
, each with a new columnorig_row
containing an integer corresponding to which number the copy is (0, 1 or 2). - for each of
startTime
andendTime
, convert the string value to a datetime in nanoseconds and add the nanosecond equivalent of the number of days inorig_row
to it (24 hours * 60 minutes * 60 seconds * 1bn nanoseconds
). - sort by the original index (so consecutive days are grouped together) and drop the intermediate columns.
Input:
startTime endTime durationInMinutes
0 01-06-2010 09:00:00 02-06-2010 17:00:00 1440
1 13-02-2016 09:00:00 14-02-2016 17:00:00 1440
2 18-03-2018 09:00:00 19-03-2018 17:00:00 1440
3 23-05-2011 09:00:00 24-05-2011 17:00:00 1440
Output:
startTime endTime durationInMinutes
0 2010-06-01 09:00:00 2010-06-02 17:00:00 480.0
0 2010-06-02 09:00:00 2010-06-03 17:00:00 480.0
0 2010-06-03 09:00:00 2010-06-04 17:00:00 480.0
1 2016-02-13 09:00:00 2016-02-14 17:00:00 480.0
1 2016-02-14 09:00:00 2016-02-15 17:00:00 480.0
1 2016-02-15 09:00:00 2016-02-16 17:00:00 480.0
2 2018-03-18 09:00:00 2018-03-19 17:00:00 480.0
2 2018-03-19 09:00:00 2018-03-20 17:00:00 480.0
2 2018-03-20 09:00:00 2018-03-21 17:00:00 480.0
3 2011-05-23 09:00:00 2011-05-24 17:00:00 480.0
3 2011-05-24 09:00:00 2011-05-25 17:00:00 480.0
3 2011-05-25 09:00:00 2011-05-26 17:00:00 480.0