Pandas dataframe create new rows based on condition from another column-CodePudding

I have dataframe:

data = {'startTime':['01-06-2010 09:00:00', '13-02-2016 09:00:00', '18-03-2018 09:00:00', '23-05-2011 09:00:00'], 'endTime':['02-06-2010 17:00:00', '14-02-2016 17:00:00', '19-03-2018 17:00:00', '24-05-2011 17:00:00'], 'durationInMinutes': [1440, 1440, 1440, 1440]}

df = pd.DataFrame(data)

I want to make additional rows by dividing 1440 minutes into equal intervals of 8 hours per day. So, 1440 minutes would be equal to 3 days (3 additional rows) (9am - 5pm). Minutes can be more than 1440. New rows in startTime and endTime would be:

startTime               endTime

01-06-2010 09:00:00     01-06-2010 17:00:00
02-06-2010 09:00:00     02-06-2010 17:00:00
03-06-2010 09:00:00     03-06-2010 17:00:00
13-02-2016 09:00:00     13-02-2016 17:00:00
14-02-2016 09:00:00     14-02-2016 17:00:00
15-02-2016 09:00:00     15-02-2016 17:00:00

Can anyone please help me with this? Thanks.

CodePudding user response：

IIUC, you can use:

# ensure datetime
df[['startTime', 'endTime']] = df[['startTime', 'endTime']].apply(pd.to_datetime, dayfirst=True)

# compute number of rows in days
extra = np.ceil(df['durationInMinutes'].div(60*8)).astype(int)
# compute a shift ( 0,  1,  2days etc.)
shift = extra.repeat(extra).groupby(level=0).cumcount().mul(pd.Timedelta('1day'))

# duplicate the rows
df2 = df.loc[df.index.repeat(extra)].reset_index(drop=True)

# add the shift
df2[['startTime', 'endTime']] = df2[['startTime', 'endTime']].add(shift.values, axis=0)

print(df2)

output:

             startTime             endTime  durationInMinutes
0  2010-06-01 09:00:00 2010-06-02 17:00:00               1440
1  2010-06-02 09:00:00 2010-06-03 17:00:00               1440
2  2010-06-03 09:00:00 2010-06-04 17:00:00               1440
3  2016-02-13 09:00:00 2016-02-14 17:00:00               1440
4  2016-02-14 09:00:00 2016-02-15 17:00:00               1440
5  2016-02-15 09:00:00 2016-02-16 17:00:00               1440
6  2018-03-18 09:00:00 2018-03-19 17:00:00               1440
7  2018-03-19 09:00:00 2018-03-20 17:00:00               1440
8  2018-03-20 09:00:00 2018-03-21 17:00:00               1440
9  2011-05-23 09:00:00 2011-05-24 17:00:00               1440
10 2011-05-24 09:00:00 2011-05-25 17:00:00               1440
11 2011-05-25 09:00:00 2011-05-26 17:00:00               1440

CodePudding user response：

UPDATED: Here is a way to do what you have asked in your question and clarified in your comments:

df['days'] = df.durationInMinutes // (8 * 60)   
    (df.durationInMinutes % (8 * 60) > 0).astype(int)
df['durationInMinutes'] = df.apply(lambda x: [8 * 60] * (x.days - 1)   
    [x.durationInMinutes % (8 * 60)   
    (x.durationInMinutes % (8 * 60) == 0) * 8 * 60], axis=1)
df['daysToAdd'] = df.days.apply(lambda x: range(x))
df = df.explode(['durationInMinutes', 'daysToAdd'])
df.startTime = (pd.to_datetime(df.startTime, dayfirst=True).astype('int64')   
        df.daysToAdd * 24*60*60*1_000_000_000).astype('datetime64[ns]')
df.endTime = (df.startTime.astype('int64')   
        df.durationInMinutes * 60*1_000_000_000).astype('datetime64[ns]')
df = df.drop(columns=['days', 'daysToAdd']).reset_index(drop=True)

Explanation:

add column days with number of result rows corresponding to each input row.
update durationInMinutes column to contain a list for each row of the minutes in each result row originating from the input row.
add column daysToAdd with a list for each row of the number of days to add to the startTime for each result row originating from the input row.
use explode() to create result rows with one value each from the lists in durationInMinutes and daysToAdd.
add the nanosecond equivalent of daysToAdd to startTime.
update endTime to be startTime plus the nanosecond equivalent of durationInMinutes.
use drop() to eliminate unneeded columns, and use reset_index() to get an integer range index starting at 0 that advances by 1 for each row.

Input:

             startTime              endTime  durationInMinutes
0  01-06-2010 09:00:00  02-06-2010 17:00:00               1445
1  13-02-2016 09:00:00  14-02-2016 17:00:00               1435
2  18-03-2018 09:00:00  19-03-2018 17:00:00               1440
3  23-05-2011 09:00:00  24-05-2011 17:00:00               1440

Output:

             startTime             endTime durationInMinutes
0  2010-06-01 09:00:00 2010-06-01 17:00:00               480
1  2010-06-02 09:00:00 2010-06-02 17:00:00               480
2  2010-06-03 09:00:00 2010-06-03 17:00:00               480
3  2010-06-04 09:00:00 2010-06-04 09:05:00                 5
4  2016-02-13 09:00:00 2016-02-13 17:00:00               480
5  2016-02-14 09:00:00 2016-02-14 17:00:00               480
6  2016-02-15 09:00:00 2016-02-15 16:55:00               475
7  2018-03-18 09:00:00 2018-03-18 17:00:00               480
8  2018-03-19 09:00:00 2018-03-19 17:00:00               480
9  2018-03-20 09:00:00 2018-03-20 17:00:00               480
10 2011-05-23 09:00:00 2011-05-23 17:00:00               480
11 2011-05-24 09:00:00 2011-05-24 17:00:00               480
12 2011-05-25 09:00:00 2011-05-25 17:00:00               480

ORIGINAL ANSWER:

Here's a way to do what your question asks:

df = pd.concat([df.assign(durationInMinutes=df.durationInMinutes/3, 
    orig_row=i).reset_index() for i in range(3)])
for col in ['startTime', 'endTime']:
    df[col] = (pd.to_datetime(df[col], dayfirst=True).astype('int64')   
        df.orig_row * 24*60*60*1_000_000_000).astype('datetime64[ns]')
df = df.sort_values('index').drop(columns=['index', 'orig_row'])

Explanation:

overwrite the durationInMinutes column by dividing its contents by 3.
create 3 copies of df, each with a new column orig_row containing an integer corresponding to which number the copy is (0, 1 or 2).
for each of startTime and endTime, convert the string value to a datetime in nanoseconds and add the nanosecond equivalent of the number of days in orig_row to it (24 hours * 60 minutes * 60 seconds * 1bn nanoseconds).
sort by the original index (so consecutive days are grouped together) and drop the intermediate columns.

Input:

             startTime              endTime  durationInMinutes
0  01-06-2010 09:00:00  02-06-2010 17:00:00               1440
1  13-02-2016 09:00:00  14-02-2016 17:00:00               1440
2  18-03-2018 09:00:00  19-03-2018 17:00:00               1440
3  23-05-2011 09:00:00  24-05-2011 17:00:00               1440

Output:

            startTime             endTime  durationInMinutes
0 2010-06-01 09:00:00 2010-06-02 17:00:00              480.0
0 2010-06-02 09:00:00 2010-06-03 17:00:00              480.0
0 2010-06-03 09:00:00 2010-06-04 17:00:00              480.0
1 2016-02-13 09:00:00 2016-02-14 17:00:00              480.0
1 2016-02-14 09:00:00 2016-02-15 17:00:00              480.0
1 2016-02-15 09:00:00 2016-02-16 17:00:00              480.0
2 2018-03-18 09:00:00 2018-03-19 17:00:00              480.0
2 2018-03-19 09:00:00 2018-03-20 17:00:00              480.0
2 2018-03-20 09:00:00 2018-03-21 17:00:00              480.0
3 2011-05-23 09:00:00 2011-05-24 17:00:00              480.0
3 2011-05-24 09:00:00 2011-05-25 17:00:00              480.0
3 2011-05-25 09:00:00 2011-05-26 17:00:00              480.0