Home > database >  Pandas dataframe create new rows based on condition from another column
Pandas dataframe create new rows based on condition from another column

Time:06-09

I have dataframe:

data = {'startTime':['01-06-2010 09:00:00', '13-02-2016 09:00:00', '18-03-2018 09:00:00', '23-05-2011 09:00:00'], 'endTime':['02-06-2010 17:00:00', '14-02-2016 17:00:00', '19-03-2018 17:00:00', '24-05-2011 17:00:00'], 'durationInMinutes': [1440, 1440, 1440, 1440]}

df = pd.DataFrame(data)

I want to make additional rows by dividing 1440 minutes into equal intervals of 8 hours per day. So, 1440 minutes would be equal to 3 days (3 additional rows) (9am - 5pm). Minutes can be more than 1440. New rows in startTime and endTime would be:

startTime               endTime

01-06-2010 09:00:00     01-06-2010 17:00:00
02-06-2010 09:00:00     02-06-2010 17:00:00
03-06-2010 09:00:00     03-06-2010 17:00:00
13-02-2016 09:00:00     13-02-2016 17:00:00
14-02-2016 09:00:00     14-02-2016 17:00:00
15-02-2016 09:00:00     15-02-2016 17:00:00

Can anyone please help me with this? Thanks.

CodePudding user response:

IIUC, you can use:

# ensure datetime
df[['startTime', 'endTime']] = df[['startTime', 'endTime']].apply(pd.to_datetime, dayfirst=True)

# compute number of rows in days
extra = np.ceil(df['durationInMinutes'].div(60*8)).astype(int)
# compute a shift ( 0,  1,  2days etc.)
shift = extra.repeat(extra).groupby(level=0).cumcount().mul(pd.Timedelta('1day'))

# duplicate the rows
df2 = df.loc[df.index.repeat(extra)].reset_index(drop=True)

# add the shift
df2[['startTime', 'endTime']] = df2[['startTime', 'endTime']].add(shift.values, axis=0)

print(df2)

output:

             startTime             endTime  durationInMinutes
0  2010-06-01 09:00:00 2010-06-02 17:00:00               1440
1  2010-06-02 09:00:00 2010-06-03 17:00:00               1440
2  2010-06-03 09:00:00 2010-06-04 17:00:00               1440
3  2016-02-13 09:00:00 2016-02-14 17:00:00               1440
4  2016-02-14 09:00:00 2016-02-15 17:00:00               1440
5  2016-02-15 09:00:00 2016-02-16 17:00:00               1440
6  2018-03-18 09:00:00 2018-03-19 17:00:00               1440
7  2018-03-19 09:00:00 2018-03-20 17:00:00               1440
8  2018-03-20 09:00:00 2018-03-21 17:00:00               1440
9  2011-05-23 09:00:00 2011-05-24 17:00:00               1440
10 2011-05-24 09:00:00 2011-05-25 17:00:00               1440
11 2011-05-25 09:00:00 2011-05-26 17:00:00               1440

CodePudding user response:

UPDATED: Here is a way to do what you have asked in your question and clarified in your comments:

df['days'] = df.durationInMinutes // (8 * 60)   
    (df.durationInMinutes % (8 * 60) > 0).astype(int)
df['durationInMinutes'] = df.apply(lambda x: [8 * 60] * (x.days - 1)   
    [x.durationInMinutes % (8 * 60)   
    (x.durationInMinutes % (8 * 60) == 0) * 8 * 60], axis=1)
df['daysToAdd'] = df.days.apply(lambda x: range(x))
df = df.explode(['durationInMinutes', 'daysToAdd'])
df.startTime = (pd.to_datetime(df.startTime, dayfirst=True).astype('int64')   
        df.daysToAdd * 24*60*60*1_000_000_000).astype('datetime64[ns]')
df.endTime = (df.startTime.astype('int64')   
        df.durationInMinutes * 60*1_000_000_000).astype('datetime64[ns]')
df = df.drop(columns=['days', 'daysToAdd']).reset_index(drop=True)

Explanation:

  • add column days with number of result rows corresponding to each input row.
  • update durationInMinutes column to contain a list for each row of the minutes in each result row originating from the input row.
  • add column daysToAdd with a list for each row of the number of days to add to the startTime for each result row originating from the input row.
  • use explode() to create result rows with one value each from the lists in durationInMinutes and daysToAdd.
  • add the nanosecond equivalent of daysToAdd to startTime.
  • update endTime to be startTime plus the nanosecond equivalent of durationInMinutes.
  • use drop() to eliminate unneeded columns, and use reset_index() to get an integer range index starting at 0 that advances by 1 for each row.

Input:

             startTime              endTime  durationInMinutes
0  01-06-2010 09:00:00  02-06-2010 17:00:00               1445
1  13-02-2016 09:00:00  14-02-2016 17:00:00               1435
2  18-03-2018 09:00:00  19-03-2018 17:00:00               1440
3  23-05-2011 09:00:00  24-05-2011 17:00:00               1440

Output:

             startTime             endTime durationInMinutes
0  2010-06-01 09:00:00 2010-06-01 17:00:00               480
1  2010-06-02 09:00:00 2010-06-02 17:00:00               480
2  2010-06-03 09:00:00 2010-06-03 17:00:00               480
3  2010-06-04 09:00:00 2010-06-04 09:05:00                 5
4  2016-02-13 09:00:00 2016-02-13 17:00:00               480
5  2016-02-14 09:00:00 2016-02-14 17:00:00               480
6  2016-02-15 09:00:00 2016-02-15 16:55:00               475
7  2018-03-18 09:00:00 2018-03-18 17:00:00               480
8  2018-03-19 09:00:00 2018-03-19 17:00:00               480
9  2018-03-20 09:00:00 2018-03-20 17:00:00               480
10 2011-05-23 09:00:00 2011-05-23 17:00:00               480
11 2011-05-24 09:00:00 2011-05-24 17:00:00               480
12 2011-05-25 09:00:00 2011-05-25 17:00:00               480

ORIGINAL ANSWER:

Here's a way to do what your question asks:

df = pd.concat([df.assign(durationInMinutes=df.durationInMinutes/3, 
    orig_row=i).reset_index() for i in range(3)])
for col in ['startTime', 'endTime']:
    df[col] = (pd.to_datetime(df[col], dayfirst=True).astype('int64')   
        df.orig_row * 24*60*60*1_000_000_000).astype('datetime64[ns]')
df = df.sort_values('index').drop(columns=['index', 'orig_row'])

Explanation:

  • overwrite the durationInMinutes column by dividing its contents by 3.
  • create 3 copies of df, each with a new column orig_row containing an integer corresponding to which number the copy is (0, 1 or 2).
  • for each of startTime and endTime, convert the string value to a datetime in nanoseconds and add the nanosecond equivalent of the number of days in orig_row to it (24 hours * 60 minutes * 60 seconds * 1bn nanoseconds).
  • sort by the original index (so consecutive days are grouped together) and drop the intermediate columns.

Input:

             startTime              endTime  durationInMinutes
0  01-06-2010 09:00:00  02-06-2010 17:00:00               1440
1  13-02-2016 09:00:00  14-02-2016 17:00:00               1440
2  18-03-2018 09:00:00  19-03-2018 17:00:00               1440
3  23-05-2011 09:00:00  24-05-2011 17:00:00               1440

Output:

            startTime             endTime  durationInMinutes
0 2010-06-01 09:00:00 2010-06-02 17:00:00              480.0
0 2010-06-02 09:00:00 2010-06-03 17:00:00              480.0
0 2010-06-03 09:00:00 2010-06-04 17:00:00              480.0
1 2016-02-13 09:00:00 2016-02-14 17:00:00              480.0
1 2016-02-14 09:00:00 2016-02-15 17:00:00              480.0
1 2016-02-15 09:00:00 2016-02-16 17:00:00              480.0
2 2018-03-18 09:00:00 2018-03-19 17:00:00              480.0
2 2018-03-19 09:00:00 2018-03-20 17:00:00              480.0
2 2018-03-20 09:00:00 2018-03-21 17:00:00              480.0
3 2011-05-23 09:00:00 2011-05-24 17:00:00              480.0
3 2011-05-24 09:00:00 2011-05-25 17:00:00              480.0
3 2011-05-25 09:00:00 2011-05-26 17:00:00              480.0
  • Related