I want to know how long a vehicle has been driven during certain time segments, in this case one-hour segments. To build the segments I use the following code:
import pandas as pd

init = '2022-03-10 01:00:00'
end = '2022-03-10 06:00:00'
freq = '1h'
bucket = pd.DataFrame(
    {'start_date': pd.date_range(start=init, end=end, freq=freq)}
)
bucket['end_date'] = bucket['start_date'] + pd.Timedelta(seconds=3600)
bucket
start_date end_date
0 2022-03-10 01:00:00 2022-03-10 02:00:00
1 2022-03-10 02:00:00 2022-03-10 03:00:00
2 2022-03-10 03:00:00 2022-03-10 04:00:00
3 2022-03-10 04:00:00 2022-03-10 05:00:00
4 2022-03-10 05:00:00 2022-03-10 06:00:00
5 2022-03-10 06:00:00 2022-03-10 07:00:00
From the database I get all the trips whose start or end falls inside this period. For this example, the data obtained is:
from datetime import datetime

import pandas as pd

df = pd.DataFrame({
    'start_date': ['2022-03-10 01:20:00', '2022-03-10 02:18:00', '2022-03-10 02:10:00', '2022-03-10 02:40:00', '2022-03-10 02:45:00', '2022-03-10 03:05:00', '2022-03-10 03:12:00', '2022-03-10 05:30:00'],
    'end_date': ['2022-03-10 01:32:00', '2022-03-10 02:42:00', '2022-03-10 02:23:00', '2022-03-10 03:20:00', '2022-03-10 02:58:00', '2022-03-10 03:28:00', pd.NA, '2022-03-10 05:48:00'],
    'number_of_trip': ["637hui", "384nfj", "102fiu", "948pvc", "473mds", "103fkd", "905783", "498wsq"],
    'id': [1, 2, 3, 4, 5, 6, 7, 8]
})
# open-ended trips (no end_date yet) are filled with the current time
df = df.replace(pd.NA, datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
start_date end_date number_of_trip id seconds
0 2022-03-10 01:20:00 2022-03-10 01:32:00 637hui 1 720
1 2022-03-10 02:18:00 2022-03-10 02:42:00 384nfj 2 1440
2 2022-03-10 02:10:00 2022-03-10 02:23:00 102fiu 3 780
3 2022-03-10 02:40:00 2022-03-10 03:20:00 948pvc 4 2400
4 2022-03-10 02:45:00 2022-03-10 02:58:00 473mds 5 780
5 2022-03-10 03:05:00 2022-03-10 03:28:00 103fkd 6 1380
6 2022-03-10 03:12:00 2022-04-06 17:04:15 905783 7 0
7 2022-03-10 05:30:00 2022-03-10 05:48:00 498wsq 8 1080
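The `seconds` column in the table above is not produced by the snippet; presumably it was derived as the trip duration, something like the sketch below (an assumption, not shown in the original; the open-ended trip `905783` appears to have been zeroed out separately):

```python
import pandas as pd

# hypothetical reconstruction of the `seconds` column: trip duration in whole seconds
trips = pd.DataFrame({
    'start_date': pd.to_datetime(['2022-03-10 01:20:00', '2022-03-10 02:18:00']),
    'end_date': pd.to_datetime(['2022-03-10 01:32:00', '2022-03-10 02:42:00']),
})
trips['seconds'] = (trips['end_date'] - trips['start_date']).dt.total_seconds().astype(int)
```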
Now, I can't do a direct summation like this:
times = df.resample(freq, on='start_date')['seconds'].sum()
start_date
2022-03-10 01:00:00 720
2022-03-10 02:00:00 5400
2022-03-10 03:00:00 1380
2022-03-10 04:00:00 0
2022-03-10 05:00:00 1080
The reason is that it only considers the start date, not the end date: if a trip spans two segments, all of its time is added to the first segment, when in reality part of it should be counted in one segment and the rest in the next.
For the segment from 2:00:00 to 3:00:00 there are 4 rows that overlap it, but not all of them completely. The rows are:
1 2022-03-10 02:18:00 2022-03-10 02:42:00 384nfj 2 1440
2 2022-03-10 02:10:00 2022-03-10 02:23:00 102fiu 3 780
3 2022-03-10 02:40:00 2022-03-10 03:20:00 948pvc 4 2400 # This spans two segments: 02:00:00-03:00:00 (1200) and 03:00:00-04:00:00 (1200)
4 2022-03-10 02:45:00 2022-03-10 02:58:00 473mds 5 780
The resample code adds these 4 rows as 1440 + 780 + 2400 + 780, giving the result of 5400. However, row 3 belongs to two segments, 2:00-3:00 and 3:00-4:00, so for the 2:00-3:00 segment only the time elapsed inside it should be counted, in this case from 2:40:00 to 3:00:00, which is 1200 seconds. That makes the sum for the 2:00-3:00 segment 1440 + 780 + 1200 + 780 = 4200.
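The per-segment contribution of each overlapping trip can be computed by clipping its start and end to the segment edges; a minimal sketch for the 2:00-3:00 segment, using the four rows above:

```python
import pandas as pd

# the four trips that overlap the 02:00-03:00 segment
trips = pd.DataFrame({
    'start_date': pd.to_datetime(['2022-03-10 02:18:00', '2022-03-10 02:10:00',
                                  '2022-03-10 02:40:00', '2022-03-10 02:45:00']),
    'end_date': pd.to_datetime(['2022-03-10 02:42:00', '2022-03-10 02:23:00',
                                '2022-03-10 03:20:00', '2022-03-10 02:58:00']),
})
lo = pd.Timestamp('2022-03-10 02:00:00')
hi = pd.Timestamp('2022-03-10 03:00:00')
# clip each trip to the segment edges, then sum the clipped durations
overlap = (trips['end_date'].clip(upper=hi)
           - trips['start_date'].clip(lower=lo)).dt.total_seconds()
total = int(overlap.clip(lower=0).sum())  # 1440 + 780 + 1200 + 780 = 4200
```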
I have some code that gives the correct values, but it is very slow. Is there a way in pandas to improve this?
CODE:
from datetime import timedelta, datetime
import pandas as pd

def bucket_count(seconds_bucket, data, inicio, fin):
    result = pd.DataFrame()
    bucket = pd.DataFrame(
        {'start_date': pd.date_range(start=inicio, end=fin, freq="H"),
         "color": "#FF0000"}
    )
    bucket['end_date'] = bucket['start_date'] + pd.Timedelta(seconds=seconds_bucket)
    for _, row_bucket in bucket.iterrows():
        inicio = row_bucket['start_date']
        fin = inicio + timedelta(seconds=seconds_bucket)
        # trips that start or end inside this bucket
        df = data[
            ((inicio < data['start_date']) & (fin > data['start_date'])) |
            ((data['end_date'] > inicio) & (data['end_date'] < fin))
        ]
        for _, row in df.iterrows():
            counter = 0
            if row['start_date'] > inicio and row['end_date'] < fin:
                # trip fully inside the bucket
                counter = int((row['end_date'] - row['start_date']).total_seconds())
            elif row['start_date'] < inicio and row['end_date'] < fin:
                # trip started before the bucket and ends inside it
                counter = int((row['end_date'] - inicio).total_seconds())
            elif row['start_date'] > inicio and row['end_date'] > fin:
                # trip starts inside the bucket and ends after it
                counter = int((fin - row['start_date']).total_seconds())
            row['start_date'] = inicio
            row['end_date'] = fin
            row['seconds'] = counter
            result = pd.concat([result, row.to_frame().T], ignore_index=True)
    return result.groupby(['start_date'])["seconds"].apply(lambda x: x.astype(int).sum()).reset_index()
inicio = '2022-03-10 01:00:00'
fin = '2022-03-10 06:00:00'
df = pd.DataFrame({
'start_date': ['2022-03-10 01:20:00', '2022-03-10 02:18:00', '2022-03-10 02:10:00', '2022-03-10 02:40:00', '2022-03-10 02:45:00', '2022-03-10 03:05:00', '2022-03-10 03:12:00', '2022-03-10 05:30:00'],
'end_date': ['2022-03-10 01:32:00', '2022-03-10 02:42:00', '2022-03-10 02:23:00', '2022-03-10 03:20:00', '2022-03-10 02:58:00', '2022-03-10 03:28:00', pd.NA, '2022-03-10 05:48:00'],
'number_of_trip': ["637hui", "384nfj", "102fiu", "948pvc", "473mds", "103fkd", "905783", "498wsq"],
'id': [1, 2, 3, 4, 5, 6, 7, 8]
})
df = df.replace(pd.NA, datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
a = bucket_count(3600, df, inicio, fin)
This result is correct
start_date seconds
0 2022-03-10 01:00:00 720
1 2022-03-10 02:00:00 4200
2 2022-03-10 03:00:00 5460
3 2022-03-10 05:00:00 1080
CodePudding user response:
This is what I came up with.
First off, the timing of your method was:
%%timeit
bucket_count(3600, df, inicio, fin)
>>> 44 ms ± 507 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Versus the method I came up with:
%%timeit
bucket_count_new(3600, df, inicio, fin)
>>> 17.1 ms ± 577 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So roughly 2.5× faster. I think the gap will be even bigger on larger samples, but I did not test this.
Additionally, I think your method has a bug in it, but I did not troubleshoot it. Here is what I came up with and the result.
import pandas as pd

def bucket_count_new(seconds_bucket: int, data: pd.DataFrame, start: pd.Timestamp, end: pd.Timestamp):
    # filter out rows that are not within the range of start to end
    filtered_data = data.loc[data.start_date.lt(end) & data.end_date.gt(start)].copy()
    # clip records starting or ending outside of the range to the time range
    filtered_data.loc[filtered_data.start_date.lt(start), 'start_date'] = start
    filtered_data.loc[filtered_data.end_date.gt(end) | filtered_data.end_date.isnull(), 'end_date'] = end
    # add total trip seconds within date range
    filtered_data['trip_seconds'] = (filtered_data.end_date - filtered_data.start_date).dt.total_seconds()
    results = pd.DataFrame(
        {'start_date': pd.date_range(start=start, end=end, freq=pd.Timedelta(seconds=seconds_bucket)),
         'total_seconds': 0})
    for ind, trip_start, trip_end, trip_num, trip_id, trip_seconds in filtered_data.to_records():
        # get first bucket index after the trip start
        first_ind_after_start = results['start_date'].gt(trip_start).values.argmax()
        # calculate the timedelta from the trip start to the next bucket start
        first_timedelta = (results.loc[first_ind_after_start, 'start_date'] - trip_start).total_seconds()
        # add whichever is smaller: the time up to the next bucket start, or the whole trip
        results.loc[
            first_ind_after_start - 1, 'total_seconds'] += first_timedelta if first_timedelta <= trip_seconds else trip_seconds
        # subtract the time already added to the first bucket for this trip
        trip_seconds -= first_timedelta
        # iterate through the next buckets, subtracting time until it is used up
        ind = first_ind_after_start
        while trip_seconds >= seconds_bucket:
            results.loc[ind, 'total_seconds'] += seconds_bucket
            trip_seconds -= seconds_bucket
            ind += 1
        if trip_seconds > 0:
            results.loc[ind, 'total_seconds'] += trip_seconds
    return results
inicio = pd.Timestamp('2022-03-10 01:00:00')
fin = pd.Timestamp('2022-03-10 06:00:00')
df = pd.DataFrame({
'start_date': ['2022-03-10 01:20:00', '2022-03-10 02:18:00', '2022-03-10 02:10:00', '2022-03-10 02:40:00',
'2022-03-10 02:45:00', '2022-03-10 03:05:00', '2022-03-10 03:12:00', '2022-03-10 05:30:00'],
'end_date': ['2022-03-10 01:32:00', '2022-03-10 02:42:00', '2022-03-10 02:23:00', '2022-03-10 03:20:00',
'2022-03-10 02:58:00', '2022-03-10 03:28:00', pd.NA, '2022-03-10 05:48:00'],
'number_of_trip': ["637hui", "384nfj", "102fiu", "948pvc", "473mds", "103fkd", "905783", "498wsq"],
'id': [1, 2, 3, 4, 5, 6, 7, 8]
})
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
a = bucket_count_new(3600, df.copy(deep=True), start=inicio, end=fin)
Result:
a
start_date total_seconds
0 2022-03-10 01:00:00 720
1 2022-03-10 02:00:00 4200
2 2022-03-10 03:00:00 2580
3 2022-03-10 04:00:00 0
4 2022-03-10 05:00:00 1080
5 2022-03-10 06:00:00 0
I believe this is correct (unless my initial filtering is wrong), because the total seconds within the range of start to end matches the total seconds in a. But I did not have time to debug it. Please let me know if these are the correct results.
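For comparison, here is a fully vectorized sketch of the same computation with no per-trip Python loop: cross-join trips with buckets, clip each pair to its overlap, and group by bucket. It drops the open-ended (NaT) trip the same way the filter above does, so it reproduces the same totals. `merge(how='cross')` needs pandas ≥ 1.2.

```python
import pandas as pd

# the trips with a known end; the open-ended trip (NaT end_date) is dropped
trips = pd.DataFrame({
    'start_date': pd.to_datetime(['2022-03-10 01:20:00', '2022-03-10 02:18:00', '2022-03-10 02:10:00',
                                  '2022-03-10 02:40:00', '2022-03-10 02:45:00', '2022-03-10 03:05:00',
                                  '2022-03-10 05:30:00']),
    'end_date': pd.to_datetime(['2022-03-10 01:32:00', '2022-03-10 02:42:00', '2022-03-10 02:23:00',
                                '2022-03-10 03:20:00', '2022-03-10 02:58:00', '2022-03-10 03:28:00',
                                '2022-03-10 05:48:00']),
})
buckets = pd.DataFrame(
    {'bucket_start': pd.date_range('2022-03-10 01:00:00', '2022-03-10 06:00:00', freq='1h')})
buckets['bucket_end'] = buckets['bucket_start'] + pd.Timedelta(hours=1)
# pair every trip with every bucket, then keep only the overlapping part of each pair
m = trips.merge(buckets, how='cross')
overlap = (m[['end_date', 'bucket_end']].min(axis=1)
           - m[['start_date', 'bucket_start']].max(axis=1)).dt.total_seconds().clip(lower=0)
out = overlap.groupby(m['bucket_start']).sum().astype(int)
```

The cross join is fine at this scale; for very many trips and buckets it can be pruned first with an overlap filter before grouping.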