Python function to aggregate data by day performing sum and average on other columns


I have a dataframe which looks like the one below:

trip_id     date    journey_duration    weekday
0   913460  2019-08-31  00:13:00    Sat
1   913459  2019-08-31  00:17:00    Sat
2   913455  2019-08-31  00:05:00    Sat
3   913454  2019-08-31  00:07:00    Sat
4   913453  2019-08-31  00:13:00    Sat
5   913452  2019-08-31  00:05:00    Sat
6   913451  2019-08-31  00:15:00    Sat
7   913450  2019-08-31  00:04:00    Sat
8   913449  2019-08-31  00:03:00    Sat
9   913448  2019-08-31  00:15:00    Sat
10  913443  2019-08-31  00:12:00    Sat
11  913442  2019-08-31  00:10:00    Sat
12  913441  2019-08-31  00:07:00    Sat
13  913440  2019-08-31  00:05:00    Sat
14  913435  2019-08-31  00:08:00    Sat
15  913434  2019-08-31  00:05:00    Sat
16  913433  2019-08-31  00:03:00    Sat
17  913432  2019-08-31  00:12:00    Sat
18  913431  2019-08-31  00:10:00    Sat
19  913429  2019-08-31  00:15:00    Sat

I would like to aggregate it to a daily level: change the trip_id column to a count of the number of trips per day and journey_duration to a daily average.

I have used this:

trip_data = (pd.to_datetime(trip_data['date'])
       .dt.floor('d')
       .value_counts()
       .rename_axis('date')
       .reset_index(name='count'))

This works well for counting the trips per day, but it drops the journey_duration column.

Hope that makes sense; I'm conscious my nomenclature might not be quite right as I'm a newbie.

Thanks

CodePudding user response:

Here's a way to do what your question asks:

trip_data.date = pd.to_datetime(trip_data.date)
trip_data.journey_duration = pd.to_timedelta(trip_data.journey_duration)
trip_data = (trip_data
    .assign(date=trip_data.date.dt.floor('d'))
    .groupby('date', as_index=False)
    .agg(count=('trip_id', 'count'), journey_duration=('journey_duration', 'mean')))

Output:

        date  count journey_duration
0 2019-08-31     20  0 days 00:09:12

Explanation:

  • ensure date is a pandas datetime and journey_duration is a pandas timedelta type
  • round date to its day component using floor()
  • use groupby() to prepare for aggregation by unique date
  • use agg() to count trip_id into a column named count and to take the mean of journey_duration.
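
Since the question title asks for a function, here is a minimal reusable sketch of the same approach; the names aggregate_daily and daily are just placeholders I chose for illustration:

import pandas as pd

def aggregate_daily(trip_data: pd.DataFrame) -> pd.DataFrame:
    # One row per day: number of trips and mean journey duration.
    out = trip_data.copy()
    out['date'] = pd.to_datetime(out['date']).dt.floor('d')
    out['journey_duration'] = pd.to_timedelta(out['journey_duration'])
    return (out
        .groupby('date', as_index=False)
        .agg(count=('trip_id', 'count'),
             journey_duration=('journey_duration', 'mean')))

daily = aggregate_daily(trip_data)

If you would rather see the average in minutes, you can divide the resulting timedelta column by pd.Timedelta(minutes=1) afterwards.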

CodePudding user response:

First, convert date and journey_duration to proper time types. Since journey_duration doesn't contain a day, month, etc., it is a better idea to use pd.to_timedelta for its conversion:

df['date'] = pd.to_datetime(df['date'])
df['journey_duration'] = pd.to_timedelta(df['journey_duration'])

Then set date as the index, resample the dataframe to a daily frequency, and use agg to apply multiple operations to different columns:

df.set_index('date').resample('D').agg(no_trips_per_day=('trip_id', 'count'),
                                       avg_duration=('journey_duration', 'mean'))

      no_trips_per_day  avg_duration
date        
2019-08-31      20  0 days 00:09:12
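
One thing to be aware of: resample('D') produces a row for every calendar day between the first and last date, so days with no trips appear with a count of 0, whereas the groupby('date') approach in the other answer only returns days that occur in the data. If you prefer to keep date as an ordinary column instead of setting it as the index, a pd.Grouper sketch of the same idea (assuming the conversions above have been applied) would be:

(df.groupby(pd.Grouper(key='date', freq='D'))
   .agg(no_trips_per_day=('trip_id', 'count'),
        avg_duration=('journey_duration', 'mean'))
   .reset_index())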