I have a dataframe which looks like below:
trip_id date journey_duration weekday
0 913460 2019-08-31 00:13:00 Sat
1 913459 2019-08-31 00:17:00 Sat
2 913455 2019-08-31 00:05:00 Sat
3 913454 2019-08-31 00:07:00 Sat
4 913453 2019-08-31 00:13:00 Sat
5 913452 2019-08-31 00:05:00 Sat
6 913451 2019-08-31 00:15:00 Sat
7 913450 2019-08-31 00:04:00 Sat
8 913449 2019-08-31 00:03:00 Sat
9 913448 2019-08-31 00:15:00 Sat
10 913443 2019-08-31 00:12:00 Sat
11 913442 2019-08-31 00:10:00 Sat
12 913441 2019-08-31 00:07:00 Sat
13 913440 2019-08-31 00:05:00 Sat
14 913435 2019-08-31 00:08:00 Sat
15 913434 2019-08-31 00:05:00 Sat
16 913433 2019-08-31 00:03:00 Sat
17 913432 2019-08-31 00:12:00 Sat
18 913431 2019-08-31 00:10:00 Sat
19 913429 2019-08-31 00:15:00 Sat
I would like to aggregate it to a daily level - changing the trip_id column to a count of number of trips per day and the journey duration to an average per day
I have used this:
trip_data = (pd.to_datetime(trip_data['date'])
.dt.floor('d')
.value_counts()
.rename_axis('date')
.reset_index(name='count'))
which works well to count the trips per day however this drops the journey duration
hope that makes sense, conscious my nomenclature might not be there as I'm a newbie
Thanks
CodePudding user response:
Here's a way to do what your question asks:
trip_data.date = pd.to_datetime(trip_data.date)
trip_data.journey_duration = pd.to_timedelta(trip_data.journey_duration)
trip_data = ( trip_data
.assign(date=trip_data.date.dt.floor('d'))
.groupby('date', as_index=False)
.agg(count=("trip_id", "count"), journey_duration=("journey_duration", "mean")) )
Output:
date count journey_duration
0 2019-08-31 20 0 days 00:09:12
Explanation:
- ensure
date
is a pandas datetime andjourney_duration
is a pandas timedelta type - round
date
to its day component usingfloor()
- use
groupby()
to prepare for aggregation by uniquedate
- use
agg()
to aggregatetrip_id
usingcount
in a column namedcount
andjourney_duration
usingmean
.
CodePudding user response:
First, convert date
and journey_duration
to datetime objects. Since journey_duration
doesn't contain the day, month etc. it might be a better idea to use pd.to_timedelta
for its conversion:
df['date'] = pd.to_datetime(df['date'])
df['journey_duration'] = pd.to_timedelta(df['journey_duration'])
Then set date
as the index and convert the dataframe to the daily frequency and use agg
for multiple operations on different columns:
df.set_index('date').resample('D').agg(no_trips_per_day=('trip_id', 'count'), \
avg_duration=('journey_duration', 'mean'))
no_trips_per_day avg_duration
date
2019-08-31 20 0 days 00:09:12