Home > Mobile >  Grouping and Summing Multiple Columns in a DataFrame
Grouping and Summing Multiple Columns in a DataFrame

Time:10-15

I have taken the following sample of data.

trip_id,vehicle_id,customer_id,fleet,trip_start,distance_miles,journey_duration
1,d3550e496af4,442342ac078e,Salt Lake City,2020-06-02 16:12:22,2.30266927956152,0 days 00:13:12.549351
2,2afc10228a2b,4d3ea6d8bb4b,Provo,2020-06-02 16:17:21,0.495335235709548,0 days 00:02:48.407770
3,442342ac078e,442342ac078e,Salt Lake City,2020-06-02 16:43:05,0.7933172567617909,0 days 00:15:33.417755
4,8701da8e6582,567c93d144ed,Provo,2020-06-02 19:34:40,0.9158009891104686,0 days 00:07:04.912849
5,b70fa4bc1486,391526cd2b71,Provo,2020-06-02 20:02:37,1.6248457639858709,0 days 00:11:51.821411
6,f6f0a689fc3a,2b9d754d1c4f,Provo,2020-06-02 20:57:27,0.8310125874177197,0 days 00:07:37.959237

I read this data into a df using:

df = pd.read_clipboard(sep=',')

enter image description here

What I'm struggling to figure out is how to create a summary table using this information. The output df I would like is below:

enter image description here

Here, I want to group by city, while being able to calculate the total number of unique vehicles, unique customers, trips, and the sum of the total distance and duration (in minutes) of every journey in that city.

For example, we can see that for row 0 and 2, there are 2 unique vehicles but it's from the same customer.

I have tried using groupby/summing/unique methods but have had issues when it comes to certain values I want to obtain. Any idea of where to go next? Cheers

CodePudding user response:

You need to convert a few columns and then you can just group and summarise

df['trip_start'] = pd.to_datetime(df['trip_start'], format='%Y-%d-%m %H:%M:%S')
df['journey_duration'] = pd.to_timedelta(df['journey_duration'])

df['Date'] = df['trip_start'].dt.strftime('%b %Y')

df.groupby(['Date', 'fleet']).agg(
    Total_Customers = ('customer_id', 'nunique'),
    Total_Vehicles = ('vehicle_id', 'nunique'),
    Total_Trips = ('trip_id', 'nunique'),
    Total_Distance = ('distance_miles', 'sum'),
    Total_Duration = ('journey_duration', 'sum'),
    )
  • Related