How to select the first representative element for each group of a DataFrameGroupBy object?

I have the following DataFrame:

import pandas as pd

data = [
    [1000, 1, 1], [1000, 1, 1], [1000, 1, 1], [1000, 1, 2], [1000, 1, 2],
    [1000, 1, 2], [2000, 0, 1], [2000, 0, 1], [2000, 1, 2],
    [2000, 0, 2], [2000, 1, 2]]
df = pd.DataFrame(data, columns=['route_id', 'direction_id', 'trip_id'])

Then I group df by the columns route_id and direction_id:

t_groups = df.groupby(['route_id','direction_id'])

For each unique (route_id, direction_id) combination, I would like to store the most popular (most frequent) trip_id, taking the first one if several are equally popular.

I have tried applying value_counts(), but I cannot get the most popular trip_id value out of it.

My expected output is:

   route_id  direction_id  trip_id
0      1000             1        1
1      2000             0        1
2      2000             1        2

Any suggestions?

CodePudding user response:

To get the most popular trip_id for each unique (route_id, direction_id) combination, you can apply value_counts() to the trip_id column of each group. value_counts() sorts its result by count in descending order, so the first entry of its index is the most popular trip_id of that group.

Here is an example of how you can do this:

import pandas as pd

# Create the dataframe
data = [[1000, 1, 1], [1000, 1, 1], [1000, 1, 1], [1000, 1, 2], [1000, 1, 2], [1000, 1, 2], [2000, 0, 1], [2000, 0, 1], [2000, 1, 2], [2000, 0, 2], [2000, 1, 2]]
df = pd.DataFrame(data, columns=['route_id', 'direction_id', 'trip_id'])

# Group the dataframe by route_id and direction_id
t_groups = df.groupby(['route_id','direction_id'])

# Get the most frequent trip_id for each group
# (value_counts() sorts by count, descending, so index[0] is the top value)
trip_ids = t_groups['trip_id'].agg(lambda x: x.value_counts().index[0])

# Move the group keys back into columns to match the expected layout
result = trip_ids.reset_index()

# Print the most frequent trip_id for each group
print(result)
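
If you want the tie-breaking to be explicit, here is a minimal sketch of the same idea using Series.mode() instead of value_counts(). mode() returns the most frequent value(s) in ascending order, so this variant picks the smallest trip_id on a tie, which is an assumption on my part about what "first" should mean. The helper name most_frequent is just for illustration.

```python
import pandas as pd

data = [
    [1000, 1, 1], [1000, 1, 1], [1000, 1, 1], [1000, 1, 2], [1000, 1, 2],
    [1000, 1, 2], [2000, 0, 1], [2000, 0, 1], [2000, 1, 2],
    [2000, 0, 2], [2000, 1, 2]]
df = pd.DataFrame(data, columns=['route_id', 'direction_id', 'trip_id'])


def most_frequent(values):
    # Illustrative helper (not part of pandas): Series.mode() returns the
    # most frequent value(s) sorted ascending, so iloc[0] is the smallest
    # of the tied values.
    return values.mode().iloc[0]


result = (
    df.groupby(['route_id', 'direction_id'])['trip_id']
      .agg(most_frequent)   # one trip_id per (route_id, direction_id)
      .reset_index()        # group keys become ordinary columns again
)
print(result)
```

For the sample data this should give trip_id 1, 1 and 2 for the three groups, the same as the expected output in the question.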

CodePudding user response:

This is what you are looking for.

df = df.groupby(['route_id', 'direction_id']).first().reset_index()

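The reset_index() call just moves the group keys from the index back into columns, so the result looks exactly like the output you want. Note that first() takes the first non-null trip_id of each group, not the most frequent one; for the sample data the two happen to coincide.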
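
A minimal sketch of how first() and a frequency count can disagree, using made-up data (the sample frame and the trip_id values 7 and 9 are hypothetical, not taken from the question):

```python
import pandas as pd

# Hypothetical data: trip 7 appears first, but trip 9 occurs more often.
sample = pd.DataFrame(
    [[1000, 1, 7], [1000, 1, 9], [1000, 1, 9]],
    columns=['route_id', 'direction_id', 'trip_id'])

keys = ['route_id', 'direction_id']

# first() keeps the first trip_id encountered in each group
by_first = sample.groupby(keys).first().reset_index()

# value_counts() ranks trip_ids by frequency; index[0] is the most frequent
by_count = (
    sample.groupby(keys)['trip_id']
          .agg(lambda s: s.value_counts().index[0])
          .reset_index()
)

print(by_first)   # trip_id is 7 here
print(by_count)   # trip_id is 9 here
```

So first() is fine if "representative" just means the first row of each group, but if you need the most popular trip_id you probably want a count-based version.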