Home > Software design >  Struggling with Pivot table in Python - collapsing rows and performing calculations
Struggling with Pivot table in Python - collapsing rows and performing calculations

Time:04-20

I'm wondering if anyone can point me in the right direction. I'm in over my head on a pivot table. I have a Pandas dataframe. I loaded two .csv files, dropped/renamed columns, and used concat to put them together. That all went fine. It's a list of transit stops, showing their latitude and longitude and how many times a bus stops at that stop on weekdays, Saturdays, and Sundays.

Here's what it looks like now:


Current Dataframe

stop_code  service_id   daily_trips   stop_lat      stop_lon
99994     saturday   36         40.439562   -79.995332
99994     sunday      32        40.439562   -79.995332
99994     weekday      50          40.439562   -79.995332
99994     saturday   71      40.439562  -79.995332
99994     sunday     65      40.439562   -79.995332
99994     weekday   89     40.439562  -79.995332

Etc. for 25,000 or so different stop_codes. I would like to make rows that each represent one stop code (and its latitude and longitude), along with the total number of weekday, Saturday, and Sunday trips for that stop. So, it would look like:

stop_code  daily_wkdy_trips  daily_sat_trips  daily_sun_trips   stop_lat     stop_lon
99994     183        141      130       40.439562  -79.995332

A simpler illustration of what I want to do is here:
New Dataframe

So, is this something I should I use a pivot table to do? I can't figure out how to do all this at once:

  • Collapse the data to one row for each stop_code (and its latitude and longitude).
  • Add the values for each category of service_id together.
  • Create new columns for those values.

CodePudding user response:

You can use .groupby() followed by .agg:

x = df.groupby("stop_code").agg(
    stop_lat=("stop_lat", "first"),
    stop_lon=("stop_lon", "first"),
    daily_wkdy_trips=(
        "daily_trips",
        lambda x: x[df.loc[x.index, "service_id"].eq("weekday")].sum(),
    ),
    daily_sat_trips=(
        "daily_trips",
        lambda x: x[df.loc[x.index, "service_id"].eq("saturday")].sum(),
    ),
    daily_sun_trips=(
        "daily_trips",
        lambda x: x[df.loc[x.index, "service_id"].eq("sunday")].sum(),
    ),
)
print(x)

Prints:

            stop_lat   stop_lon  daily_wkdy_trips  daily_sat_trips  daily_sun_trips
stop_code                                                                          
99994      40.439562 -79.995332               183              141              130

CodePudding user response:

I would suggest separating 2 different tasks and then joining the results:

pivot = pd.pivot_table(
    data=df,
    values='daily_trips',
    index='stop_code',
    columns='service_id',
    aggfunc='sum'
)
grouped = df.groupby('stop_code')['stop_lat', 'stop_lon'].first()
result = pivot.join(grouped)
  • Related