I'm wondering if anyone can point me in the right direction. I'm in over my head on a pivot table. I have a Pandas dataframe. I loaded two .csv files, dropped/renamed columns, and used concat to put them together. That all went fine. It's a list of transit stops, showing their latitude and longitude and how many times a bus stops at that stop on weekdays, Saturdays, and Sundays.
Here's what it looks like now:
stop_code service_id daily_trips stop_lat stop_lon
99994 saturday 36 40.439562 -79.995332
99994 sunday 32 40.439562 -79.995332
99994 weekday 50 40.439562 -79.995332
99994 saturday 71 40.439562 -79.995332
99994 sunday 65 40.439562 -79.995332
99994 weekday 89 40.439562 -79.995332
Etc. for 25,000 or so different stop_codes. I would like to make rows that each represent one stop code (and its latitude and longitude), along with the total number of weekday, Saturday, and Sunday trips for that stop. So, it would look like:
stop_code daily_wkdy_trips daily_sat_trips daily_sun_trips stop_lat stop_lon
99994 183 141 130 40.439562 -79.995332
A simpler illustration of what I want to do is here:
New Dataframe
So, is this something I should I use a pivot table to do? I can't figure out how to do all this at once:
- Collapse the data to one row for each stop_code (and its latitude and longitude).
- Add the values for each category of service_id together.
- Create new columns for those values.
CodePudding user response:
You can use .groupby()
followed by .agg
:
x = df.groupby("stop_code").agg(
stop_lat=("stop_lat", "first"),
stop_lon=("stop_lon", "first"),
daily_wkdy_trips=(
"daily_trips",
lambda x: x[df.loc[x.index, "service_id"].eq("weekday")].sum(),
),
daily_sat_trips=(
"daily_trips",
lambda x: x[df.loc[x.index, "service_id"].eq("saturday")].sum(),
),
daily_sun_trips=(
"daily_trips",
lambda x: x[df.loc[x.index, "service_id"].eq("sunday")].sum(),
),
)
print(x)
Prints:
stop_lat stop_lon daily_wkdy_trips daily_sat_trips daily_sun_trips
stop_code
99994 40.439562 -79.995332 183 141 130
CodePudding user response:
I would suggest separating 2 different tasks and then joining the results:
pivot = pd.pivot_table(
data=df,
values='daily_trips',
index='stop_code',
columns='service_id',
aggfunc='sum'
)
grouped = df.groupby('stop_code')['stop_lat', 'stop_lon'].first()
result = pivot.join(grouped)