I code mostly in R but I want to improve my python skills so now I do my analysis in python. When working with a data frame in R I would constantly make a new summarized dataframe with group_by and summarize. I am trying to do this in python with df.groupby().agg()
However I'm finding the process difficult. For example, in R I could apply functions and conditions to individual columns in summarize()
. I could also build off a new column just created in the previous line of code inside of summarize. I find this actually really cool what one can do in R.
Anyways how can I apply a function to an individual column like mean()
where a certain condition is met in another column? The following code shows how to do this in R and my attempt in python.
library(dplyr)
df %>% group_by(user_id) %>%
summarize(avg_parking_duration = mean(duration[type == "parking"]),
avg_reservation_duration = mean(duration[type == "reservation"])
)
The code above would group the data frame of parking transactions by user_id. It would then calculate the average parking duration from the column duration where the column type == "parking". Likewise the same is done for the duration of transactions of the type "reservation".
all that I have seen online for python is something like this
df.groupby('user_id').agg(
avg_parking_duration = ('duration','mean')
)
or
df.groupby('user_id').agg(
avg_parking_duration = ('duration',lambda x: x.mean())
)
I don't know unfortunately how to apply a condition then specifically to duration based on another column type. How do you go about this?
Please note I do not want to filter out first the data frame by type parkings or reservations before the agg()
because I would like to create new multiple columns with different functions inside of agg()
and I would need the filtered out information.
dataset
data = {'user_id': [u1, u1,u1,u1, u2, u2,u2,u2], 'type': ['parking', 'parking', 'reservation', 'reservation','parking', 'parking', 'reservation', 'reservation'],'duration': [43,60,20,23,143,290,25,15]}
pd.DataFrame.from_dict(data)
CodePudding user response:
Please note I do not want to filter out first the data frame by type parkings or reservations before the agg()
It is problematic. Because GroupBy.agg
working with each column separately, so cannot filter type
and get mean of duration
together.
Solution with helper column park
and used for mean
or hacky filtering original DataFrame per groups - both same ouput:
Solution with helper columns before groupby.agg
:
df = (df.assign(park = df["duration"].where(df["type"] == 'parking'),
res = df["duration"].where(df["type"] == 'reservation'))
.groupby('user_id').agg(avg_parking_duration = ('park','mean'),
avg_reservation_duration =('res','mean')
))
print (df)
avg_parking_duration avg_reservation_duration
user_id
1 51.5 21.5
2 216.5 20.0
Hacky way:
#necessary unique values in index
df = df.reset_index(drop=True)
f = lambda x: x.where(df.loc[x.index, "type"] == 'parking').mean()
f1 = lambda x: x.where(df.loc[x.index, "type"] == 'reservation').mean()
df = (df.groupby('user_id').agg(avg_parking_duration = ('duration',f),
avg_reservation_duration = ('duration',f1)
))
print (df)
avg_parking_duration avg_reservation_duration
user_id
1 51.5 21.5
2 216.5 20.0