I'm currently studying pandas and I come from an R/dplyr/tidyverse background.
Pandas has a not-so-intuitive API and how would I elegantly rewrite such operation from dplyr using pandas syntax?
library("nycflights13")
library("tidyverse")
delays <- flights %>%
group_by(dest) %>%
summarize(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count > 20, dest != "HNL")
CodePudding user response:
pd.DataFrame.agg method doesn't allow much flexibility for changing columns' names in the method itself
That's not exactly true. You could actually rename the columns inside agg
similar to in R. Although it is a better idea to not use count
as a column name as it is also an attribute:
delays = (
flights
.groupby('dest', as_index=False)
.agg(
count=('year', 'count'),
dist=('distance', 'mean'),
delay=('arr_delay', 'mean'))
.query('count > 20 & dest != "HNL"')
.reset_index(drop=True)
)
CodePudding user response:
We can write a pandas concatenation of functions and methods that results in the same dataframe/tibble:
delays = (
flights
.groupby('dest', as_index=False)
.agg({
'year': 'count',
'distance': 'mean',
'arr_delay': 'mean',
})
.rename(columns={
'year': 'count',
'distance': 'dist',
'arr_delay': 'delay',
})
.query('count > 20 & dest != "HNL"')
.reset_index(drop=True)
)
It's more lengthy: Pandas' pd.DataFrame.agg
method doesn't allow much flexibility for changing columns' names in the method itself.
But it's as elegant, clean and clear as pandas allows us to go.