How would I group, summarize and filter a DF in pandas in dplyr-fashion?-CodePudding

I'm currently studying pandas and I come from an R/dplyr/tidyverse background.

Pandas has a not-so-intuitive API and how would I elegantly rewrite such operation from dplyr using pandas syntax?

library("nycflights13")
library("tidyverse")

delays <- flights %>%
  group_by(dest) %>%
  summarize(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20, dest != "HNL")

CodePudding user response：

pd.DataFrame.agg method doesn't allow much flexibility for changing columns' names in the method itself

That's not exactly true. You could actually rename the columns inside agg similar to in R. Although it is a better idea to not use count as a column name as it is also an attribute:

    delays = (
    flights
    .groupby('dest', as_index=False)
    .agg(
        count=('year', 'count'),
        dist=('distance', 'mean'),
        delay=('arr_delay', 'mean'))
    .query('count > 20 & dest != "HNL"')
    .reset_index(drop=True)
)

CodePudding user response：

We can write a pandas concatenation of functions and methods that results in the same dataframe/tibble:

delays = (
    flights
    .groupby('dest', as_index=False)
    .agg({
        'year': 'count',
        'distance': 'mean',
        'arr_delay': 'mean',
    })
    .rename(columns={
        'year': 'count',
        'distance': 'dist',
        'arr_delay': 'delay',
    })
    .query('count > 20 & dest != "HNL"')
    .reset_index(drop=True)
)

It's more lengthy: Pandas' pd.DataFrame.agg method doesn't allow much flexibility for changing columns' names in the method itself.

But it's as elegant, clean and clear as pandas allows us to go.