I want to aggregate a pandas DataFrame using method chaining. I don't know how to start with the DataFrame and just refer to it when aggregating (using method chaining). Consider the following example that illustrates my intention:
Having this data:
import pandas as pd
my_df = pd.DataFrame({
'name': ['john', 'diana', 'rachel', 'chris'],
'favorite_color': ['red', 'blue', 'green', 'red']
})
my_df
#> name favorite_color
#> 0 john red
#> 1 diana blue
#> 2 rachel green
#> 3 chris red
and I want to end up with this summary table:
#> total_people total_ppl_who_like_red
#> 0 4 2
Of course there are so many ways to do it. One way, for instance, would be to build a new DataFrame:
desired_output_via_building_new_df = pd.DataFrame({
'total_people': [len(my_df)],
'total_ppl_who_like_red': [my_df.favorite_color.eq('red').sum()]
})
desired_output_via_building_new_df
#> total_people total_ppl_who_like_red
#> 0 4 2
However, I'm looking for a way to use "method chaining"; starting with my_df and working my way forward. Something along the lines of
# pseudo-code; not really working
my_df.agg({
'total_people': lambda x: len(x),
'total_ppl_who_like_red': lambda x: x.favorite_color.eq('red').sum()
})
I can only borrow inspiration from R/dplyr
code:
library(dplyr, warn.conflicts = FALSE)
my_df <-
data.frame(name = c("john", "diana", "rachel", "chris"),
favorite_color = c("red", "blue", "green", "red")
)
my_df |>
summarise(total_people = n(), ## in the context of `summarise()`,
total_ppl_who_like_red = sum(favorite_color == "red")) ## both `n()` and `sum()` refer to `my_df` because we start with `my_df` and pipe it "forward" to `summarise()`
#> total_people total_ppl_who_like_red
#> 1 4 2
CodePudding user response:
Solution for processing one Series
:
df = my_df.favorite_color.apply({
'total_people': lambda x: x.count(),
'total_ppl_who_like_red': lambda x: x.eq('red').sum()
}).to_frame(name=0).T
print (df)
total_people total_ppl_who_like_red
0 4 2
General solution for processing DataFrame
with DataFrame.pipe
- then pandas processing input DataFrame, if use apply
or agg
processing columns separately:
df = (my_df.pipe(lambda x: pd.Series({'total_people': len(x),
'total_ppl_who_like_red':
x.favorite_color.eq('red').sum()}))
.to_frame(name=0).T)
print (df)
total_people total_ppl_who_like_red
0 4 2
df = my_df2.pipe(lambda x: pd.Series({'total_people': len(x),
'total_ppl_who_like_red':
x.favorite_color.eq('red').sum(),
'max_age':x.age.max()
}).to_frame(name=0).T)
print (df)
total_people total_ppl_who_like_red max_age
0 4 2 41