I am confused about why a pandas groupby call can be written in both of the ways below and yield the same result. The specific code is not really the question; both give the same result. I would like someone to break down the syntax of both.
df.groupby(['gender'])['age'].mean()
df.groupby(['gender']).mean()['age']
In the first instance, it reads as if you are calling the .mean() function on the age column specifically. The second appears like you are calling .mean() on the whole groupby object and selecting the age column after? Are there runtime considerations?
CodePudding user response:
It reads as if you are calling the .mean() function on the age column specifically. The second appears like you are calling .mean() on the whole groupby object and selecting the age column after?
This is exactly what's happening. df.groupby() returns a DataFrameGroupBy object rather than a plain dataframe. Its .mean() method is applied column-wise by default, so the mean of each column is calculated within each group, independently of the other columns, and the results are returned as a DataFrame with one row per group. Indexing that result with ['age'] then gives a Series.
Reversing the order selects the single column first (a SeriesGroupBy) and then calculates the mean, which again yields a Series. If you know you only want the mean for a single column, it will generally be faster to isolate that column first, rather than calculate the mean for every column (especially if you have a very large dataframe).
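To make the intermediate objects concrete, here is a minimal sketch. The sample data and the extra height column are made up for illustration; only gender and age come from the question.

```python
import pandas as pd

# Hypothetical sample data; "height" is an invented extra numeric column.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "age": [30, 40, 50, 20],
    "height": [160.0, 180.0, 170.0, 175.0],
})

grouped = df.groupby(["gender"])
print(type(grouped).__name__)         # DataFrameGroupBy
print(type(grouped["age"]).__name__)  # SeriesGroupBy

first = grouped["age"].mean()   # aggregates only the age column
all_means = grouped.mean()      # aggregates every non-key column
second = all_means["age"]       # then picks age out of the result

print(type(all_means).__name__)  # DataFrame
print(first.equals(second))      # True
```

Both paths end in a Series indexed by gender; the second one simply computes the height means as well and then discards them.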
CodePudding user response:
Think of groupby as a row-separation function. It groups all rows having the same attributes (as specified in the by parameter) into separate data frames.
After the groupby, you need an aggregate function to summarize the data in each subframe. You can do that in a number of ways:
# In each subframe, take the `age` column and summarize it
# using the `mean` function
df.groupby(['gender'])['age'].mean()
# In each subframe, apply the `mean` function to all numeric
# columns then extract the `age` column
df.groupby(['gender']).mean()['age']
The first method is more efficient since you are applying the aggregate function (mean) on a single column.
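Besides speed, there is a practical difference when the frame also holds non-numeric columns. As far as I know, recent pandas versions (2.0 and later) no longer silently drop such columns in the second form, so it can raise an error unless numeric_only=True is passed. A small sketch, with an invented name column for illustration:

```python
import pandas as pd

# Hypothetical data with a non-numeric column ("name") alongside "age".
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "name": ["Ann", "Bob", "Cat", "Dan"],
    "age": [30, 40, 50, 20],
})

# Selecting the column first never touches "name".
by_column = df.groupby(["gender"])["age"].mean()

# Aggregating the whole frame has to deal with "name" as well;
# numeric_only=True asks pandas to skip non-numeric columns.
by_frame = df.groupby(["gender"]).mean(numeric_only=True)["age"]

print(by_column.equals(by_frame))
```

Selecting the column before aggregating sidesteps the issue entirely, which is another reason to prefer the first form when only one column is needed.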