I was recently practicing some Python and hit a roadblock: I couldn't get my agg() call to work. I later found out that it was because I wasn't supposed to call the functions I passed to it.
My question here is: could somebody please explain what exactly we are doing when we write () at the end of a function name, and what the difference is between doing it and not doing it?
EDIT: THIS CODE IS EXAMPLE CODE, I'M NOT LOOKING FOR AN ANSWER ON THIS CODE. I'M LOOKING FOR AN ANSWER ON THE CONCEPT OF CALLING OR NOT CALLING A FUNCTION AND HOW THAT WORKS.
What I was using which returned error: 'no a specified' (no argument)
sales_stats = sales.groupby('type')['weekly_sales'].agg([np.min(),np.max(),np.median(),np.mean()])
Correct code:
# For each store type, aggregate weekly_sales: get min, max, mean, and median
sales_stats = sales.groupby('type')['weekly_sales'].agg([np.min,np.max,np.median,np.mean])
Answer:
In
sales.groupby('type')['weekly_sales'].agg([np.min,...])
sales is a pandas DataFrame, and groupby('type') is a method call that returns a GroupBy object, which in turn has an agg method.
Looking up its docs, the first argument of agg is
func : function, string, dictionary, or list of string/functions
In Python, functions are 'first class objects', that is, they can be passed as arguments just like numbers and lists, and can be put in a list as well.
np.max is a function (in the numpy module), and [np.max, np.min] is a list of functions.
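To make that concrete, here is a minimal sketch. It uses the built-in min and max in place of np.min and np.max (an assumption, made only so the snippet runs without numpy), and the names funcs and data are just for illustration.

```python
# Built-ins stand in for np.min / np.max so the example needs no imports.
funcs = [min, max]          # a list of function objects; nothing is called here

data = [3, 1, 2]

# Each element is still a function; it is only called when we add (...).
results = [f(data) for f in funcs]

print(funcs[0] is min)      # True: the list holds the function itself
print(callable(funcs[0]))   # True
print(results)              # [1, 3]
```

The list is built without running either function; the calls happen only inside the list comprehension.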
Without parentheses, np.max is the function itself:
In [2]: np.max
Out[2]: <function numpy.amax(a, axis=None, out=None, keepdims=<no value>, initial=<no value>, where=<no value>)>
np.max(...) is a call of the function, and it produces something else, not the function itself. In this case it returns a number:
In [3]: np.max(np.array([1,2,3]))
Out[3]: 3
agg wants the function, not the number. agg will take care of calling np.max with arrays (or lists or Series) from each group.
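As a rough picture of what "agg calls the function for you" means, here is a toy stand-in. toy_agg and the sample groups are hypothetical, and this is not how pandas implements agg; it only shows a caller receiving uncalled functions and supplying the arguments itself.

```python
def toy_agg(groups, funcs):
    """Hypothetical stand-in for agg: call every function on every group."""
    return {
        key: {f.__name__: f(values) for f in funcs}
        for key, values in groups.items()
    }

# Hypothetical data: weekly sales grouped by store type.
groups = {"A": [10, 30, 20], "B": [5, 15]}

# We hand over the functions uncalled; toy_agg supplies the arguments.
stats = toy_agg(groups, [min, max])
print(stats)  # {'A': {'min': 10, 'max': 30}, 'B': {'min': 5, 'max': 15}}
```

If we wrote toy_agg(groups, [min(), max()]) instead, Python would try to call min() with no data before toy_agg ever ran.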
Note that just adding () to a function may not do anything useful. It may even raise an error.
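A small sketch of both cases, again using the built-in max rather than np.max (an assumption made to keep the snippet free of imports):

```python
# Referring to the function by name is always fine.
f = max

# Calling it with no arguments is an error: max needs something to inspect.
try:
    max()
except TypeError as exc:
    error = exc
    print("TypeError:", exc)

# With an argument, the call returns a value, not a function.
value = f([1, 2, 3])
print(value)            # 3
print(callable(value))  # False
```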
So your question is in part basic Python: the difference between a function and calling the function. But it is also a pandas and numpy question, and as such it requires reading the respective function/method documentation.
Note that the agg docs specify what the function itself must accept.
Take the sample frame df from the agg docs. The docs show providing agg with a string:
In [9]: df.groupby('A').agg('min')
Out[9]:
   B         C
A
1  1 -1.589447
2  3 -0.997238
agg recognizes a specific set of strings, which it maps to the corresponding functions. Equivalently, we can pass a function:
In [10]: df.groupby('A').agg(np.min)
Out[10]:
   B         C
A
1  1 -1.589447
2  3 -0.997238
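One way to picture why the string and the function give the same result is a lookup table from names to functions. The sketch below is a toy model only, not pandas' actual dispatch; KNOWN_NAMES and resolve are hypothetical names.

```python
import statistics

# Hypothetical lookup table; pandas' real handling of strings is more involved.
KNOWN_NAMES = {"min": min, "max": max, "mean": statistics.mean}

def resolve(func):
    """Return a callable for either a recognized name or a function."""
    if isinstance(func, str):
        return KNOWN_NAMES[func]   # raises KeyError for unknown names
    return func

values = [4, 1, 3]
by_name = resolve("min")(values)       # string looked up, then called
by_function = resolve(min)(values)     # function passed through, then called
print(by_name, by_function)            # 1 1
```

In both cases the caller ends up holding a function and decides when to call it.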
But when we use np.min() as you did, we get an error:
In [11]: df.groupby('A').agg(np.min())
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [11], in <cell line: 1>()
----> 1 df.groupby('A').agg(np.min())
File <__array_function__ internals>:4, in amin(*args, **kwargs)
TypeError: _amin_dispatcher() missing 1 required positional argument: 'a'
You summarized the error as "returned error: 'no a specified' (no argument)". It is not a good idea to do that on SO. You should read the error in full, and show it in full. The traceback tells us that the problem is with the np.min() step: Python evaluates np.min() first, to build the argument, and that call fails. It never got as far as calling agg.
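The same ordering can be seen in plain Python. In this sketch, outer and needs_argument are hypothetical stand-ins for agg and np.min; the calls list records which functions actually ran.

```python
calls = []

def outer(func):
    """Hypothetical stand-in for agg; records that it was entered."""
    calls.append("outer")
    return func([1, 2, 3])

def needs_argument(a):
    """Hypothetical stand-in for np.min; requires one argument."""
    calls.append("needs_argument")
    return min(a)

try:
    # The argument expression needs_argument() is evaluated first and fails,
    # so outer is never entered.
    outer(needs_argument())
except TypeError as exc:
    print("TypeError:", exc)

print(calls)                   # []
print(outer(needs_argument))   # 1
print(calls)                   # ['outer', 'needs_argument']
```

The empty list after the failed attempt shows that neither function body ran; the error came from building the argument.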
Read the traceback.
Read the docs.