I'm trying to use pyspark's applyInPandas in my Python code. The problem is that the function I want to pass to it lives in the same class, so it is defined as def func(self, key, df). This becomes an issue because applyInPandas errors out, saying I'm passing too many arguments to the underlying function (it allows at most the key and df parameters, so the self is causing the problem). Is there any way around this?
The underlying goal is to process a pandas function on dataframe groups in parallel.
CodePudding user response:
As OP mentioned, one way is to just use @staticmethod, which may not be desirable in some cases (for example, when the grouped function needs instance state); a sketch of that variant follows.
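Assuming the per-group logic does not need anything from self, a minimal sketch of the @staticmethod route (the column names and schema are illustrative):
import pandas as pd

class A:
    @staticmethod
    def func(key, pdf: pd.DataFrame) -> pd.DataFrame:
        # no 'self' here, so the argspec is just (key, pdf)
        return pdf

    def x(self, sdf):
        # sdf is a Spark DataFrame; schema describes func's output
        return sdf.groupby("id").applyInPandas(A.func, schema="id long, value double")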
The pyspark source code for creating the pandas_udf uses inspect.getfullargspec().args (lines 386 and 436); this list includes self even when the class method is called from an instance. I would think this is a bug on their part (it may be worth raising a ticket).
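A quick standalone check makes the problem visible (plain Python, no pyspark involved):
import inspect

class B:
    def func(self, key, df):
        ...

# getfullargspec still reports the bound 'self' parameter for a bound method,
# so pyspark counts three arguments instead of the two it allows.
print(inspect.getfullargspec(B().func).args)  # ['self', 'key', 'df']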
To overcome this, the easiest way is to use functools.partial, which changes the argspec: it removes the self argument and brings the number of arguments back to 2.
This is based on the fact that calling an instance method is the same as calling the method on the class and supplying the instance as the first argument (because of the descriptor magic):
A.func(A(), *args, **kwargs) == A().func(*args, **kwargs)
In a concrete example:
import functools
import inspect

class A:
    def __init__(self, y):
        self.y = y

    def sum(self, a: int, b: int):
        return (a + b) * self.y

    def x(self):
        # call the method via the class and supply 'self' as the first argument
        f = functools.partial(A.sum, self)
        print(f(1, 2))
        print(inspect.getfullargspec(f).args)

A(2).x()
This will print
6 # can still use 'self.y'
['a', 'b'] # 2 arguments (without 'self')
Then, in OP's case, one can simply do the same so that only the key and df parameters remain (applyInPandas also needs the output schema, elided here):
class A:
    def __init__(self):
        ...

    def func(self, key, df):
        ...

    def x(self):
        f = functools.partial(A.func, self)
        self.df.groupby(...).applyInPandas(f, schema=...)
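For completeness, a minimal end-to-end sketch of the same pattern; the Processor class, column names, and schema below are made up for illustration:
import functools
import pandas as pd
from pyspark.sql import SparkSession

class Processor:
    def __init__(self, offset):
        self.offset = offset

    def func(self, key, pdf: pd.DataFrame) -> pd.DataFrame:
        # per-group pandas logic; instance state is available through the bound 'self'
        pdf["value"] = pdf["value"] + self.offset
        return pdf

    def run(self, sdf):
        # binding 'self' leaves an argspec of (key, pdf), which applyInPandas accepts
        f = functools.partial(Processor.func, self)
        return sdf.groupby("id").applyInPandas(f, schema="id long, value double")

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ["id", "value"])
Processor(10.0).run(sdf).show()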