One obvious (and inefficient) approach is to collect group ids and iterate over them:
from pyspark.sql import functions as F

vals = df.groupBy('SomeField').agg(F.count("*").alias("Count")).collect()
for val in vals:
    group_df = df.where(df.SomeField == val.SomeField)
    do_something(group_df)
Is there any better way to do this?
CodePudding user response:
pd.pivot_table is probably what you're looking for:
pd.pivot_table(df, index=[some_fields], values=[other_fields], aggfunc=np.mean)
or some other aggregation function in place of np.mean.
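Note that pd.pivot_table operates on a pandas DataFrame, so if df is a Spark DataFrame it would first have to be brought to the driver (e.g. via toPandas(), which is only viable when the data fits in memory). A minimal, runnable sketch, where the column names SomeField and Value are hypothetical stand-ins for your own columns:

import numpy as np
import pandas as pd

# Hypothetical example data; SomeField and Value stand in for your real columns.
pdf = pd.DataFrame({
    "SomeField": ["a", "a", "b", "b", "b"],
    "Value": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# One row per distinct SomeField, holding the mean of Value for that group.
result = pd.pivot_table(pdf, index=["SomeField"], values=["Value"], aggfunc=np.mean)
print(result)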
CodePudding user response:
Depending on what you are doing within do_something, there are different ways of answering this. One straightforward answer would be to make do_something a User Defined Function (UDF). You could then have the following:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# assuming do_something returns an integer
udf_do_something = udf(do_something, IntegerType())

df.groupBy('SomeField').agg(udf_do_something(some_param_1, some_param_2))
some_param_1 and some_param_2 depend on whatever parameters do_something needs to perform its task. For instance, if you want to process the list of values from a given column for each group, you could pass in collect_list, or even combine it with additional Spark SQL functions such as when.
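To make that concrete, here is a minimal self-contained sketch (the column names SomeField and Value, and the summing logic inside do_something, are hypothetical placeholders, not from the question): the UDF is applied to the result of collect_list inside agg, so each group's values reach do_something as a plain Python list.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

# Hypothetical data; SomeField is the grouping key, Value the per-row payload.
df = spark.createDataFrame(
    [("a", 1), ("a", 3), ("b", 2), ("b", 4)],
    ["SomeField", "Value"],
)

def do_something(values):
    # Placeholder group-level logic: here, just sum the collected values.
    return sum(values)

udf_do_something = udf(do_something, IntegerType())

# collect_list gathers each group's Value column into an array,
# which is then passed to the UDF as a single parameter.
result = df.groupBy("SomeField").agg(
    udf_do_something(F.collect_list("Value")).alias("Result")
)
result.show()

The UDF itself is not an aggregate function; collect_list does the grouping work, and the UDF then post-processes each group's list.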