One obvious (and inefficient) approach is to collect group ids and iterate over them:
from pyspark.sql import functions as F

vals = df.groupBy('SomeField').agg(F.count("*").alias("Count")).collect()
for val in vals:
    group_df = df.where(df.SomeField == val.SomeField)
    do_something(group_df)
Is there any better way to do this?
CodePudding user response:
pd.pivot_table is probably what you're looking for:
pd.pivot_table(df, index=[some_fields], values=[other_fields], aggfunc=np.mean)
or some other aggregation function in place of np.mean.
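Note that pd.pivot_table operates on a pandas DataFrame, so if df is a Spark DataFrame it would first have to be brought to the driver (e.g. via toPandas(), which is only viable when the data fits in memory). A minimal, runnable sketch, where the column names SomeField and Value are hypothetical stand-ins for your own columns:

import numpy as np
import pandas as pd

# Hypothetical example data; SomeField and Value stand in for your real columns.
pdf = pd.DataFrame({
    "SomeField": ["a", "a", "b", "b", "b"],
    "Value": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# One row per distinct SomeField, holding the mean of Value for that group.
result = pd.pivot_table(pdf, index=["SomeField"], values=["Value"], aggfunc=np.mean)
print(result)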
CodePudding user response:
Depending on what you are doing within do_something, there are different ways of answering this. One straightforward answer would be to make do_something a User Defined Function (UDF). You could then have the following:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# assuming do_something returns an integer
udf_do_something = udf(do_something, IntegerType())

df.groupBy('SomeField').agg(udf_do_something(some_param_1, some_param_2))
some_param_1 and some_param_2 depend on whatever parameters do_something needs to perform its task. For instance, if you want to process the list of values from a given column for each group, you could pass in collect_list, or even combine it with additional Spark SQL functions such as when.
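To make that concrete, here is a minimal self-contained sketch (the column names SomeField and Value, and the summing logic inside do_something, are hypothetical placeholders, not from the question): the UDF is applied to the result of collect_list inside agg, so each group's values reach do_something as a plain Python list.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

# Hypothetical data; SomeField is the grouping key, Value the per-row payload.
df = spark.createDataFrame(
    [("a", 1), ("a", 3), ("b", 2), ("b", 4)],
    ["SomeField", "Value"],
)

def do_something(values):
    # Placeholder group-level logic: here, just sum the collected values.
    return sum(values)

udf_do_something = udf(do_something, IntegerType())

# collect_list gathers each group's Value column into an array,
# which is then passed to the UDF as a single parameter.
result = df.groupBy("SomeField").agg(
    udf_do_something(F.collect_list("Value")).alias("Result")
)
result.show()

The UDF itself is not an aggregate function; collect_list does the grouping work, and the UDF then post-processes each group's list.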