How to get the mean for multiple groups at once in PySpark


Let's say I have the following dataframe:

age_group    occupation    sex    country    debt
31_40        attorney      M      USA        10000
41_50        doctor        F      Mexico     2000
21_30        dentist       F      Canada     7000
51_60        engineer      M      Hungary    9000
61_70        driver        M      Egypt      23000

Considering it contains millions of rows, how can I get the mean debt for every value of each of these columns, all in a single dataframe, so it would return something like this:

group       value     debt_mean
country     egypt     12500
country     usa       25000
age_group   21_30     5000
sex         f         15000
sex         m         15000
occupation  driver    5200

CodePudding user response:

Here's one way to do it in Spark:

Setup:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('31_40', 'attorney', 'M', 'USA', 10000),
     ('41_50', 'doctor', 'F', 'Mexico', 2000),
     ('21_30', 'dentist', 'F', 'Canada', 7000),
     ('51_60', 'engineer', 'M', 'Hungary', 9000),
     ('61_70', 'driver', 'M', 'Egypt', 23000)],
    ['age_group', 'occupation', 'sex', 'country', 'debt']
)

Script:

grp_cols = ['age_group', 'occupation', 'sex', 'country']

df = (df
    # cube() aggregates over every combination of the grouping columns
    .cube(grp_cols)
    .agg(F.avg('debt').alias('debt_mean'))
    # keep only rows where exactly one grouping column is non-null,
    # i.e. the aggregates computed over a single column
    .filter(F.aggregate(F.array(*grp_cols), F.lit(0), lambda acc, x: acc + x.isNotNull().cast('int')) == 1)
    # struct -> JSON -> map: recovers the grouping column's name and value
    .withColumn('map', F.from_json(F.to_json(F.struct(*grp_cols)), 'map<string,string>'))
    .select(
        F.map_keys('map')[0].alias('group'),
        F.map_values('map')[0].alias('value'),
        'debt_mean'
    )
)
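
The from_json(to_json(...)) step works because to_json skips null fields by default, so after the filter each struct serializes to a JSON object with exactly one key: the surviving grouping column. A tiny standalone illustration, using a throwaway one-row dataframe of my own:

demo = spark.createDataFrame(
    [('21_30', None, None, None)],
    'age_group string, occupation string, sex string, country string'
)
demo.select(
    F.to_json(F.struct('age_group', 'occupation', 'sex', 'country')).alias('j')
).show(truncate=False)
# +---------------------+
# |j                    |
# +---------------------+
# |{"age_group":"21_30"}|
# +---------------------+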

Result:

df.show()
# +----------+--------+---------+
# |     group|   value|debt_mean|
# +----------+--------+---------+
# |       sex|       F|   4500.0|
# |occupation|attorney|  10000.0|
# | age_group|   41_50|   2000.0|
# |       sex|       M|  14000.0|
# |occupation|  doctor|   2000.0|
# |   country|  Mexico|   2000.0|
# | age_group|   31_40|  10000.0|
# |   country|     USA|  10000.0|
# |occupation| dentist|   7000.0|
# |   country|  Canada|   7000.0|
# | age_group|   61_70|  23000.0|
# | age_group|   51_60|   9000.0|
# |   country|   Egypt|  23000.0|
# |   country| Hungary|   9000.0|
# |occupation|  driver|  23000.0|
# |occupation|engineer|   9000.0|
# | age_group|   21_30|   7000.0|
# +----------+--------+---------+

The numbers don't match your example output, so I hope the example was there just for the structure, not the actual data.

A way to improve this would be to go into SQL and use grouping sets, so Spark computes only the four groupings you need instead of the full cube. It gets a bit dirty, though, so I'll leave it to you if you really need that extra efficiency; a sketch follows below.
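
For reference, a minimal sketch of the GROUPING SETS variant, assuming the dataframe is registered as a temp view (the view name people is my own choice):

df.createOrReplaceTempView('people')  # view name is arbitrary

means = spark.sql("""
    SELECT age_group, occupation, sex, country, avg(debt) AS debt_mean
    FROM people
    GROUP BY age_group, occupation, sex, country
    GROUPING SETS ((age_group), (occupation), (sex), (country))
""")
# Only the four single-column groupings are computed, so the isNotNull
# filter above becomes unnecessary; the same struct -> JSON -> map trick
# can then reshape the result into group/value columns.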

CodePudding user response:

Use a list comprehension to create a list of mean dataframes, one per grouping column. Then use reduce to union the dataframes in that list.

Long hand:

from functools import reduce
from pyspark.sql.functions import mean

# one aggregated dataframe per grouping column
meanDF_list = [df.groupby(x).agg(mean('debt').alias('mean')).toDF('Group', 'mean')
               for x in df.drop('debt').columns]

reduce(lambda a, b: b.unionByName(a), meanDF_list).show()

Chained:

reduce(lambda a, b: b.unionByName(a), [df.groupby(x).agg(mean('debt').alias('mean')).toDF('Group', 'mean') for x in df.drop('debt').columns]).show()

+--------+-------+
|   Group|   mean|
+--------+-------+
|     USA|10000.0|
|  Mexico| 2000.0|
|  Canada| 7000.0|
| Hungary| 9000.0|
|   Egypt|23000.0|
|       M|14000.0|
|       F| 4500.0|
|attorney|10000.0|
|  doctor| 2000.0|
| dentist| 7000.0|
|engineer| 9000.0|
|  driver|23000.0|
|   31_40|10000.0|
|   41_50| 2000.0|
|   21_30| 7000.0|
|   51_60| 9000.0|
|   61_70|23000.0|
+--------+-------+
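
Note that this drops the group column (the name of the grouping column) from the asker's expected output. If you want to keep it, tag each piece before the union. A minimal sketch, where the lit(x) tag and the aliases are my own additions:

from functools import reduce
from pyspark.sql import functions as F

# tag each per-column aggregate with the column it was grouped by,
# so the union remembers which group each value belongs to
mean_dfs = [
    df.groupby(x)
      .agg(F.avg('debt').alias('debt_mean'))
      .select(F.lit(x).alias('group'), F.col(x).alias('value'), 'debt_mean')
    for x in df.drop('debt').columns
]

reduce(lambda a, b: a.unionByName(b), mean_dfs).show()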