How to get the mean for multiple groups at once in PySpark


Let's say I have the following dataframe:

age_group    occupation    sex    country    debt
31_40        attorney      M      USA        10000
41_50        doctor        F      Mexico     2000
21_30        dentist       F      Canada     7000
51_60        engineer      M      Hungary    9000
61_70        driver        M      Egypt      23000

Considering it contains millions of rows, how can I get the mean debt for every value of each of these columns, all in a single dataframe, so it would return something like this:

group       value     debt_mean
country     egypt     12500
country     usa       25000
age_group   21_30     5000
sex         f         15000
sex         m         15000
occupation  driver    5200

CodePudding user response:

Here's one way to do it in Spark:

Setup:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('31_40', 'attorney', 'M', 'USA', 10000),
     ('41_50', 'doctor', 'F', 'Mexico', 2000),
     ('21_30', 'dentist', 'F', 'Canada', 7000),
     ('51_60', 'engineer', 'M', 'Hungary', 9000),
     ('61_70', 'driver', 'M', 'Egypt', 23000)],
    ['age_group', 'occupation', 'sex', 'country', 'debt']
)

Script:

grp_cols = ['age_group', 'occupation', 'sex', 'country']

df = (df
    # cube() aggregates over every combination of the grouping columns
    .cube(grp_cols)
    .agg(F.avg('debt').alias('debt_mean'))
    # keep only rows where exactly one grouping column is non-null,
    # i.e. the aggregates computed over a single column
    .filter(F.aggregate(F.array(*grp_cols), F.lit(0), lambda acc, x: acc + x.isNotNull().cast('int')) == 1)
    # struct -> JSON -> map: recovers the grouping column's name and value
    .withColumn('map', F.from_json(F.to_json(F.struct(*grp_cols)), 'map<string,string>'))
    .select(
        F.map_keys('map')[0].alias('group'),
        F.map_values('map')[0].alias('value'),
        'debt_mean'
    )
)
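
The from_json(to_json(...)) step works because to_json skips null fields by default, so after the filter each struct serializes to a JSON object with exactly one key: the surviving grouping column. A tiny standalone illustration, using a throwaway one-row dataframe of my own:

demo = spark.createDataFrame(
    [('21_30', None, None, None)],
    'age_group string, occupation string, sex string, country string'
)
demo.select(
    F.to_json(F.struct('age_group', 'occupation', 'sex', 'country')).alias('j')
).show(truncate=False)
# +---------------------+
# |j                    |
# +---------------------+
# |{"age_group":"21_30"}|
# +---------------------+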

Result:

df.show()
# +----------+--------+---------+
# |     group|   value|debt_mean|
# +----------+--------+---------+
# |       sex|       F|   4500.0|
# |occupation|attorney|  10000.0|
# | age_group|   41_50|   2000.0|
# |       sex|       M|  14000.0|
# |occupation|  doctor|   2000.0|
# |   country|  Mexico|   2000.0|
# | age_group|   31_40|  10000.0|
# |   country|     USA|  10000.0|
# |occupation| dentist|   7000.0|
# |   country|  Canada|   7000.0|
# | age_group|   61_70|  23000.0|
# | age_group|   51_60|   9000.0|
# |   country|   Egypt|  23000.0|
# |   country| Hungary|   9000.0|
# |occupation|  driver|  23000.0|
# |occupation|engineer|   9000.0|
# | age_group|   21_30|   7000.0|
# +----------+--------+---------+

The numbers don't match your example output, so I hope the example was there just for the structure, not the actual data.

A way to improve this would be to go into SQL and use grouping sets, so Spark computes only the four groupings you need instead of the full cube. It gets a bit dirty, though, so I'll leave it to you if you really need that extra efficiency; a sketch follows below.
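
For reference, a minimal sketch of the GROUPING SETS variant, assuming the dataframe is registered as a temp view (the view name people is my own choice):

df.createOrReplaceTempView('people')  # view name is arbitrary

means = spark.sql("""
    SELECT age_group, occupation, sex, country, avg(debt) AS debt_mean
    FROM people
    GROUP BY age_group, occupation, sex, country
    GROUPING SETS ((age_group), (occupation), (sex), (country))
""")
# Only the four single-column groupings are computed, so the isNotNull
# filter above becomes unnecessary; the same struct -> JSON -> map trick
# can then reshape the result into group/value columns.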

CodePudding user response:

Use a list comprehension to create a list of mean dataframes, one per grouping column. Then use reduce to union the dataframes in that list.

Long hand:

from functools import reduce
from pyspark.sql.functions import mean

# one aggregated dataframe per grouping column
meanDF_list = [df.groupby(x).agg(mean('debt').alias('mean')).toDF('Group', 'mean')
               for x in df.drop('debt').columns]

reduce(lambda a, b: b.unionByName(a), meanDF_list).show()

Chained:

reduce(lambda a, b: b.unionByName(a), [df.groupby(x).agg(mean('debt').alias('mean')).toDF('Group', 'mean') for x in df.drop('debt').columns]).show()

+--------+-------+
|   Group|   mean|
+--------+-------+
|     USA|10000.0|
|  Mexico| 2000.0|
|  Canada| 7000.0|
| Hungary| 9000.0|
|   Egypt|23000.0|
|       M|14000.0|
|       F| 4500.0|
|attorney|10000.0|
|  doctor| 2000.0|
| dentist| 7000.0|
|engineer| 9000.0|
|  driver|23000.0|
|   31_40|10000.0|
|   41_50| 2000.0|
|   21_30| 7000.0|
|   51_60| 9000.0|
|   61_70|23000.0|
+--------+-------+
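
Note that this drops the group column (the name of the grouping column) from the asker's expected output. If you want to keep it, tag each piece before the union. A minimal sketch, where the lit(x) tag and the aliases are my own additions:

from functools import reduce
from pyspark.sql import functions as F

# tag each per-column aggregate with the column it was grouped by,
# so the union remembers which group each value belongs to
mean_dfs = [
    df.groupby(x)
      .agg(F.avg('debt').alias('debt_mean'))
      .select(F.lit(x).alias('group'), F.col(x).alias('value'), 'debt_mean')
    for x in df.drop('debt').columns
]

reduce(lambda a, b: a.unionByName(b), mean_dfs).show()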