Let's say I have the following dataframe:
age_group  occupation  sex  country  debt
31_40      attorney    M    USA      10000
41_50      doctor      F    Mexico   2000
21_30      dentist     F    Canada   7000
51_60      engineer    M    Hungary  9000
61_70      driver      M    Egypt    23000
Considering it contains millions of rows, how can I get the mean debt for each specific group in a single dataframe, so that it returns something like this:
group       value   debt_mean
country     egypt   12500
country     usa     25000
age_group   21_30   5000
sex         f       15000
sex         m       15000
occupation  driver  5200
Answer 1:
Here's how it can be done in Spark:
Setup:
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('31_40', 'attorney', 'M', 'USA', 10000),
     ('41_50', 'doctor', 'F', 'Mexico', 2000),
     ('21_30', 'dentist', 'F', 'Canada', 7000),
     ('51_60', 'engineer', 'M', 'Hungary', 9000),
     ('61_70', 'driver', 'M', 'Egypt', 23000)],
    ['age_group', 'occupation', 'sex', 'country', 'debt']
)
Script:
grp_cols = ['age_group', 'occupation', 'sex', 'country']
df = (df
    .cube(grp_cols)
    .agg(F.avg('debt').alias('debt_mean'))
    # keep only the cube rows where exactly one grouping column is non-null
    .filter(F.aggregate(F.array(*grp_cols), F.lit(0), lambda acc, x: acc + x.isNotNull().cast('int')) == 1)
    # to_json drops null fields, so the map holds the single remaining
    # (column name, value) pair for each row
    .withColumn('map', F.from_json(F.to_json(F.struct(*grp_cols)), 'map<string,string>'))
    .select(
        F.map_keys('map')[0].alias('group'),
        F.map_values('map')[0].alias('value'),
        'debt_mean'
    )
)
Result:
df.show()
# +----------+--------+---------+
# |     group|   value|debt_mean|
# +----------+--------+---------+
# |       sex|       F|   4500.0|
# |occupation|attorney|  10000.0|
# | age_group|   41_50|   2000.0|
# |       sex|       M|  14000.0|
# |occupation|  doctor|   2000.0|
# |   country|  Mexico|   2000.0|
# | age_group|   31_40|  10000.0|
# |   country|     USA|  10000.0|
# |occupation| dentist|   7000.0|
# |   country|  Canada|   7000.0|
# | age_group|   61_70|  23000.0|
# | age_group|   51_60|   9000.0|
# |   country|   Egypt|  23000.0|
# |   country| Hungary|   9000.0|
# |occupation|  driver|  23000.0|
# |occupation|engineer|   9000.0|
# | age_group|   21_30|   7000.0|
# +----------+--------+---------+
This does not match the numbers in your example, so I hope the example was just for structure, not actual data.
A way to improve this would be to switch to SQL and use GROUPING SETS, which computes only the four groupings you need instead of all 16 combinations that cube produces before the filter. It gets a bit messy, so I'll leave that to you if you really need the extra efficiency; a rough sketch follows.
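A minimal sketch of that GROUPING SETS route, assuming the same df as above (the temp view name debts is my choice, not from the original answer):
# GROUPING SETS computes exactly the four per-column groupings,
# so no filter step is needed afterwards
df.createOrReplaceTempView('debts')
means = spark.sql("""
    SELECT age_group, occupation, sex, country, AVG(debt) AS debt_mean
    FROM debts
    GROUP BY GROUPING SETS ((age_group), (occupation), (sex), (country))
""")
Each resulting row has exactly one non-null grouping column, so the same to_json/from_json unpivot from the script above applies unchanged.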
Answer 2:
Use a list comprehension to create a list of per-column mean DataFrames. Then use reduce to union the DataFrames in that list.
Long hand:
from functools import reduce
from pyspark.sql.functions import mean

# one aggregated DataFrame per grouping column
meanDF_list = [df.groupby(x).agg(mean('debt').alias('mean')).toDF('Group', 'mean')
               for x in df.drop('debt').columns]
reduce(lambda a, b: b.unionByName(a), meanDF_list).show()
Chained:
reduce(lambda a, b: b.unionByName(a),
       [df.groupby(x).agg(mean('debt').alias('mean')).toDF('Group', 'mean')
        for x in df.drop('debt').columns]).show()
+--------+-------+
|   Group|   mean|
+--------+-------+
|     USA|10000.0|
|  Mexico| 2000.0|
|  Canada| 7000.0|
| Hungary| 9000.0|
|   Egypt|23000.0|
|       M|14000.0|
|       F| 4500.0|
|attorney|10000.0|
|  doctor| 2000.0|
| dentist| 7000.0|
|engineer| 9000.0|
|  driver|23000.0|
|   31_40|10000.0|
|   41_50| 2000.0|
|   21_30| 7000.0|
|   51_60| 9000.0|
|   61_70|23000.0|
+--------+-------+
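If you also need the source column name in the output, as in the question's expected (group, value, debt_mean) shape, a small variation is to tag each aggregate with a literal before the union. A sketch under the same setup (the 'group' and 'value' column names here are my additions):
from functools import reduce
from pyspark.sql import functions as F

meanDF_list = [
    df.groupBy(F.col(c).alias('value'))
      .agg(F.avg('debt').alias('debt_mean'))
      .select(F.lit(c).alias('group'), 'value', 'debt_mean')  # record which column this row came from
    for c in df.drop('debt').columns
]
reduce(lambda a, b: a.unionByName(b), meanDF_list).show()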