Pyspark groupby column and divide by max value


I have a DataFrame that looks like this:

columns = ['id', 'department', 'score']
vals = [
    (1, 'AB', 141),
    (2, 'AB', 140),
    (3, 'AB', 210),
    (4, 'AB', 120),
    (5, 'EF', 20),
    (6, 'EF', 15)
]
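
which I load into a Spark DataFrame (a minimal sketch, assuming an active SparkSession named spark):

# create the DataFrame from the sample rows above
df = spark.createDataFrame(vals, columns)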

I'd like to find the max score within each department group and divide every score in that group by it. For example, in the above case:

max_val for AB is 210
max_val for EF is 20

The new dataset should be:

    (1, 'AB', 0.67),
    (2, 'AB', 0.67),
    (3, 'AB', 1.00),
    (4, 'AB', 0.57),
    (5, 'EF', 1.00),
    (6, 'EF', 0.75)

So far I've tried:

>>> max_distance = df.groupby("department").agg({"score": "max"}).collect()
>>> max_distance
[Row(department='AB', max(score)=210.0), Row(department='EF', max(score)=20.0)]

But how do I divide every score by its group's max value across the entire DataFrame?

CodePudding user response:

You should use the max window function, partitioned by department.

import pyspark.sql.functions as func
from pyspark.sql.window import Window as wd

# data_sdf is the input DataFrame (df from the question)
data_sdf. \
    withColumn('maxval', func.max('score').over(wd.partitionBy('department'))). \
    withColumn('val_perc', func.col('score') / func.col('maxval')). \
    show()

# +---+----------+-----+------+------------------+
# | id|department|score|maxval|          val_perc|
# +---+----------+-----+------+------------------+
# |  5|        EF|   20|    20|               1.0|
# |  6|        EF|   15|    20|              0.75|
# |  1|        AB|  141|   210|0.6714285714285714|
# |  2|        AB|  140|   210|0.6666666666666666|
# |  3|        AB|  210|   210|               1.0|
# |  4|        AB|  120|   210|0.5714285714285714|
# +---+----------+-----+------+------------------+
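
An alternative, closer to the groupBy attempt in the question, is to aggregate the per-department max and join it back onto the original DataFrame. A rough sketch (assuming the question's df, and rounding to two decimals to match the expected output):

import pyspark.sql.functions as func

# compute each department's max score, join it back, then divide and round
max_df = df.groupBy('department').agg(func.max('score').alias('maxval'))

result = df.join(max_df, on='department', how='inner') \
    .withColumn('score', func.round(func.col('score') / func.col('maxval'), 2)) \
    .drop('maxval')

result.show()

The window-function version avoids the extra join, but the join form can be handy if you also need the aggregated maxima on their own.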