I have a PySpark dataframe which looks like this:
columns = ['id', 'department', 'score']
vals = [
(1, 'AB', 141),
(2, 'AB', 140),
(3, 'AB', 210),
(4, 'AB', 120),
(5, 'EF', 20),
(6, 'EF', 15)
]
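(For reference, the dataframe can be built roughly like this; a minimal sketch, assuming an active SparkSession available as spark:)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # reuse the running session or create one
df = spark.createDataFrame(vals, columns)   # column names taken from the list above
df.show()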
I'd like to find the max score within each department group
and divide every score in that group by it. For example, in the above case:
max_val for AB is 210
max_val for EF is 20
The new dataset should be:
(1, 'AB', 0.67),
(2, 'AB', 0.67),
(3, 'AB', 1.00),
(4, 'AB', 0.57),
(5, 'EF', 1.00),
(6, 'EF', 0.75)
So far I've tried:
>>> max_distance = df.groupby("department").agg({"score": "max"}).collect()
>>> max_distance
[Row(department='AB', max(score)=210.0), Row(department='EF', max(score)=20.0)]
But how do I divide every score in a group by that group's max?
CodePudding user response:
You should use the max window function, partitioned by department.
import pyspark.sql.functions as func
from pyspark.sql.window import Window as wd

# data_sdf is the input dataframe (df in the question)
# maxval: max score within each department (window partitioned by department)
# val_perc: each score divided by its department's max
data_sdf. \
    withColumn('maxval', func.max('score').over(wd.partitionBy('department'))). \
    withColumn('val_perc', func.col('score') / func.col('maxval')). \
    show()
# +---+----------+-----+------+------------------+
# | id|department|score|maxval|          val_perc|
# +---+----------+-----+------+------------------+
# |  5|        EF|   20|    20|               1.0|
# |  6|        EF|   15|    20|              0.75|
# |  1|        AB|  141|   210|0.6714285714285714|
# |  2|        AB|  140|   210|0.6666666666666666|
# |  3|        AB|  210|   210|               1.0|
# |  4|        AB|  120|   210|0.5714285714285714|
# +---+----------+-----+------+------------------+
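If you want the two-decimal values shown in the expected output, the ratio can be rounded with func.round. And as an alternative to the window function, the grouped max from your original attempt can be joined back onto the dataframe. Both are sketches under the same assumption that data_sdf is the input dataframe:

# optional: round the ratio to two decimals to match the expected output
data_sdf. \
    withColumn('maxval', func.max('score').over(wd.partitionBy('department'))). \
    withColumn('val_perc', func.round(func.col('score') / func.col('maxval'), 2)). \
    show()

# alternative: aggregate the max per department, then join it back
max_sdf = data_sdf.groupBy('department').agg(func.max('score').alias('maxval'))
data_sdf. \
    join(max_sdf, 'department'). \
    withColumn('val_perc', func.col('score') / func.col('maxval')). \
    show()

The window version avoids the extra shuffle of a join, while the join version keeps the logic closer to your groupby/agg attempt.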