PySpark: How to select the rows with the highest Value per Matricule?


I would like to select the rows with the highest Value for each Matricule. For example, given this table:

Matricule  Count  Value
MA12       A101   25
MA12       K215   25
MA12       C231   70
MA12       G348   70
MA12       B401   70
MA12       E291   70
MA20       D34    16
MA20       A45    16
MA20       A40    15
MA20       G16    18
MA20       K26    18

the result should be this table:

Matricule  Count  Value
MA12       C231   70
MA12       G348   70
MA12       B401   70
MA12       E291   70
MA20       G16    18
MA20       K26    18

Can you please help me? Thanks.

CodePudding user response:

One way to solve this is to create a maxValue column using the max function over a window partitioned by the Matricule column, and then keep only the rows where Value equals that maximum.

You can find the code below:

from pyspark.sql import functions as F
from pyspark.sql import Window as W

# Compute the maximum Value within each Matricule partition,
# keep only the rows that match it, then drop the helper column.
(df
  .withColumn('maxValue', F.max(F.col('Value')).over(W.partitionBy(F.col('Matricule'))))
  .filter(F.col('Value') == F.col('maxValue'))
  .drop('maxValue')
)
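For completeness, here is a minimal, self-contained sketch that reproduces the example from the question. The local SparkSession setup is an assumption about your environment; the data and column names (Matricule, Count, Value) come from the question itself:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import Window as W

# Assumption: a local session for testing; adapt to your own environment.
spark = SparkSession.builder.master('local[*]').getOrCreate()

# Sample data from the question.
df = spark.createDataFrame(
    [('MA12', 'A101', 25), ('MA12', 'K215', 25),
     ('MA12', 'C231', 70), ('MA12', 'G348', 70),
     ('MA12', 'B401', 70), ('MA12', 'E291', 70),
     ('MA20', 'D34', 16), ('MA20', 'A45', 16),
     ('MA20', 'A40', 15), ('MA20', 'G16', 18),
     ('MA20', 'K26', 18)],
    ['Matricule', 'Count', 'Value'],
)

# Max Value per Matricule as a window aggregate, then filter on it.
result = (df
  .withColumn('maxValue', F.max(F.col('Value')).over(W.partitionBy(F.col('Matricule'))))
  .filter(F.col('Value') == F.col('maxValue'))
  .drop('maxValue')
)

result.show()
# Expected: the four MA12 rows with Value 70 and the two MA20 rows with Value 18.

Note that because the maximum is computed as a window aggregate rather than with a groupBy and join, every row tied for the maximum is kept, which matches the expected output above.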