I would like to select the rows of the dataframe that have the highest Value for each Matricule. For example, given this table:
Matricule | Count | Value |
---|---|---|
MA12 | A101 | 25 |
MA12 | K215 | 25 |
MA12 | C231 | 70 |
MA12 | G348 | 70 |
MA12 | B401 | 70 |
MA12 | E291 | 70 |
MA20 | D34 | 16 |
MA20 | A45 | 16 |
MA20 | A40 | 15 |
MA20 | G16 | 18 |
MA20 | K26 | 18 |
we should obtain this table:
Matricule | Count | Value |
---|---|---|
MA12 | C231 | 70 |
MA12 | G348 | 70 |
MA12 | B401 | 70 |
MA12 | E291 | 70 |
MA20 | G16 | 18 |
MA20 | K26 | 18 |
Can someone please help me? Thanks.
CodePudding user response:
One way to solve this is to add a maxValue column computed with the max function over a window partitioned by the Matricule column, then keep only the rows where Value equals that maximum.
You can find the code below:
from pyspark.sql import functions as F
from pyspark.sql import Window as W

# Compute the per-Matricule maximum of Value with a window function,
# keep only the rows that reach that maximum, then drop the helper column.
result = (df
    .withColumn('maxValue', F.max(F.col('Value')).over(W.partitionBy(F.col('Matricule'))))
    .filter(F.col('Value') == F.col('maxValue'))
    .drop('maxValue')
)
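If you want to verify this end to end, here is a minimal, self-contained sketch (assuming a local SparkSession; the data and column names are taken from the question):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql import Window as W

spark = SparkSession.builder.getOrCreate()

# Sample data from the question.
df = spark.createDataFrame(
    [('MA12', 'A101', 25), ('MA12', 'K215', 25), ('MA12', 'C231', 70),
     ('MA12', 'G348', 70), ('MA12', 'B401', 70), ('MA12', 'E291', 70),
     ('MA20', 'D34', 16), ('MA20', 'A45', 16), ('MA20', 'A40', 15),
     ('MA20', 'G16', 18), ('MA20', 'K26', 18)],
    ['Matricule', 'Count', 'Value'],
)

result = (df
    .withColumn('maxValue', F.max('Value').over(W.partitionBy('Matricule')))
    .filter(F.col('Value') == F.col('maxValue'))
    .drop('maxValue')
)
result.show()
# Keeps the four MA12 rows with Value 70 and the two MA20 rows with Value 18.
```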
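As a variant (not the method above, just an equivalent window trick), you can rank the rows within each Matricule by descending Value and keep rank 1; ties all receive rank 1, so every maximal row survives:

```python
from pyspark.sql import functions as F
from pyspark.sql import Window as W

# Rank rows within each Matricule by descending Value; all ties for the
# maximum get dense_rank 1, so every maximal row is kept.
w = W.partitionBy('Matricule').orderBy(F.col('Value').desc())
result = (df
    .withColumn('rnk', F.dense_rank().over(w))
    .filter(F.col('rnk') == 1)
    .drop('rnk')
)
```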