I would like to select the rows of the dataframe that have the highest Value for each Matricule. For example, given this table:
Matricule | Count | Value |
---|---|---|
MA12 | A101 | 25 |
MA12 | K215 | 25 |
MA12 | C231 | 70 |
MA12 | G348 | 70 |
MA12 | B401 | 70 |
MA12 | E291 | 70 |
MA20 | D34 | 16 |
MA20 | A45 | 16 |
MA20 | A40 | 15 |
MA20 | G16 | 18 |
MA20 | K26 | 18 |
we should obtain this table:
Matricule | Count | Value |
---|---|---|
MA12 | C231 | 70 |
MA12 | G348 | 70 |
MA12 | B401 | 70 |
MA12 | E291 | 70 |
MA20 | G16 | 18 |
MA20 | K26 | 18 |
Can someone please help me? Thanks.
CodePudding user response:
One way to solve this is to add a maxValue column computed with the max function over a window partitioned by the Matricule column, then keep only the rows where Value equals that maximum.
You can find the code below:
from pyspark.sql import functions as F
from pyspark.sql import Window as W

# Compute the per-Matricule maximum of Value with a window function,
# keep only the rows that reach that maximum, then drop the helper column.
result = (df
    .withColumn('maxValue', F.max(F.col('Value')).over(W.partitionBy(F.col('Matricule'))))
    .filter(F.col('Value') == F.col('maxValue'))
    .drop('maxValue')
)
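If you want to verify this end to end, here is a minimal, self-contained sketch (assuming a local SparkSession; the data and column names are taken from the question):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql import Window as W

spark = SparkSession.builder.getOrCreate()

# Sample data from the question.
df = spark.createDataFrame(
    [('MA12', 'A101', 25), ('MA12', 'K215', 25), ('MA12', 'C231', 70),
     ('MA12', 'G348', 70), ('MA12', 'B401', 70), ('MA12', 'E291', 70),
     ('MA20', 'D34', 16), ('MA20', 'A45', 16), ('MA20', 'A40', 15),
     ('MA20', 'G16', 18), ('MA20', 'K26', 18)],
    ['Matricule', 'Count', 'Value'],
)

result = (df
    .withColumn('maxValue', F.max('Value').over(W.partitionBy('Matricule')))
    .filter(F.col('Value') == F.col('maxValue'))
    .drop('maxValue')
)
result.show()
# Keeps the four MA12 rows with Value 70 and the two MA20 rows with Value 18.
```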
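As a variant (not the method above, just an equivalent window trick), you can rank the rows within each Matricule by descending Value and keep rank 1; ties all receive rank 1, so every maximal row survives:

```python
from pyspark.sql import functions as F
from pyspark.sql import Window as W

# Rank rows within each Matricule by descending Value; all ties for the
# maximum get dense_rank 1, so every maximal row is kept.
w = W.partitionBy('Matricule').orderBy(F.col('Value').desc())
result = (df
    .withColumn('rnk', F.dense_rank().over(w))
    .filter(F.col('rnk') == 1)
    .drop('rnk')
)
```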