i have a dataframe:
id vehicle asIs EU EU_variant
1 A3345 PQ1298 FV1 FV1_variant
2 A3346 PQ1287 FV2 FV2_variant
3 A3346 PQ1207 FV2 FV2_variant
4 A3347 QP9 QP9_variant
5 A3347 QP9 QP9_variant
6 A3347 QP3 QP3_variant
7 A3348 MP6553 YR34 YR34_variant
8 A3348 MP6554 YR35 YR35_variant
9 A3348 MP6554 YR35 YR35_variant
for distinct vehicle and distinct EU i want to rank EU_variant and concat it in new column
result should be:
id vehicle asIs EU EU_variant ECU_Variant_rank
1 A3345 PQ1298 FV1 FV1_variant FV1_variant(1)
2 A3346 PQ1287 FV2 FV2_variant FV2_variant(1)
3 A3346 PQ1207 FV2 FV2_variant FV2_variant(2)
4 A3347 QP9 QP9_variant QP9_variant(1)
5 A3347 QP9 QP9_variant QP9_variant(2)
6 A3347 QP3 QP3_variant QP3_variant(1)
7 A3348 MP6553 YR34 YR34_variant YR34_variant(1)
8 A3348 MP6554 YR35 YR35_variant YR35_variant(1)
9 A3348 MP6554 YR35 YR35_variant YR35_variant(2)
how to achieve this using pyspark dataframe
CodePudding user response:
You can use a Window
with rank:
from pyspark.sql import functions as F, Window
# you can order by the column you prefer, not only id
w = Window.partitionBy('vehicle', 'EU_variant').orderBy('id')
df.withColumn(
'ECU_Variant_rank',
F.concat_ws('', F.col('EU_variant'), F.lit('('), F.rank().over(w), F.lit(')'))
)
Here the result:
--- ------- ------ ---- ------------ ----------------
|id |vehicle|asIs |EU |EU_variant |ECU_Variant_rank|
--- ------- ------ ---- ------------ ----------------
|1 |A3345 |PQ1298|FV1 |FV1_variant |FV1_variant(1) |
|2 |A3346 |PQ1287|FV2 |FV2_variant |FV2_variant(1) |
|3 |A3346 |PQ1207|FV2 |FV2_variant |FV2_variant(2) |
|4 |A3347 |null |QP9 |QP9_variant |QP9_variant(1) |
|5 |A3347 |null |QP9 |QP9_variant |QP9_variant(2) |
|6 |A3347 |null |QP3 |QP3_variant |QP3_variant(1) |
|7 |A3348 |MP6553|YR34|YR34_variant|YR34_variant(1) |
|8 |A3348 |MP6554|YR35|YR35_variant|YR35_variant(1) |
|9 |A3348 |MP6554|YR35|YR35_variant|YR35_variant(2) |
--- ------- ------ ---- ------------ ----------------