How to replace the values in a column with the maximum value of the same column in PySpark?

I have a column named version with integer values 1, 2, ... up to 8. I want to replace every value with the maximum value present in that column. In this case the maximum is 8, so I want to replace 1, 2, 3, 4, 5, 6, 7 with 8. I tried a couple of methods but couldn't find a solution.

testDF = spark.createDataFrame([(1,"a"), (2,"b"), (3,"c"), (4,"d"), (5,"e"), (6,"f"), (7,"g"), (8,"h")], ["version", "name"])
testDF.show()
+-------+----+
|version|name|
+-------+----+
|      1|   a|
|      2|   b|
|      3|   c|
|      4|   d|
|      5|   e|
|      6|   f|
|      7|   g|
|      8|   h|
+-------+----+

Expected output:

+-------+----+
|version|name|
+-------+----+
|      8|   a|
|      8|   b|
|      8|   c|
|      8|   d|
|      8|   e|
|      8|   f|
|      8|   g|
|      8|   h|
+-------+----+

CodePudding user response:

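Try this. It computes the maximum with an aggregation, collects it to the driver, and writes it back as a literal column:

from pyspark.sql.functions import lit
testDF = testDF.withColumn("version", lit(testDF.agg({"version": "max"}).collect()[0][0]))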

Output:

+-------+----+
|version|name|
+-------+----+
|      8|   a|
|      8|   b|
|      8|   c|
|      8|   d|
|      8|   e|
|      8|   f|
|      8|   g|
|      8|   h|
+-------+----+

To set every row to the maximum plus one instead, add 1 to the collected value:

testDF.withColumn("version", lit(testDF.agg({"version": "max"}).collect()[0][0] + 1)).show()

Output:

+-------+----+
|version|name|
+-------+----+
|      9|   a|
|      9|   b|
|      9|   c|
|      9|   d|
|      9|   e|
|      9|   f|
|      9|   g|
|      9|   h|
+-------+----+