I need a function that calculates the maximum value per row (the largest) of a given number of columns, for example :
col1 | col2 | col3 | Max_col |
---|---|---|---|
0 | 1 | 5 | 5 |
3 | 0 | 0 | 3 |
1 | 0 | 0 | 1 |
In this case, I have three columns (col1,col2,col3); however, the number of columns can vary between 1 and 48 depending on the user's need.
I am currently working on pyspark:
from pyspark.sql import functions as f
df= df.withColumn('MAX_COLS',f.greatest('COL01','COL02','COL03','COL04','COLM05'))
How can I build a function that adapts to the number of columns specified by the user and calculates the corresponding maximum?
CodePudding user response:
Your question is vague. How will the columns you want to find max value accross be named?
In pyspark, leverage arrays. Put all the columns into an array and find max using array_max
df=spark.createDataFrame([('Other',111957.0,35293.0,225852.0,35110.0,1023680.0,448736.0,256473.0,269856.0,306668.0,8807.0,89551.0),
('Down',575614.0,203186.0,0.0,125056.0,0.0,766086.0,1157311.0,11127.0,88741.0,31603.0,300733.0),
('Up',0.0,0.0,1953645.0,0.0,346423.0,0.0,0.0,0.0,0.0,0.0,0.0),
('Same',2948065.0,730113.0,33121.0,668868.0,5451224.0,4485121.0,30780025.0,1977361.0,5295598.0,217697.0,1790024.0),
('Old',186596.0,88257.0,0.0,36842.0,2173626.0,240619.0,0.0,2770.0,2212560.0,9865.0,121045.0),
('New',0.0,0.0,0.0,0.0,3148.0,0.0,97252.0,0.0,0.0,0.0,0.0)],
('desc','B1','B2','B3','B4','B5','B6','B7','B8','B9','B10','B11'))
df.withColumn('max_col',array_max(array(*df.select(df.colRegex("`^B.*`")).columns))).show(truncate=False)
----- --------- -------- --------- -------- --------- --------- ----------- --------- --------- -------- --------- -----------
|desc |B1 |B2 |B3 |B4 |B5 |B6 |B7 |B8 |B9 |B10 |B11 |max_col |
----- --------- -------- --------- -------- --------- --------- ----------- --------- --------- -------- --------- -----------
|Other|111957.0 |35293.0 |225852.0 |35110.0 |1023680.0|448736.0 |256473.0 |269856.0 |306668.0 |8807.0 |89551.0 |1023680.0 |
|Down |575614.0 |203186.0|0.0 |125056.0|0.0 |766086.0 |1157311.0 |11127.0 |88741.0 |31603.0 |300733.0 |1157311.0 |
|Up |0.0 |0.0 |1953645.0|0.0 |346423.0 |0.0 |0.0 |0.0 |0.0 |0.0 |0.0 |1953645.0 |
|Same |2948065.0|730113.0|33121.0 |668868.0|5451224.0|4485121.0|3.0780025E7|1977361.0|5295598.0|217697.0|1790024.0|3.0780025E7|
|Old |186596.0 |88257.0 |0.0 |36842.0 |2173626.0|240619.0 |0.0 |2770.0 |2212560.0|9865.0 |121045.0 |2212560.0 |
|New |0.0 |0.0 |0.0 |0.0 |3148.0 |0.0 |97252.0 |0.0 |0.0 |0.0 |0.0 |97252.0 |
----- --------- -------- --------- -------- --------- --------- ----------- --------- --------- -------- --------- -----------