Input dataframe:
Item | L | W | H |
---|---|---|---|
I1 | 3 | 5 | 8 |
I2 | 2 | 1 | 2 |
I3 | 6 | 9 | 1 |
I4 | 7 | 3 | 4 |
The output dataframe should be as below. Create three new columns, L_n, W_n and H_n, from the values in the L, W and H columns: L_n is the longest dimension, W_n the medium, and H_n the shortest.
Item | L | W | H | L_n | W_n | H_n |
---|---|---|---|---|---|---|
I1 | 3 | 5 | 8 | 8 | 5 | 3 |
I2 | 2 | 1 | 2 | 2 | 2 | 1 |
I3 | 6 | 9 | 1 | 9 | 6 | 1 |
I4 | 7 | 3 | 4 | 7 | 4 | 3 |
CodePudding user response:
I suggest creating an array (`array`), sorting it (`array_sort`) and selecting its elements one by one (`element_at`). All three functions are available since Spark 2.4.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [('I1', 3, 5, 8),
     ('I2', 2, 1, 2),
     ('I3', 6, 9, 1),
     ('I4', 7, 3, 4)],
    ['Item', 'L', 'W', 'H']
)

# Collect the three dimensions into an array and sort it ascending.
arr = F.array_sort(F.array('L', 'W', 'H'))

df = df.select(
    '*',
    F.element_at(arr, 3).alias('L_n'),  # last element = longest
    F.element_at(arr, 2).alias('W_n'),  # middle element = medium
    F.element_at(arr, 1).alias('H_n'),  # first element = shortest
)
df.show()
# +----+---+---+---+---+---+---+
# |Item|  L|  W|  H|L_n|W_n|H_n|
# +----+---+---+---+---+---+---+
# |  I1|  3|  5|  8|  8|  5|  3|
# |  I2|  2|  1|  2|  2|  2|  1|
# |  I3|  6|  9|  1|  9|  6|  1|
# |  I4|  7|  3|  4|  7|  4|  3|
# +----+---+---+---+---+---+---+
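As a variation on the above (my own sketch, not part of the original answer): with exactly three numeric columns you can skip the array entirely and use `F.greatest` and `F.least`, deriving the medium value as the sum of the three dimensions minus the two extremes.

# Alternative sketch: greatest/least instead of array_sort.
# Assumes exactly three numeric dimension columns (L, W, H).
df2 = df.select(
    'Item', 'L', 'W', 'H',
    F.greatest('L', 'W', 'H').alias('L_n'),   # longest
    (F.col('L') + F.col('W') + F.col('H')
     - F.greatest('L', 'W', 'H')
     - F.least('L', 'W', 'H')).alias('W_n'),  # medium = total - extremes
    F.least('L', 'W', 'H').alias('H_n'),      # shortest
)
df2.show()

Both versions produce the same result; the array approach generalizes more easily if you ever add a fourth dimension column, while the greatest/least version avoids building an intermediate array per row.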