from pyspark.sql import Window
from pyspark.sql.functions import asc, col, row_number

# assumes an existing SparkSession named `spark` (e.g. pyspark shell)
df = spark.createDataFrame(
    [[100, 'a_', 1],
     [150, 'a_', 6],
     [200, 'a_', 6],
     [120, 'b_', 2],
     [220, 'c_', 3],
     [230, 'd_', 4],
     [500, 'e_', 5],
     [110, 'a_', 6],
     [130, 'b_', 6],
     [140, 'b_', 12]], ['id', 'type', 'cnt'])
As is:
df.withColumn(
    "rank", row_number().over(Window.partitionBy("type").orderBy(col("cnt").desc(), col("id").desc()))
).head(10)
To be: I want to write a method like

def rank(df, order):
    return df.withColumn(
        "rank", row_number().over(Window.partitionBy("type").orderBy(order))
    ).head(10)
I want to pass multiple columns for ordering, e.g. col("cnt").desc(), col("id").desc().
With a single column this is simple, but I want the method to scale to any number of ordering columns. How can I do that?
And if I also want another dynamic parameter, how do I write it?

def rank(df, ?, *order):
    df = df.withColumn(
        "rank", row_number().over(Window.partitionBy(?).orderBy(*order))
    )
    return df
CodePudding user response:
Instead of order, try using *order.
I'm not sure exactly what you want to do, but the following modified version of your rank function works with any number of ordering columns provided.
def rank(df, *order):
    df = df.withColumn(
        "rank", row_number().over(Window.partitionBy("type").orderBy(*order))
    )
    return df
rank(df, asc("id")).show()
# +---+----+---+----+
# | id|type|cnt|rank|
# +---+----+---+----+
# |100|  a_|  1|   1|
# |110|  a_|  6|   2|
# |150|  a_|  6|   3|
# |200|  a_|  6|   4|
# |120|  b_|  2|   1|
# |130|  b_|  6|   2|
# |140|  b_| 12|   3|
# |220|  c_|  3|   1|
# |230|  d_|  4|   1|
# |500|  e_|  5|   1|
# +---+----+---+----+
rank(df, col("cnt").desc(), col("id").desc()).show()
# +---+----+---+----+
# | id|type|cnt|rank|
# +---+----+---+----+
# |200|  a_|  6|   1|
# |150|  a_|  6|   2|
# |110|  a_|  6|   3|
# |100|  a_|  1|   4|
# |140|  b_| 12|   1|
# |130|  b_|  6|   2|
# |120|  b_|  2|   3|
# |220|  c_|  3|   1|
# |230|  d_|  4|   1|
# |500|  e_|  5|   1|
# +---+----+---+----+