from pyspark.sql import Window
from pyspark.sql.functions import asc, col, row_number

# assumes an existing SparkSession named `spark` (e.g. pyspark shell)
df = spark.createDataFrame(
    [[100, 'a_', 1],
     [150, 'a_', 6],
     [200, 'a_', 6],
     [120, 'b_', 2],
     [220, 'c_', 3],
     [230, 'd_', 4],
     [500, 'e_', 5],
     [110, 'a_', 6],
     [130, 'b_', 6],
     [140, 'b_', 12]], ['id', 'type', 'cnt'])
As is:
df.withColumn(
    "rank", row_number().over(Window.partitionBy("type").orderBy(col("cnt").desc(), col("id").desc()))
).head(10)
To be: I want to write a method like

def rank(df, order):
    return df.withColumn(
        "rank", row_number().over(Window.partitionBy("type").orderBy(order))
    ).head(10)
I want to pass multiple columns for ordering, e.g. col("cnt").desc(), col("id").desc().
With a single column this is simple, but I want the method to scale to any number of ordering columns. How can I do that?
And if I also want another dynamic parameter, how do I write it?

def rank(df, ?, *order):
    df = df.withColumn(
        "rank", row_number().over(Window.partitionBy(?).orderBy(*order))
    )
    return df
CodePudding user response:
Instead of order, try using *order.
I'm not sure exactly what you want to do, but the following modified version of your rank function works with any number of ordering columns provided.
def rank(df, *order):
    df = df.withColumn(
        "rank", row_number().over(Window.partitionBy("type").orderBy(*order))
    )
    return df
rank(df, asc("id")).show()
# +---+----+---+----+
# | id|type|cnt|rank|
# +---+----+---+----+
# |100|  a_|  1|   1|
# |110|  a_|  6|   2|
# |150|  a_|  6|   3|
# |200|  a_|  6|   4|
# |120|  b_|  2|   1|
# |130|  b_|  6|   2|
# |140|  b_| 12|   3|
# |220|  c_|  3|   1|
# |230|  d_|  4|   1|
# |500|  e_|  5|   1|
# +---+----+---+----+
rank(df, col("cnt").desc(), col("id").desc()).show()
# +---+----+---+----+
# | id|type|cnt|rank|
# +---+----+---+----+
# |200|  a_|  6|   1|
# |150|  a_|  6|   2|
# |110|  a_|  6|   3|
# |100|  a_|  1|   4|
# |140|  b_| 12|   1|
# |130|  b_|  6|   2|
# |120|  b_|  2|   3|
# |220|  c_|  3|   1|
# |230|  d_|  4|   1|
# |500|  e_|  5|   1|
# +---+----+---+----+