I want to rank within a GroupBy in PySpark. I can do it in pandas, but I need to do it in PySpark.
Here's my input
id     year  month  date  hour  minute
54807  2021  12     31    6     29
54807  2021  12     31    6     31
54807  2021  12     31    7     15
54807  2021  12     31    7     30
Here's the pandas code
df["rank"] = df.groupby(["id", "hour"])["minute"].rank()
Here's my output
id     year  month  date  hour  minute  rank
54807  2021  12     31    6     29      1.0
54807  2021  12     31    6     31      2.0
54807  2021  12     31    7     15      1.0
54807  2021  12     31    7     30      2.0
CodePudding user response:
You can use a ranking window function: rank, dense_rank, or row_number. Here's an example with the rank window function.
import pyspark.sql.functions as func
from pyspark.sql.window import Window as wd

# rank minutes within each (id, year, month, date, hour) partition
data_sdf. \
    withColumn('minute_rank',
               func.rank().over(wd.partitionBy('id', 'year', 'month', 'date', 'hour').orderBy('minute'))
               ). \
    show()
# +-----+----+-----+----+----+------+-----------+
# |   id|year|month|date|hour|minute|minute_rank|
# +-----+----+-----+----+----+------+-----------+
# |54807|2021|   12|  31|   7|    15|          1|
# |54807|2021|   12|  31|   7|    30|          2|
# |54807|2021|   12|  31|   6|    29|          1|
# |54807|2021|   12|  31|   6|    31|          2|
# +-----+----+-----+----+----+------+-----------+
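The three functions differ only in how they treat ties: rank leaves gaps after tied values, dense_rank does not, and row_number assigns a unique sequential number even to tied minutes. Below is a small sketch, reusing the same data_sdf and the imports above, that computes all three side by side so you can pick the behavior you need.

# compare the three ranking window functions on the same window
# (assumes data_sdf and the func / wd aliases from the example above)
w = wd.partitionBy('id', 'year', 'month', 'date', 'hour').orderBy('minute')

data_sdf. \
    withColumn('rank', func.rank().over(w)). \
    withColumn('dense_rank', func.dense_rank().over(w)). \
    withColumn('row_number', func.row_number().over(w)). \
    show()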
CodePudding user response:
To rank the data in PySpark, you can use the rank() function over a window built with Window.partitionBy() to specify the grouping criteria. Here's an example of how you can do this:
from pyspark.sql.functions import rank
from pyspark.sql.window import Window

# Create a window for the groupby criteria
w = Window.partitionBy("id", "hour").orderBy("minute")

# Apply the rank function to the dataframe
df = df.withColumn("rank", rank().over(w))
This will add a new column "rank" to the dataframe, with the ranking for each group of "id" and "hour" based on the "minute" column.
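For completeness, here's a minimal end-to-end sketch of that approach, assuming you only have a SparkSession (named spark here) and build the sample data from the question yourself.

from pyspark.sql import SparkSession
from pyspark.sql.functions import rank
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Sample data from the question
df = spark.createDataFrame(
    [(54807, 2021, 12, 31, 6, 29),
     (54807, 2021, 12, 31, 6, 31),
     (54807, 2021, 12, 31, 7, 15),
     (54807, 2021, 12, 31, 7, 30)],
    ["id", "year", "month", "date", "hour", "minute"],
)

# Rank minutes within each (id, hour) group, like the pandas groupby rank
w = Window.partitionBy("id", "hour").orderBy("minute")
df.withColumn("rank", rank().over(w)).show()

Note that Spark's rank() returns integers (1, 2, ...), while pandas' rank() defaults to the "average" method and returns floats; for this sample, where there are no ties, the values match apart from the type.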