How to rank GroupBy on Pyspark


I want to rank grouped data in PySpark. I can do it in pandas, but I need to do it in PySpark.

Here's my input

id      year  month date  hour  minute
54807   2021     12   31     6      29
54807   2021     12   31     6      31
54807   2021     12   31     7      15
54807   2021     12   31     7      30

Here's the pandas code

df["rank"] = df.groupby(["id", "hour"])["minute"].rank()

Here's my output

id      year  month date  hour  minute  rank
54807   2021     12   31     6      29  1.0
54807   2021     12   31     6      31  2.0
54807   2021     12   31     7      15  1.0
54807   2021     12   31     7      30  2.0

CodePudding user response:

You can use a ranking window function: rank, dense_rank, or row_number.

Here's an example with the rank window function (a sketch comparing all three on tied values follows the output below).

import pyspark.sql.functions as func
from pyspark.sql.window import Window as wd

# data_sdf is the input DataFrame from the question; partitioning by the full
# date columns plus hour is equivalent to the pandas groupby(["id", "hour"])
# for a single day's data
data_sdf. \
    withColumn('minute_rank', 
               func.rank().over(wd.partitionBy('id', 'year', 'month', 'date', 'hour').orderBy('minute'))
               ). \
    show()

# +-----+----+-----+----+----+------+-----------+
# |   id|year|month|date|hour|minute|minute_rank|
# +-----+----+-----+----+----+------+-----------+
# |54807|2021|   12|  31|   7|    15|          1|
# |54807|2021|   12|  31|   7|    30|          2|
# |54807|2021|   12|  31|   6|    29|          1|
# |54807|2021|   12|  31|   6|    31|          2|
# +-----+----+-----+----+----+------+-----------+
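For completeness, here's a minimal sketch of how the three ranking functions differ on ties. The toy DataFrame with a duplicated minute value is hypothetical, and an active SparkSession named spark is assumed.

import pyspark.sql.functions as func
from pyspark.sql.window import Window as wd

# hypothetical toy data with a tie on minute to show how the functions differ
tie_sdf = spark.createDataFrame(
    [(54807, 6, 29), (54807, 6, 29), (54807, 6, 31)],
    ['id', 'hour', 'minute']
)

w = wd.partitionBy('id', 'hour').orderBy('minute')

tie_sdf. \
    withColumn('rank', func.rank().over(w)). \
    withColumn('dense_rank', func.dense_rank().over(w)). \
    withColumn('row_number', func.row_number().over(w)). \
    show()

# +-----+----+------+----+----------+----------+
# |   id|hour|minute|rank|dense_rank|row_number|
# +-----+----+------+----+----------+----------+
# |54807|   6|    29|   1|         1|         1|
# |54807|   6|    29|   1|         1|         2|
# |54807|   6|    31|   3|         2|         3|
# +-----+----+------+----+----------+----------+

Note that pandas' rank() defaults to method='average' for ties, which has no built-in window-function equivalent in Spark; Spark's rank() matches pandas method='min', and row_number() roughly matches method='first'.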

CodePudding user response:

To rank the data in PySpark, you can use the rank() function over a Window that specifies the grouping criteria. Here's an example of how you can do this:

from pyspark.sql.functions import rank
from pyspark.sql.window import Window

# Create a window for the groupby criteria
w = Window.partitionBy("id", "hour").orderBy("minute")

# Apply the rank function to the dataframe
df = df.withColumn("rank", rank().over(w))

This will add a new column "rank" to the dataframe, with the ranking for each group of "id" and "hour" based on the "minute" column.
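For a self-contained version, here's a runnable sketch that builds the question's sample data and applies the window above; the SparkSession setup is assumed boilerplate.

from pyspark.sql import SparkSession
from pyspark.sql.functions import rank
from pyspark.sql.window import Window

# get or create an active SparkSession
spark = SparkSession.builder.getOrCreate()

# the question's sample data
df = spark.createDataFrame(
    [(54807, 2021, 12, 31, 6, 29),
     (54807, 2021, 12, 31, 6, 31),
     (54807, 2021, 12, 31, 7, 15),
     (54807, 2021, 12, 31, 7, 30)],
    ['id', 'year', 'month', 'date', 'hour', 'minute']
)

# window matching the pandas groupby(["id", "hour"])["minute"].rank()
w = Window.partitionBy("id", "hour").orderBy("minute")

df.withColumn("rank", rank().over(w)).show()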
