I want to rank within a GroupBy in PySpark. I can do it in pandas, but I need to do it in PySpark.
Here's my input
id     year  month  date  hour  minute
54807  2021  12     31    6     29
54807  2021  12     31    6     31
54807  2021  12     31    7     15
54807  2021  12     31    7     30
Here's the pandas code
df["rank"] = df.groupby(["id", "hour"])["minute"].rank()
Here's my output
id     year  month  date  hour  minute  rank
54807  2021  12     31    6     29      1.0
54807  2021  12     31    6     31      2.0
54807  2021  12     31    7     15      1.0
54807  2021  12     31    7     30      2.0
CodePudding user response:
You can use a ranking window function: rank, dense_rank, or row_number. Here's an example with the rank window function.
import pyspark.sql.functions as func
from pyspark.sql.window import Window as wd

# rank minutes within each (id, year, month, date, hour) partition
data_sdf. \
    withColumn('minute_rank',
               func.rank().over(wd.partitionBy('id', 'year', 'month', 'date', 'hour').orderBy('minute'))
               ). \
    show()
# +-----+----+-----+----+----+------+-----------+
# |   id|year|month|date|hour|minute|minute_rank|
# +-----+----+-----+----+----+------+-----------+
# |54807|2021|   12|  31|   7|    15|          1|
# |54807|2021|   12|  31|   7|    30|          2|
# |54807|2021|   12|  31|   6|    29|          1|
# |54807|2021|   12|  31|   6|    31|          2|
# +-----+----+-----+----+----+------+-----------+
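The three functions differ only in how they treat ties: rank leaves gaps after tied values, dense_rank does not, and row_number assigns a unique sequential number even to tied minutes. Below is a small sketch, reusing the same data_sdf and the imports above, that computes all three side by side so you can pick the behavior you need.

# compare the three ranking window functions on the same window
# (assumes data_sdf and the func / wd aliases from the example above)
w = wd.partitionBy('id', 'year', 'month', 'date', 'hour').orderBy('minute')

data_sdf. \
    withColumn('rank', func.rank().over(w)). \
    withColumn('dense_rank', func.dense_rank().over(w)). \
    withColumn('row_number', func.row_number().over(w)). \
    show()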
CodePudding user response:
To rank the data in PySpark, you can use the rank() function over a window built with Window.partitionBy() to specify the grouping criteria. Here's an example of how you can do this:
from pyspark.sql.functions import rank
from pyspark.sql.window import Window

# Create a window for the groupby criteria
w = Window.partitionBy("id", "hour").orderBy("minute")

# Apply the rank function to the dataframe
df = df.withColumn("rank", rank().over(w))
This will add a new column "rank" to the dataframe, with the ranking for each group of "id" and "hour" based on the "minute" column.
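For completeness, here's a minimal end-to-end sketch of that approach, assuming you only have a SparkSession (named spark here) and build the sample data from the question yourself.

from pyspark.sql import SparkSession
from pyspark.sql.functions import rank
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Sample data from the question
df = spark.createDataFrame(
    [(54807, 2021, 12, 31, 6, 29),
     (54807, 2021, 12, 31, 6, 31),
     (54807, 2021, 12, 31, 7, 15),
     (54807, 2021, 12, 31, 7, 30)],
    ["id", "year", "month", "date", "hour", "minute"],
)

# Rank minutes within each (id, hour) group, like the pandas groupby rank
w = Window.partitionBy("id", "hour").orderBy("minute")
df.withColumn("rank", rank().over(w)).show()

Note that Spark's rank() returns integers (1, 2, ...), while pandas' rank() defaults to the "average" method and returns floats; for this sample, where there are no ties, the values match apart from the type.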