Is there an elegant function for the following problem?
I've been tasked with creating a function that takes differences in days and ranks the values. The closest positive number ranks as 0 and serves as the 'starting point'. From the starting point, negative values are assigned negative ranks and the remaining non-negative values are assigned increasing positive ranks:
| Datediff() | Rank |
| --- | --- |
| -50 | -3 |
| -32 | -2 |
| -1 | -1 |
| 5 | 0 |
| 14 | 1 |
| 32 | 2 |
| 128 | 3 |
| 254 | 4 |
My solution so far would be to separate the negative and non-negative numbers and use Window.partitionBy() to assign the corresponding rank within each group, as in the sketch below. It would work, but I'm curious whether there is a more elegant solution. :)
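A minimal sketch of that two-partition idea, assuming a DataFrame `df` with a `dated_diff` column (the `is_neg` helper column and window names here are illustrative, not from the original post):

```python
from pyspark.sql.functions import col, row_number, when
from pyspark.sql.window import Window

# Split rows into two partitions by sign.
df = df.withColumn("is_neg", (col("dated_diff") < 0).cast("int"))

# Negatives: number them descending so the value closest to zero gets 1.
neg_window = Window.partitionBy("is_neg").orderBy(col("dated_diff").desc())
# Non-negatives: number them ascending so the value closest to zero gets 1.
pos_window = Window.partitionBy("is_neg").orderBy(col("dated_diff").asc())

df = df.withColumn(
    "rank",
    when(col("is_neg") == 1, -row_number().over(neg_window))  # -1, -2, -3, ...
    .otherwise(row_number().over(pos_window) - 1),            # 0, 1, 2, ...
).drop("is_neg")
```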
CodePudding user response:
You can use a Window to generate sequential row numbers to use as ranks:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(10,), (-10,), (5,), (-5,), (15,), (-15,)], ["dated_diff"])

# Number every row in ascending order of dated_diff.
window = Window.orderBy(df["dated_diff"])
df = df.select("dated_diff", row_number().over(window).alias("row_number"))
df.show()
```
```
+----------+----------+
|dated_diff|row_number|
+----------+----------+
|       -15|         1|
|       -10|         2|
|        -5|         3|
|         5|         4|
|        10|         5|
|        15|         6|
+----------+----------+
```
Then find the rank of the first non-negative number:
first_positive_rank = df.filter("dated_diff>=0").first()["row_number"]
print(first_positive_rank)
>> 4
OR
first_positive_rank = df.filter("dated_diff<0").count() 1
print(first_positive_rank)
>> 4
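If you prefer a lookup that doesn't depend on row order at all, the same value can be read off with an aggregate; this is a small variation on the above, not part of the original answer:

```python
from pyspark.sql.functions import min as min_

# The rank of the first non-negative value is the smallest
# row_number among the non-negative rows.
first_positive_rank = df.filter("dated_diff >= 0").agg(min_("row_number")).first()[0]
print(first_positive_rank)
# >> 4
```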
And finally, subtract that rank from every row's row number:
```python
# Shift all row numbers so the first non-negative value lands at rank 0.
df = df.withColumn("row_number", col("row_number") - first_positive_rank)
df.show()
```
```
+----------+----------+
|dated_diff|row_number|
+----------+----------+
|       -15|        -3|
|       -10|        -2|
|        -5|        -1|
|         5|         0|
|        10|         1|
|        15|         2|
+----------+----------+
```
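For reuse, the whole answer condenses into one small helper; the function name `rank_from_first_non_negative` is just illustrative:

```python
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

def rank_from_first_non_negative(df, value_col="dated_diff", rank_col="rank"):
    """Rank rows so the smallest non-negative value gets 0,
    negatives get -1, -2, ... and larger values get 1, 2, ..."""
    # Count of negatives + 1 = row_number of the first non-negative value.
    offset = df.filter(f"{value_col} < 0").count() + 1
    window = Window.orderBy(value_col)
    return df.withColumn(rank_col, row_number().over(window) - offset)

rank_from_first_non_negative(spark.createDataFrame(
    [(10,), (-10,), (5,), (-5,), (15,), (-15,)], ["dated_diff"]
)).show()
```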