Is there an elegant function for the following problem?
I've been tasked with creating a function that takes differences in days and ranks the values. The closest positive number ranks as 0 and serves as the 'starting point'. From the starting point, negative values are assigned negative ranks and the remaining non-negative values are assigned increasing positive ranks:
| Datediff() | Rank |
| --- | --- |
| -50 | -3 |
| -32 | -2 |
| -1 | -1 |
| 5 | 0 |
| 14 | 1 |
| 32 | 2 |
| 128 | 3 |
| 254 | 4 |
My solution so far would be to separate the negative and non-negative numbers and use Window.partitionBy() to assign the corresponding rank within each group, as in the sketch below. It would work, but I'm curious whether there is a more elegant solution. :)
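A minimal sketch of that two-partition idea, assuming a DataFrame `df` with a `dated_diff` column (the `is_neg` helper column and window names here are illustrative, not from the original post):

```python
from pyspark.sql.functions import col, row_number, when
from pyspark.sql.window import Window

# Split rows into two partitions by sign.
df = df.withColumn("is_neg", (col("dated_diff") < 0).cast("int"))

# Negatives: number them descending so the value closest to zero gets 1.
neg_window = Window.partitionBy("is_neg").orderBy(col("dated_diff").desc())
# Non-negatives: number them ascending so the value closest to zero gets 1.
pos_window = Window.partitionBy("is_neg").orderBy(col("dated_diff").asc())

df = df.withColumn(
    "rank",
    when(col("is_neg") == 1, -row_number().over(neg_window))  # -1, -2, -3, ...
    .otherwise(row_number().over(pos_window) - 1),            # 0, 1, 2, ...
).drop("is_neg")
```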
CodePudding user response:
You can use a Window to generate sequential row numbers to use as ranks:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(10,), (-10,), (5,), (-5,), (15,), (-15,)], ["dated_diff"])

# Number every row in ascending order of dated_diff.
window = Window.orderBy(df["dated_diff"])
df = df.select("dated_diff", row_number().over(window).alias("row_number"))
df.show()
```
```
+----------+----------+
|dated_diff|row_number|
+----------+----------+
|       -15|         1|
|       -10|         2|
|        -5|         3|
|         5|         4|
|        10|         5|
|        15|         6|
+----------+----------+
```
Then find the rank of the first non-negative number:
first_positive_rank = df.filter("dated_diff>=0").first()["row_number"]
print(first_positive_rank)
>> 4
OR
first_positive_rank = df.filter("dated_diff<0").count() 1
print(first_positive_rank)
>> 4
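If you prefer a lookup that doesn't depend on row order at all, the same value can be read off with an aggregate; this is a small variation on the above, not part of the original answer:

```python
from pyspark.sql.functions import min as min_

# The rank of the first non-negative value is the smallest
# row_number among the non-negative rows.
first_positive_rank = df.filter("dated_diff >= 0").agg(min_("row_number")).first()[0]
print(first_positive_rank)
# >> 4
```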
And finally, subtract that rank from every row's row number:
```python
# Shift all row numbers so the first non-negative value lands at rank 0.
df = df.withColumn("row_number", col("row_number") - first_positive_rank)
df.show()
```
```
+----------+----------+
|dated_diff|row_number|
+----------+----------+
|       -15|        -3|
|       -10|        -2|
|        -5|        -1|
|         5|         0|
|        10|         1|
|        15|         2|
+----------+----------+
```
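For reuse, the whole answer condenses into one small helper; the function name `rank_from_first_non_negative` is just illustrative:

```python
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

def rank_from_first_non_negative(df, value_col="dated_diff", rank_col="rank"):
    """Rank rows so the smallest non-negative value gets 0,
    negatives get -1, -2, ... and larger values get 1, 2, ..."""
    # Count of negatives + 1 = row_number of the first non-negative value.
    offset = df.filter(f"{value_col} < 0").count() + 1
    window = Window.orderBy(value_col)
    return df.withColumn(rank_col, row_number().over(window) - offset)

rank_from_first_non_negative(spark.createDataFrame(
    [(10,), (-10,), (5,), (-5,), (15,), (-15,)], ["dated_diff"]
)).show()
```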