Is there a way to apply dense_rank in a PySpark DataFrame, but when there is a tie, rank the tied rows by their first appearance?
Similar to pandas rank(method='first').
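For reference, a minimal sketch of the pandas behaviour in question, using the example dataframe from the pd.rank documentation:

import numpy as np
import pandas as pd

pdf = pd.DataFrame({"Animal": ["cat", "penguin", "dog", "spider", "snake"],
                    "Number_legs": [4, 2, 4, 8, np.nan]})
# method='first' breaks ties by order of appearance: cat (row 0) ranks before dog (row 2)
pdf["rank"] = pdf["Number_legs"].rank(method="first")
# Expected ranks: penguin 1, cat 2, dog 3, spider 4, snake NaN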
CodePudding user response:
The distributed nature of Spark prevents implicitly identifying the order of appearance. If your input dataset contains a column such as line_number or row_number, then the equivalent of rank(method='first') can be achieved.
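If no such column exists but the read order of the source happens to be the order of appearance you care about (an assumption that only holds for some sources, e.g. a single file read without shuffling), one way to materialise an index is zipWithIndex on the underlying RDD. A minimal sketch; the variable and column names here are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input with no ordering column; assume its current row order
# reflects the order of appearance we want to preserve
raw = spark.createDataFrame(
    [("cat", 4), ("penguin", 2), ("dog", 4), ("spider", 8)],
    ["Animal", "Number_legs"],
)

# zipWithIndex keeps the RDD's element order, so the index records that order
with_line_number = (
    raw.rdd.zipWithIndex()
       .map(lambda pair: (*pair[0], pair[1] + 1))  # append a 1-based Line_Number
       .toDF(raw.columns + ["Line_Number"])
)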
Working Example
The following example uses the dataframe from the pd.rank documentation, with a Line_Number field added to provide explicit ordering. Because ties are broken by Line_Number, row_number over the window produces the same result as rank(method='first').
The dataframe is repartitioned to simulate random ordering after reading data.
import pyspark.sql.functions as F
from pyspark.sql import SparkSession, Window

spark = SparkSession.builder.getOrCreate()  # reuses an existing session in spark-shell / notebooks

data = [{"Line_Number": 1, "Animal": "cat", "Number_legs": 4}, {"Line_Number": 2, "Animal": "penguin", "Number_legs": 2},
        {"Line_Number": 3, "Animal": "dog", "Number_legs": 4}, {"Line_Number": 4, "Animal": "spider", "Number_legs": 8},
        {"Line_Number": 5, "Animal": "snake", "Number_legs": None}]
# Repartition to simulate the arbitrary row order you get after reading distributed data
df = spark.createDataFrame(data).repartition(8)

# Order by the ranking column, then break ties by order of appearance; nulls sort last
window_spec = Window.orderBy(F.col("Number_legs").asc_nulls_last(), F.col("Line_Number"))
# row_number over this window reproduces rank(method='first'); null inputs keep a null rank
df.withColumn("rank", F.when(F.col("Number_legs").isNull(), F.lit(None)).otherwise(F.row_number().over(window_spec))).show()
Output
+-------+-----------+-----------+----+
| Animal|Line_Number|Number_legs|rank|
+-------+-----------+-----------+----+
|penguin|          2|          2|   1|
|    cat|          1|          4|   2|
|    dog|          3|          4|   3|
| spider|          4|          8|   4|
|  snake|          5|       null|null|
+-------+-----------+-----------+----+
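For contrast, a plain dense_rank over the same ordering, without the Line_Number tie-breaker, would give cat and dog the same rank, which is why row_number with an explicit tie-breaker is used above. A quick sketch on the same df:

# Ties keep equal ranks here: penguin 1, cat 2, dog 2, spider 3, snake 4
df.withColumn(
    "dense",
    F.dense_rank().over(Window.orderBy(F.col("Number_legs").asc_nulls_last()))
).show()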