pyspark - Dense-Rank ties method first

Time:11-12

Is there a way to apply a dense rank to a PySpark DataFrame so that, when values tie, the tied rows are ranked by order of first appearance?

This would be similar to pandas rank(method='first').
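For reference, here is a minimal sketch of the pandas behaviour being asked about. The animal data mirrors the example in the pandas rank documentation, and the column name first_rank is only an illustrative choice.

```python
import pandas as pd

# Small animal table; None becomes NaN in the numeric column.
pdf = pd.DataFrame({
    "Animal": ["cat", "penguin", "dog", "spider", "snake"],
    "Number_legs": [4, 2, 4, 8, None],
})

# method="first" breaks ties by order of appearance in the frame.
# With the default na_option="keep", NaN values stay unranked (NaN).
pdf["first_rank"] = pdf["Number_legs"].rank(method="first")
print(pdf)
```

Here cat and dog both have 4 legs, and cat receives the lower rank only because it appears earlier in the frame.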

CodePudding user response:

Because Spark is distributed, it has no implicit notion of the order in which rows appear. If your input dataset contains a column such as line_number or row_number, the behavior of rank(method='first') can be reproduced by using that column as the tie-breaker.
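The idea can be illustrated without Spark. The plain-Python sketch below uses hypothetical rows and a hypothetical helper named rank_first. It sorts by the value and then by an explicit line number, so the resulting ranks do not depend on the order in which rows happen to arrive.

```python
import random

# Hypothetical rows: (line_number, animal, number_legs); None means missing.
rows = [
    (1, "cat", 4),
    (2, "penguin", 2),
    (3, "dog", 4),
    (4, "spider", 8),
    (5, "snake", None),
]

def rank_first(rows):
    """Rank by number_legs, breaking ties by line_number; missing values get None."""
    present = [r for r in rows if r[2] is not None]
    # The explicit line number is the tie-breaker, not the arrival order.
    ordered = sorted(present, key=lambda r: (r[2], r[0]))
    ranks = {r[1]: i for i, r in enumerate(ordered, start=1)}
    for r in rows:
        ranks.setdefault(r[1], None)  # rows with a missing value stay unranked
    return ranks

shuffled = rows[:]
random.shuffle(shuffled)  # mimic rows arriving in arbitrary order

print(rank_first(shuffled))
```

Without the line number in the sort key, the relative rank of cat and dog would depend on whichever row was processed first.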

Working Example

The following example uses the DataFrame from the pandas rank documentation, with an added Line_Number field to provide explicit ordering.

The dataframe is repartitioned to simulate random ordering after reading data.

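import pyspark.sql.functions as F
from pyspark.sql import SparkSession, Window

# Reuse the active session if one exists (e.g. in a notebook or the pyspark shell).
spark = SparkSession.builder.getOrCreate()

data = [{"Line_Number": 1, "Animal": "cat", "Number_legs": 4}, {"Line_Number": 2, "Animal": "penguin", "Number_legs": 2},
        {"Line_Number": 3, "Animal": "dog", "Number_legs": 4}, {"Line_Number": 4, "Animal": "spider", "Number_legs": 8},
        {"Line_Number": 5, "Animal": "snake", "Number_legs": None}]

# Repartition to simulate arbitrary row order after reading the data.
df = spark.createDataFrame(data).repartition(8)

# Order by value, then by Line_Number so that ties are broken by first appearance.
# Note: a window without partitionBy moves all rows to a single partition.
window_spec = Window.orderBy(F.col("Number_legs").asc_nulls_last(), F.col("Line_Number"))

# row_number() assigns consecutive ranks; rows with a null value are left unranked.
df.withColumn(
    "rank",
    F.when(F.col("Number_legs").isNull(), F.lit(None)).otherwise(F.row_number().over(window_spec)),
).show()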

Output

+-------+-----------+-----------+----+
| Animal|Line_Number|Number_legs|rank|
+-------+-----------+-----------+----+
|penguin|          2|          2|   1|
|    cat|          1|          4|   2|
|    dog|          3|          4|   3|
| spider|          4|          8|   4|
|  snake|          5|       null|null|
+-------+-----------+-----------+----+