I have the following data frame in pyspark:
>>> df.show()
+----------+------+
|  date_col|counts|
+----------+------+
|2022-02-05|350647|
|2022-02-06|313091|
+----------+------+
I want to create a resultant data frame that ranks date_col in descending order, so that the most recent date gets rank 1:
>>> df.show()
+----------+------+---------+
|  date_col|counts|order_col|
+----------+------+---------+
|2022-02-05|350647|        2|
|2022-02-06|313091|        1|
+----------+------+---------+
How can we achieve this?
The following script can be used to create dataframe df:
from datetime import datetime, date
from pyspark.sql import Row
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    Row(date_col=date(2022, 2, 5), counts=350647),
    Row(date_col=date(2022, 2, 6), counts=313091),
])
df.show()
CodePudding user response:
You can easily do this with rank(), in conjunction with Window.
Data Preparation
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pyspark.sql.functions as F
from pyspark.sql import Window
import pandas as pd

sc = SparkContext.getOrCreate()
sql = SQLContext(sc)
d = {
    'date_col': ['2022-02-05', '2022-02-06', '2022-02-07', '2022-02-08'],
    'counts': [350647, 313091, 317791, 312145],
}
sparkDF = sql.createDataFrame(pd.DataFrame(d))
sparkDF.show()
+----------+------+
|  date_col|counts|
+----------+------+
|2022-02-05|350647|
|2022-02-06|313091|
|2022-02-07|317791|
|2022-02-08|312145|
+----------+------+
Rank
# Order the window by date descending, so the most recent date gets rank 1
window = Window.orderBy(F.col('date_col').desc())
sparkDF = sparkDF.withColumn('order_col', F.rank().over(window))
sparkDF.show()
+----------+------+---------+
|  date_col|counts|order_col|
+----------+------+---------+
|2022-02-08|312145|        1|
|2022-02-07|317791|        2|
|2022-02-06|313091|        3|
|2022-02-05|350647|        4|
+----------+------+---------+
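Note that RANK() assigns tied rows the same rank and then skips ahead (unlike DENSE_RANK() or ROW_NUMBER()). As a plain-Python sketch of the same descending-date ranking, just to illustrate the semantics outside Spark (the helper name rank_desc is made up for this example):

```python
from datetime import date

rows = [
    (date(2022, 2, 5), 350647),
    (date(2022, 2, 6), 313091),
    (date(2022, 2, 7), 317791),
    (date(2022, 2, 8), 312145),
]

def rank_desc(rows):
    """SQL-style RANK() over dates in descending order:
    tied dates share a rank, and the next rank skips ahead."""
    ordered = sorted(rows, key=lambda r: r[0], reverse=True)
    ranked, prev_date, current_rank = [], None, 0
    for i, (d, c) in enumerate(ordered, start=1):
        if d != prev_date:
            current_rank = i  # gap-preserving, like RANK()
        ranked.append((d, c, current_rank))
        prev_date = d
    return ranked

for d, c, r in rank_desc(rows):
    print(d, c, r)
```

Running this prints the dates from 2022-02-08 (rank 1) down to 2022-02-05 (rank 4), matching the Spark output above.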
Rank - SparkSQL
sparkDF.createOrReplaceTempView('TB1')

sql.sql(
"""
SELECT
date_col
,counts
,RANK() OVER( ORDER BY date_col DESC) as order_col
FROM TB1
"""
).show()
+----------+------+---------+
|  date_col|counts|order_col|
+----------+------+---------+
|2022-02-08|312145|        1|
|2022-02-07|317791|        2|
|2022-02-06|313091|        3|
|2022-02-05|350647|        4|
+----------+------+---------+