There is a dataframe as below:
+----+
| Cur|
+----+
| HKD|
| HKD|
| HKD|
| USD|
| USD|
| USD|
| USD|
| INR|
| INR|
| INR|
+----+
And my desired output:
+----+----+
| Idx| Cur|
+----+----+
|   1| HKD|
|   2| HKD|
|   3| HKD|
|   1| USD|
|   2| USD|
|   3| USD|
|   4| USD|
|   1| INR|
|   2| INR|
|   3| INR|
+----+----+
I tried several solutions I found online, such as .sort(), .sortlevel(), and .orderBy(), but none of them solved the problem. I hope there is a way to number the rows within each group like this. Thanks.
CodePudding user response:
You can use a window function, row_number(), partitioned by the currency column to achieve your desired output. Note that partitioning shuffles the data, so the groups may come back in a different order than in the input:
from pyspark.sql import Window
import pyspark.sql.functions as F

df = spark.createDataFrame(['HKD', 'HKD', 'HKD', 'USD', 'USD', 'USD', 'USD', 'INR', 'INR', 'INR'], 'string').toDF("Cur")
print("Original Data:")
df.show()

# row_number() numbers the rows within each partition (here: each currency).
# Ordering by the partition column itself is arbitrary, since all values in a
# partition are equal, so the numbering within a group is not deterministic.
print("Result:")
df.withColumn("Idx", F.row_number().over(Window.partitionBy("Cur").orderBy("Cur"))).show()
Output:
Original Data:
+---+
|Cur|
+---+
|HKD|
|HKD|
|HKD|
|USD|
|USD|
|USD|
|USD|
|INR|
|INR|
|INR|
+---+
Result:
+---+---+
|Cur|Idx|
+---+---+
|HKD|  1|
|HKD|  2|
|HKD|  3|
|INR|  1|
|INR|  2|
|INR|  3|
|USD|  1|
|USD|  2|
|USD|  3|
|USD|  4|
+---+---+
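To make the window operation concrete: row_number() over a partition is conceptually a running counter per key. Here is a minimal plain-Python sketch of that idea (the add_group_index helper is purely illustrative, not part of the PySpark API); unlike the Spark version, it preserves the input order of the groups:

```python
from collections import defaultdict

def add_group_index(rows, key):
    """Assign a 1-based index to each row within its group, preserving input order."""
    counters = defaultdict(int)  # per-key running counter, like row_number() per partition
    result = []
    for row in rows:
        counters[row[key]] += 1
        result.append({**row, "Idx": counters[row[key]]})
    return result

data = [{"Cur": c} for c in
        ["HKD", "HKD", "HKD", "USD", "USD", "USD", "USD", "INR", "INR", "INR"]]
for row in add_group_index(data, "Cur"):
    print(row["Idx"], row["Cur"])
```

This prints the groups in their original order (HKD, USD, INR), matching the desired output in the question, whereas the Spark result above reorders the groups during the shuffle.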