Adding an index to a dataframe by grouping the data


I have a problem grouping the data and adding an index.

A new column index should be added, counting from 1 up to n (e.g. 5) and then starting again from 1. The value of n can be anything; essentially, after every n records the index should restart at 1.

Original dataframe:

city         id
NYC          101
Jersey City  102
Hoboken      103
Buffalo      104
Philly       105
Edison       106

The output dataframe should look something like this:

city         id   index
NYC          101  1
Jersey City  102  2
Hoboken      103  3
Buffalo      104  4
Philly       105  5
Edison       106  1
Trenton      107  2
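For reference, a minimal sketch that builds this sample data in Spark (Scala), including the Trenton row that appears only in the expected output; the SparkSession variable spark is an assumption:

import spark.implicits._

// hypothetical sample data matching the question
val data = Seq(
  ("NYC", 101), ("Jersey City", 102), ("Hoboken", 103),
  ("Buffalo", 104), ("Philly", 105), ("Edison", 106), ("Trenton", 107)
).toDF("city", "id")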

CodePudding user response:

Try this:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

data
  // every block of five consecutive ids gets the same groupId
  .withColumn("groupId", ceil(col("id") / lit(5)))
  .withColumn("index", row_number() over Window.partitionBy("groupId").orderBy("id"))
  .drop(col("groupId"))

Output (tested):

+-----------+---+-----+
|       city| id|index|
+-----------+---+-----+
|        NYC|101|    1|
|Jersey City|102|    2|
|    Hoboken|103|    3|
|    Buffalo|104|    4|
|     Philly|105|    5|
|     Edison|106|    1|
|    Trenton|107|    2|
+-----------+---+-----+
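Note that ceil(id / 5) groups correctly here because the ids are consecutive and 101 starts a new block of five (101 through 105 all round up to the same value). If the ids had gaps, the counter could instead be based purely on row position. A minimal sketch of that variant, assuming the same data DataFrame and an id column to order by:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// number all rows by id, then wrap the counter around every 5 rows
val w = Window.orderBy("id")
val result = data.withColumn("index", ((row_number().over(w) - 1) % 5) + 1)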

CodePudding user response:

Have you tried ntile in Spark? It makes sure the ids are placed into 5 buckets, ranked 1 to 5.

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.expressions.*;

WindowSpec window = Window.orderBy(col("id").asc());
dataset.withColumn("index", ntile(5).over(window)).show();