Adding an index to a dataframe by grouping the data


I have a problem grouping the data and adding an index.

A new column index should be added, counting from 1 up to n (e.g. 5) and then starting again from 1. The value of n can be anything; essentially, after every n records the index should restart at 1.

Original dataframe:

city         id
NYC          101
Jersey City  102
Hoboken      103
Buffalo      104
Philly       105
Edison       106

The output dataframe should look something like this:

city         id   index
NYC          101  1
Jersey City  102  2
Hoboken      103  3
Buffalo      104  4
Philly       105  5
Edison       106  1
Trenton      107  2
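For reference, a minimal sketch that builds this sample data in Spark (Scala), including the Trenton row that appears only in the expected output; the SparkSession variable spark is an assumption:

import spark.implicits._

// hypothetical sample data matching the question
val data = Seq(
  ("NYC", 101), ("Jersey City", 102), ("Hoboken", 103),
  ("Buffalo", 104), ("Philly", 105), ("Edison", 106), ("Trenton", 107)
).toDF("city", "id")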

CodePudding user response:

Try this:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

data
  // every block of five consecutive ids gets the same groupId
  .withColumn("groupId", ceil(col("id") / lit(5)))
  .withColumn("index", row_number() over Window.partitionBy("groupId").orderBy("id"))
  .drop(col("groupId"))

Output (tested):

+-----------+---+-----+
|       city| id|index|
+-----------+---+-----+
|        NYC|101|    1|
|Jersey City|102|    2|
|    Hoboken|103|    3|
|    Buffalo|104|    4|
|     Philly|105|    5|
|     Edison|106|    1|
|    Trenton|107|    2|
+-----------+---+-----+
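Note that ceil(id / 5) groups correctly here because the ids are consecutive and 101 starts a new block of five (101 through 105 all round up to the same value). If the ids had gaps, the counter could instead be based purely on row position. A minimal sketch of that variant, assuming the same data DataFrame and an id column to order by:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// number all rows by id, then wrap the counter around every 5 rows
val w = Window.orderBy("id")
val result = data.withColumn("index", ((row_number().over(w) - 1) % 5) + 1)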

CodePudding user response:

Have you tried ntile in Spark? It makes sure the ids are placed into 5 buckets, ranked 1 to 5.

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.expressions.*;

WindowSpec window = Window.orderBy(col("id").asc());
dataset.withColumn("index", ntile(5).over(window)).show();