I have a problem with grouping the data and adding an index.
A new index column should be added, counting from 1 up to n (e.g. 5) and then starting again from 1. The value of n can be anything; essentially, after every n records the index should restart at 1.
Original dataframe:
city | id |
---|---|
NYC | 101 |
Jersey City | 102 |
Hoboken | 103 |
Buffalo | 104 |
Philly | 105 |
Edison | 106 |
The output dataframe should look something like this:
city | id | index |
---|---|---|
NYC | 101 | 1 |
Jersey City | 102 | 2 |
Hoboken | 103 | 3 |
Buffalo | 104 | 4 |
Philly | 105 | 5 |
Edison | 106 | 1 |
Trenton | 107 | 2 |
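For reference, the sample data can be created like this (a rough sketch in Scala; the dataframe name data and the existing SparkSession spark are just placeholders):

import spark.implicits._

// the extra row (Trenton, 107) appears only in the desired output above, so it is included here
val data = Seq(
  ("NYC", 101), ("Jersey City", 102), ("Hoboken", 103), ("Buffalo", 104),
  ("Philly", 105), ("Edison", 106), ("Trenton", 107)
).toDF("city", "id")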
CodePudding user response:
Try this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{ceil, col, lit, row_number}

data
  .withColumn("groupId", ceil(col("id") / lit(5)))  // consecutive ids 101-105 -> 21, 106-110 -> 22, ...
  .withColumn("index", row_number().over(Window.partitionBy("groupId").orderBy("id")))
  .drop("groupId")
Output (tested):
+-----------+---+-----+
|       city| id|index|
+-----------+---+-----+
|        NYC|101|    1|
|Jersey City|102|    2|
|    Hoboken|103|    3|
|    Buffalo|104|    4|
|     Philly|105|    5|
|     Edison|106|    1|
|    Trenton|107|    2|
+-----------+---+-----+
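A note on the design choice: ceil(col("id") / lit(5)) only forms blocks of 5 here because the ids are consecutive without gaps. If that cannot be assumed, a sketch of a variant (using the same data dataframe) is to number all rows once and derive the restarting index with a modulo:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// global row number, then 1..5 repeating; ordering the whole dataset by id pulls everything
// into a single partition, which is fine for small data but worth knowing for large datasets
val w = Window.orderBy("id")
data
  .withColumn("rn", row_number().over(w))
  .withColumn("index", (col("rn") - 1) % 5 + 1)
  .drop("rn")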
CodePudding user response:
Have you tried ntile in Spark? It will make sure the ids are always placed into 5 buckets, ranked 1 to 5.
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.expressions.*;
WindowSpec window = Window.orderBy(col("id").asc());
dataset.withColumn("index", ntile(5).over(window)).show();
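For reference, the same idea in Scala (a minimal sketch, assuming the question's dataframe is called data). Note that ntile(5) splits the ordered rows into 5 buckets of roughly equal size over the whole dataset, so the bucket number is assigned per bucket rather than strictly restarting after every 5 rows:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, ntile}

val window = Window.orderBy(col("id").asc)
data.withColumn("index", ntile(5).over(window)).show()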