I have added a new column seq_col containing a unique sequence using:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val df2 = dfFromRDD1.withColumn("monotonically_increasing_id", monotonically_increasing_id())
val window = Window.orderBy(col("monotonically_increasing_id"))
val df3_consecutiveval = df2.withColumn("seq_col", row_number().over(window)).drop("monotonically_increasing_id")
df3_consecutiveval.show()
dataframe:
col1 col2 seq_col
a    aa   1
b    ff   2
c    rr   3
d    yy   4
e    tt   5
Now I want to populate that new column with values generated from a given start value and increment, as in the example below.
Ex: start = 100, increment = 3
dataframe:
col1 col2 seq_col
a    aa   100
b    ff   103
c    rr   106
d    yy   109
e    tt   112
CodePudding user response:
You can define a udf that computes the id from the row number with the given start and increment, for instance:
val step = 3          // increment 3 by 3
val startOffset = 100 // you want it to start at 100
// row_number() starts at 1, so subtract 1 to make the first value exactly startOffset
val calculateId = udf((rowNum: Int) => startOffset + (rowNum - 1) * step)
df.withColumn("seq_col", calculateId(row_number().over(window)))
This worked for me using some random dataframe.
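For completeness, here is a minimal runnable sketch of this approach, put together from the snippets above; the temporary column name id_tmp is just an illustrative choice, and in spark-shell the spark.implicits._ import is already in scope:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF; already available in spark-shell

val df = Seq(("a", "aa"), ("b", "ff"), ("c", "rr"), ("d", "yy"), ("e", "tt")).toDF("col1", "col2")

val step = 3
val startOffset = 100
val calculateId = udf((rowNum: Int) => startOffset + (rowNum - 1) * step)

// materialise monotonically_increasing_id, as in the question, to keep the original row order
val window = Window.orderBy(col("id_tmp"))
df.withColumn("id_tmp", monotonically_increasing_id())
  .withColumn("seq_col", calculateId(row_number().over(window)))
  .drop("id_tmp")
  .show()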
CodePudding user response:
The above answer is technically correct, but you should avoid using udfs whenever possible for performance reasons. This case is so simple that basic arithmetic will do the trick:
scala> val df = Seq(("a", "aa"), ("b", "ff"), ("c", "rr"), ("d", "yy"), ("e", "tt")).toDF("col1", "col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala> val start = 100
start: Int = 100
scala> val increment = 3
increment: Int = 3
scala> df.withColumn("seq_col", monotonically_increasing_id() * increment + start).show
+----+----+-------+
|col1|col2|seq_col|
+----+----+-------+
|   a|  aa|    100|
|   b|  ff|    103|
|   c|  rr|    106|
|   d|  yy|    109|
|   e|  tt|    112|
+----+----+-------+
scala>
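One caveat on the arithmetic approach: monotonically_increasing_id() only guarantees increasing, unique values, not consecutive ones, so the exact 100, 103, 106, ... pattern holds only while the data sits in a single partition. If the dataframe has several partitions, you can keep the row_number()-over-window trick from the question and still avoid a udf; a sketch, reusing start and increment from above and an assumed temporary column id_tmp:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// order the window by a materialised monotonically_increasing_id to preserve the original row order
val window = Window.orderBy(col("id_tmp"))
df.withColumn("id_tmp", monotonically_increasing_id())
  .withColumn("seq_col", (row_number().over(window) - 1) * increment + start)
  .drop("id_tmp")
  .show()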