I have added a new column seq_col containing a unique sequence using:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val df2 = dfFromRDD1.withColumn("monotonically_increasing_id", monotonically_increasing_id())
val window = Window.orderBy(col("monotonically_increasing_id"))
val df3_consecutiveval = df2.withColumn("seq_col", row_number().over(window)).drop("monotonically_increasing_id")
df3_consecutiveval.show()
dataframe:
col1 col2 seq_col
a    aa   1
b    ff   2
c    rr   3
d    yy   4
e    tt   5
Now I want to populate that new column with values generated from a given start value and increment, as in the example below.
Ex: start = 100, increment = 3
dataframe:
col1 col2 seq_col
a    aa   100
b    ff   103
c    rr   106
d    yy   109
e    tt   112
CodePudding user response:
You can define a udf that computes the id from the row number with the given start and increment, for instance:
val step = 3          // increment 3 by 3
val startOffset = 100 // you want it to start at 100
// row_number() starts at 1, so subtract 1 to make the first value exactly startOffset
val calculateId = udf((rowNum: Int) => startOffset + (rowNum - 1) * step)
df.withColumn("seq_col", calculateId(row_number().over(window)))
This worked for me using some random dataframe.
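For completeness, here is a minimal runnable sketch of this approach, put together from the snippets above; the temporary column name id_tmp is just an illustrative choice, and in spark-shell the spark.implicits._ import is already in scope:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF; already available in spark-shell

val df = Seq(("a", "aa"), ("b", "ff"), ("c", "rr"), ("d", "yy"), ("e", "tt")).toDF("col1", "col2")

val step = 3
val startOffset = 100
val calculateId = udf((rowNum: Int) => startOffset + (rowNum - 1) * step)

// materialise monotonically_increasing_id, as in the question, to keep the original row order
val window = Window.orderBy(col("id_tmp"))
df.withColumn("id_tmp", monotonically_increasing_id())
  .withColumn("seq_col", calculateId(row_number().over(window)))
  .drop("id_tmp")
  .show()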
CodePudding user response:
The above answer is technically correct, but you should avoid using udfs whenever possible for performance reasons. This case is so simple that basic arithmetic will do the trick:
scala> val df = Seq(("a", "aa"), ("b", "ff"), ("c", "rr"), ("d", "yy"), ("e", "tt")).toDF("col1", "col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala> val start = 100
start: Int = 100
scala> val increment = 3
increment: Int = 3
scala> df.withColumn("seq_col", monotonically_increasing_id() * increment + start).show
+----+----+-------+
|col1|col2|seq_col|
+----+----+-------+
|   a|  aa|    100|
|   b|  ff|    103|
|   c|  rr|    106|
|   d|  yy|    109|
|   e|  tt|    112|
+----+----+-------+
scala>
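One caveat on the arithmetic approach: monotonically_increasing_id() only guarantees increasing, unique values, not consecutive ones, so the exact 100, 103, 106, ... pattern holds only while the data sits in a single partition. If the dataframe has several partitions, you can keep the row_number()-over-window trick from the question and still avoid a udf; a sketch, reusing start and increment from above and an assumed temporary column id_tmp:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// order the window by a materialised monotonically_increasing_id to preserve the original row order
val window = Window.orderBy(col("id_tmp"))
df.withColumn("id_tmp", monotonically_increasing_id())
  .withColumn("seq_col", (row_number().over(window) - 1) * increment + start)
  .drop("id_tmp")
  .show()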