Spark-Scala : Create split rows based on the value of other column

Time:01-06

I have an input as below:

id size
1 4
2 2

Expected output: each row should be split into as many rows as the value in its size column, numbered from 1 up to that value. A size of 4 gives four rows (1 to 4), and a size of 2 gives two rows (1 to 2).

id size
1 1
1 2
1 3
1 4
2 1
2 2

CodePudding user response:

You can build an array holding the sequence from 1 to size with the sequence function and then explode it:

import org.apache.spark.sql.functions._
import spark.implicits._ // needed for toDF outside spark-shell
val df = Seq((1,4), (2,2)).toDF("id", "size")
df
  .withColumn("size", explode(sequence(lit(1), col("size"))))
  .show(false)

The output would be:

+---+----+
|id |size|
+---+----+
|1  |1   |
|1  |2   |
|1  |3   |
|1  |4   |
|2  |1   |
|2  |2   |
+---+----+

CodePudding user response:

You could turn your size column into an incrementing sequence using Seq.range and then explode the arrays. Something like this:

import spark.implicits._
import org.apache.spark.sql.functions.{explode, col}

// Original dataframe
val df = Seq((1,4), (2,2)).toDF("id", "size")

// Map over the DataFrame, turning each row into (id, array)
val df_with_array = df
  .map(row => {
    // Seq.range excludes its upper bound, so add 1 to include size itself
    (row.getInt(0), Seq.range(1, row.getInt(1) + 1))
  }).toDF("id", "array")

df_with_array.show()
+---+------------+
| id|       array|
+---+------------+
|  1|[1, 2, 3, 4]|
|  2|      [1, 2]|
+---+------------+

// Finally selecting the wanted columns, exploding the array column
val output = df_with_array.select(col("id"), explode(col("array")))

output.show()
+---+---+
| id|col|
+---+---+
|  1|  1|
|  1|  2|
|  1|  3|
|  1|  4|
|  2|  1|
|  2|  2|
+---+---+
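
To see the expansion logic in isolation, here is a minimal sketch using plain Scala collections instead of a DataFrame. The object name ExpandRows and the helper expand are made up for illustration and are not part of Spark; the sketch only assumes that each row is an (id, size) pair of integers.

```scala
// Plain Scala collections, no Spark required: the same row expansion
// applied to an in-memory list of (id, size) pairs.
object ExpandRows {
  // Hypothetical helper: for each (id, size) pair,
  // emit (id, 1), (id, 2), ..., (id, size).
  def expand(rows: Seq[(Int, Int)]): Seq[(Int, Int)] =
    rows.flatMap { case (id, size) =>
      // Seq.range excludes its upper bound, hence size + 1.
      Seq.range(1, size + 1).map(i => (id, i))
    }

  def main(args: Array[String]): Unit =
    // Print one "id n" pair per line for the question's sample input.
    expand(Seq((1, 4), (2, 2))).foreach { case (id, n) => println(s"$id $n") }
}
```

Because Seq.range(1, size + 1) is empty when size is below 1, such rows simply produce no output in this sketch.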

CodePudding user response:

You can first use the sequence function to create a sequence from 1 to size, and then explode it.

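import org.apache.spark.sql.functions.{explode, lit, sequence}
import spark.implicits._ // enables the $"col" syntax

// input is the question's DataFrame with columns "id" and "size"
val df = input.withColumn("seq", sequence(lit(1), $"size"))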
df.show()
+---+----+------------+
| id|size|         seq|
+---+----+------------+
|  1|   4|[1, 2, 3, 4]|
|  2|   2|      [1, 2]|
+---+----+------------+

df.withColumn("size", explode($"seq")).drop("seq").show()
+---+----+
| id|size|
+---+----+
|  1|   1|
|  1|   2|
|  1|   3|
|  1|   4|
|  2|   1|
|  2|   2|
+---+----+