I have an input as below:
id | size |
---|---|
1 | 4 |
2 | 2 |
Expected output: if the value in the `size` column is 4, split the row into 4 rows numbered 1 to 4; if it is 2, split it into 2 rows numbered 1 to 2.
id | size |
---|---|
1 | 1 |
1 | 2 |
1 | 3 |
1 | 4 |
2 | 1 |
2 | 2 |
CodePudding user response:
You can create an array from 1 to `size` using the `sequence` function and then explode it:
import org.apache.spark.sql.functions._
val df = Seq((1,4), (2,2)).toDF("id", "size")
df
  .withColumn("size", explode(sequence(lit(1), col("size"))))
  .show(false)
The output would be:
+---+----+
|id |size|
+---+----+
|1  |1   |
|1  |2   |
|1  |3   |
|1  |4   |
|2  |1   |
|2  |2   |
+---+----+
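If you prefer the SQL-expression form, the same `sequence`/`explode` pair can be written as a single `expr` string. This is a sketch assuming the same `df` as above; like the `sequence` function itself, it requires Spark 2.4+:

```scala
import org.apache.spark.sql.functions.expr

// Same transformation, expressed as one SQL expression string.
df.withColumn("size", expr("explode(sequence(1, size))"))
  .show(false)
```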
CodePudding user response:
You could turn your `size` column into an incrementing sequence using `Seq.range` and then explode the arrays. Something like this:
import spark.implicits._
import org.apache.spark.sql.functions.{explode, col}
// Original dataframe
val df = Seq((1,4), (2,2)).toDF("id", "size")
// Mapping over this dataframe: turning each row into (idx, array)
val df_with_array = df
  .map(row => (row.getInt(0), Seq.range(1, row.getInt(1) + 1)))
  .toDF("id", "array")
df_with_array.show()
+---+------------+
| id|       array|
+---+------------+
|  1|[1, 2, 3, 4]|
|  2|      [1, 2]|
+---+------------+
// Finally selecting the wanted columns, exploding the array column
val output = df_with_array.select(col("id"), explode(col("array")))
output.show()
+---+---+
| id|col|
+---+---+
|  1|  1|
|  1|  2|
|  1|  3|
|  1|  4|
|  2|  1|
|  2|  2|
+---+---+
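The map-then-explode logic above can be sketched in plain Scala (no Spark) to make the expansion concrete; `rows` here is a hypothetical stand-in for the dataframe's contents:

```scala
// Expand each (id, size) pair into (id, 1) .. (id, size),
// mirroring the Seq.range + explode steps above.
val rows = Seq((1, 4), (2, 2))
val exploded = rows.flatMap { case (id, size) =>
  Seq.range(1, size + 1).map(n => (id, n))
}
// exploded: Seq((1,1), (1,2), (1,3), (1,4), (2,1), (2,2))
```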
CodePudding user response:
You can first use the `sequence` function to create a sequence from 1 to `size`, and then explode it.
import spark.implicits._
import org.apache.spark.sql.functions._

val df = input.withColumn("seq", sequence(lit(1), $"size"))
df.show()
+---+----+------------+
| id|size|         seq|
+---+----+------------+
|  1|   4|[1, 2, 3, 4]|
|  2|   2|      [1, 2]|
+---+----+------------+
df.withColumn("size", explode($"seq")).drop("seq").show()
+---+----+
| id|size|
+---+----+
|  1|   1|
|  1|   2|
|  1|   3|
|  1|   4|
|  2|   1|
|  2|   2|
+---+----+