I have a Spark DataFrame containing ranges of numbers (a start column and an end column) and a column containing the type of each range. I want to create a new DataFrame with two columns: the first lists every number in each range (incremented by one), and the second lists the range's type. To illustrate, this is the input DataFrame:
+-------+-----+--------+
| start | end | type   |
+-------+-----+--------+
| 10    | 20  | LOW    |
| 21    | 30  | MEDIUM |
| 31    | 40  | HIGH   |
+-------+-----+--------+
And this is the desired result:
+-----+--------+
| nbr | type   |
+-----+--------+
| 10  | LOW    |
| 11  | LOW    |
| 12  | LOW    |
| 13  | LOW    |
| 14  | LOW    |
| 15  | LOW    |
| 16  | LOW    |
| 17  | LOW    |
| 18  | LOW    |
| 19  | LOW    |
| 20  | LOW    |
| 21  | MEDIUM |
| 22  | MEDIUM |
| ..  | ...    |
+-----+--------+
Any ideas?
CodePudding user response:
Try this (it uses the built-in sequence function, available since Spark 2.4):

import org.apache.spark.sql.functions.{explode, sequence}
import spark.implicits._

val data = List((10, 20, "Low"), (21, 30, "MEDIUM"), (31, 40, "High"))
val df = data.toDF("start", "end", "type")

df.withColumn("nbr", explode(sequence($"start", $"end")))
  .drop("start", "end")
  .show(false)
output:
+------+---+
|type  |nbr|
+------+---+
|Low   |10 |
|Low   |11 |
|Low   |12 |
|Low   |13 |
|Low   |14 |
|Low   |15 |
|Low   |16 |
|Low   |17 |
|Low   |18 |
|Low   |19 |
|Low   |20 |
|MEDIUM|21 |
|MEDIUM|22 |
|MEDIUM|23 |
|MEDIUM|24 |
|MEDIUM|25 |
|MEDIUM|26 |
|MEDIUM|27 |
|MEDIUM|28 |
|MEDIUM|29 |
+------+---+
only showing top 20 rows
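As a quick sanity check, the expansion itself can be verified in plain Scala without a Spark session (a sketch assuming the same three input ranges; the flatMap mirrors what explode(sequence(...)) does row by row):

```scala
// Expand each (start, end, type) row into one (nbr, type) pair per number,
// the same way explode(sequence($"start", $"end")) does in Spark.
val data = List((10, 20, "Low"), (21, 30, "MEDIUM"), (31, 40, "High"))
val expanded = data.flatMap { case (start, end, tpe) =>
  (start to end).map(n => (n, tpe))
}

println(expanded.size)    // 31 rows: 11 + 10 + 10
println(expanded.take(3)) // List((10,Low), (11,Low), (12,Low))
```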
CodePudding user response:
The solution provided by @Learn-Hadoop works if you're on Spark 2.4+.
For older Spark versions, consider creating a simple UDF to mimic the sequence function:

import org.apache.spark.sql.functions.{explode, udf}
import spark.implicits._

val sequence = udf { (lower: Int, upper: Int) =>
  Seq.iterate(lower, upper - lower + 1)(_ + 1)
}

df.withColumn("nbr", explode(sequence($"start", $"end")))
  .drop("start", "end")
  .show(false)
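The UDF body is plain Scala, so it can be checked without Spark (a minimal sketch; Seq.iterate(start, count)(step) produces count values, so upper - lower + 1 yields an inclusive range):

```scala
// Same logic as the UDF body above, exposed as an ordinary function for testing.
// The name mimicSequence is just for illustration here.
def mimicSequence(lower: Int, upper: Int): Seq[Int] =
  Seq.iterate(lower, upper - lower + 1)(_ + 1)

println(mimicSequence(10, 20)) // the inclusive range 10, 11, ..., 20
```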