Home > Software engineering >  Create a Dataframe based on ranges of other Dataframe
Create a Dataframe based on ranges of other Dataframe

Time:12-03

I have a Spark Dataframe containing ranges of numbers (column start and column end), and a column containing the type of this range. I want to create a new Dataframe with two columns, the first one lists all ranges (incremented by one), and the second one lists the range's type. To explain more, this is the input Dataframe :

 ------- ------ --------- 
| start |  end |    type |
 ------- ------ --------- 
|  10   |  20  | LOW     |
|  21   |  30  | MEDIUM  |
|  31   |  40  | HIGH    |
 ------- ------ --------- 

And this is the desired result :

 ------- --------- 
| nbr   |    type |
 ------- --------- 
|  10   | LOW     |
|  11   | LOW     |
|  12   | LOW     |
|  13   | LOW     |
|  14   | LOW     |
|  15   | LOW     |
|  16   | LOW     |
|  17   | LOW     |
|  18   | LOW     |
|  19   | LOW     |
|  20   | LOW     |
|  21   | MEDIUM  |
|  22   | MEDIUM  |
|   ..  | ...     |
 ------- --------- 

Any ideas ?

CodePudding user response:

Try this.

val data = List((10, 20, "Low"), (21, 30, "MEDIUM"), (31, 40, "High"))

  import spark.implicits._
  val df = data.toDF("start", "end", "type") 
  
  df.withColumn("nbr", explode(sequence($"start", $"end"))).drop("start","end").show(false)

output:

 ------ --- 
|type  |nbr|
 ------ --- 
|Low   |10 |
|Low   |11 |
|Low   |12 |
|Low   |13 |
|Low   |14 |
|Low   |15 |
|Low   |16 |
|Low   |17 |
|Low   |18 |
|Low   |19 |
|Low   |20 |
|MEDIUM|21 |
|MEDIUM|22 |
|MEDIUM|23 |
|MEDIUM|24 |
|MEDIUM|25 |
|MEDIUM|26 |
|MEDIUM|27 |
|MEDIUM|28 |
|MEDIUM|29 |
 ------ --- 
only showing top 20 rows

CodePudding user response:

The solution provided by @Learn-Hadoop works if you're on Spark 2.4 .

For older Spark version, consider creating a simple UDF to mimic the sequence function:

val sequence = udf{ (lower: Int, upper: Int) =>
Seq.iterate(lower, upper - lower   1)(_   1)
}
 
df.withColumn("nbr",explode(sequence($"start",$"end"))).drop("start","end").show(false)
  • Related