Home > front end >  Spark: programmatically populate columns from array values
Spark: programmatically populate columns from array values

Time:03-09

I have a column that is a list of identifiers (in this case runways). It could be an array or a comma-separated list, in this example I'm converting it to an array. I'm trying to figure out the idiomatic/programmatic way to update a set of columns based on the contents of said array.

Working example that uses anti-pattern:

        val data = Seq("08L,08R,09")
        val df = data.toDF("runways")
          .withColumn("runway_set", split('runways, ","))
          .withColumn("runway_in_use_08L", when(array_contains('runway_set, "08L"), 1).otherwise(0))
          .withColumn("runway_in_use_26R", when(array_contains('runway_set, "26R"), 1).otherwise(0))
          .withColumn("runway_in_use_08R", when(array_contains('runway_set, "08R"), 1).otherwise(0))
          .withColumn("runway_in_use_26L", when(array_contains('runway_set, "26L"), 1).otherwise(0))
          .withColumn("runway_in_use_09", when(array_contains('runway_set, "09"), 1).otherwise(0))
          .withColumn("runway_in_use_27", when(array_contains('runway_set, "27"), 1).otherwise(0))
          .withColumn("runway_in_use_15L", when(array_contains('runway_set, "15L"), 1).otherwise(0))
          .withColumn("runway_in_use_33R", when(array_contains('runway_set, "33R"), 1).otherwise(0))
          .withColumn("runway_in_use_15R", when(array_contains('runway_set, "15R"), 1).otherwise(0))
          .withColumn("runway_in_use_33L", when(array_contains('runway_set, "33L"), 1).otherwise(0))

This produces essentially one-hot encoded values like so:

 ---------- -------------- ----------------- ----------------- ----------------- ----------------- ---------------- ---------------- ----------------- ----------------- ----------------- ----------------- 
|   runways|    runway_set|runway_in_use_08L|runway_in_use_26R|runway_in_use_08R|runway_in_use_26L|runway_in_use_09|runway_in_use_27|runway_in_use_15L|runway_in_use_33R|runway_in_use_15R|runway_in_use_33L|
 ---------- -------------- ----------------- ----------------- ----------------- ----------------- ---------------- ---------------- ----------------- ----------------- ----------------- ----------------- 
|08L,08R,09|[08L, 08R, 09]|                1|                0|                1|                0|               1|               0|                0|                0|                0|                0|
 ---------- -------------- ----------------- ----------------- ----------------- ----------------- ---------------- ---------------- ----------------- ----------------- ----------------- ----------------- 

Feels like I should be able to take a static sequence of all identifiers and perform some programmatic operation to do all of the above in a loop/map/foreach type of expression, but I am not sure how to formulate it.

E.g.:

val all_runways = Seq("08L","26R","08R","26L","09","27","15L","33R","15R","33L")
// iterate through and update each column, e.g. 'runway_in_use_$i'

Any pointers? Thanks in advance.

CodePudding user response:

Typical use case for fold.

val df = data.toDF("runways")
  .withColumn("runway_set", split('runways, ","))

val df2 = all_runways.foldLeft(df) { (acc, x) =>
  acc.withColumn(s"runway_in_use_$x", when(array_contains('runway_set, x), 1).otherwise(0))
}
  • Related