I have a column that is a list of identifiers (in this case runways). It could be an array or a comma-separated list, in this example I'm converting it to an array. I'm trying to figure out the idiomatic/programmatic way to update a set of columns based on the contents of said array.
Working example that uses anti-pattern:
val data = Seq("08L,08R,09")
val df = data.toDF("runways")
.withColumn("runway_set", split('runways, ","))
.withColumn("runway_in_use_08L", when(array_contains('runway_set, "08L"), 1).otherwise(0))
.withColumn("runway_in_use_26R", when(array_contains('runway_set, "26R"), 1).otherwise(0))
.withColumn("runway_in_use_08R", when(array_contains('runway_set, "08R"), 1).otherwise(0))
.withColumn("runway_in_use_26L", when(array_contains('runway_set, "26L"), 1).otherwise(0))
.withColumn("runway_in_use_09", when(array_contains('runway_set, "09"), 1).otherwise(0))
.withColumn("runway_in_use_27", when(array_contains('runway_set, "27"), 1).otherwise(0))
.withColumn("runway_in_use_15L", when(array_contains('runway_set, "15L"), 1).otherwise(0))
.withColumn("runway_in_use_33R", when(array_contains('runway_set, "33R"), 1).otherwise(0))
.withColumn("runway_in_use_15R", when(array_contains('runway_set, "15R"), 1).otherwise(0))
.withColumn("runway_in_use_33L", when(array_contains('runway_set, "33L"), 1).otherwise(0))
This produces essentially one-hot encoded values like so:
---------- -------------- ----------------- ----------------- ----------------- ----------------- ---------------- ---------------- ----------------- ----------------- ----------------- -----------------
| runways| runway_set|runway_in_use_08L|runway_in_use_26R|runway_in_use_08R|runway_in_use_26L|runway_in_use_09|runway_in_use_27|runway_in_use_15L|runway_in_use_33R|runway_in_use_15R|runway_in_use_33L|
---------- -------------- ----------------- ----------------- ----------------- ----------------- ---------------- ---------------- ----------------- ----------------- ----------------- -----------------
|08L,08R,09|[08L, 08R, 09]| 1| 0| 1| 0| 1| 0| 0| 0| 0| 0|
---------- -------------- ----------------- ----------------- ----------------- ----------------- ---------------- ---------------- ----------------- ----------------- ----------------- -----------------
Feels like I should be able to take a static sequence of all identifiers and perform some programmatic operation to do all of the above in a loop/map/foreach type of expression, but I am not sure how to formulate it.
E.g.:
val all_runways = Seq("08L","26R","08R","26L","09","27","15L","33R","15R","33L")
// iterate through and update each column, e.g. 'runway_in_use_$i'
Any pointers? Thanks in advance.
CodePudding user response:
Typical use case for fold.
val df = data.toDF("runways")
.withColumn("runway_set", split('runways, ","))
val df2 = all_runways.foldLeft(df) { (acc, x) =>
acc.withColumn(s"runway_in_use_$x", when(array_contains('runway_set, x), 1).otherwise(0))
}