This seems like it should be relatively straightforward, but I haven't been able to find an example of how to do this efficiently after scouring many resources.
I have a Spark DataFrame where each row is a single string of alternating keys and values, all separated by the same character (a space). It is formatted like so:
| value |
| ----------------------------------------|
| key1 value1 key2 value2 key3 value3 ... |
My intent is to map this into a DataFrame that looks like this:
| key1 | key2 | key3 | ... |
| ------ | ------ | ------ | --- |
| value1 | value2 | value3 | ... |
The names of the keys are not known ahead of time, nor is the number of pairs. However, I could make a solution work that starts from a static list of the keys we care about, if that makes it workable.
I had hoped `str_to_map` might work, but it does not when the key/value separator is the same as the pair separator. I could do `df.select("value").as[String].flatMap(_.split(" "))` and then presumably somehow massage that array into a new DataFrame, but I'm having trouble getting it right. Any ideas? Thank you.
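For reference, here is a minimal sketch of the `str_to_map` attempt. As I understand the behavior (not verified on every Spark version), when both separators are a space, every token is parsed as a lone key with a null value:

import org.apache.spark.sql.functions.expr
import spark.implicits._

// With pair and key/value separators both ' ', each token becomes its own
// "pair", so the result is roughly
// Map(key1 -> null, value1 -> null, key2 -> null, value2 -> null).
Seq("key1 value1 key2 value2").toDF("value")
  .select(expr("str_to_map(value, ' ', ' ')").as("kv"))
  .show(false)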
CodePudding user response:
Doing something like this worked out all right, but it did require collecting the keys we care about ahead of time.
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import spark.implicits._

val fields = Seq(...)
val fieldIndices = fields.zipWithIndex.toMap

// One string column per known key.
val structFields = fields.map(f => StructField(f, StringType, nullable = false))
val schema = StructType(structFields)
val rowEncoder = RowEncoder.apply(schema)

val rowDS = inputDF.select($"value".cast(StringType))
  .as[String]
  .map(_.split(" "))
  .map { tokens =>
    // Default every column to "" and fill in the values for keys we know;
    // pairs with unrecognized keys (and any odd trailing token) are dropped.
    val values = Array.fill(fields.length)("")
    tokens.grouped(2).foreach {
      case Array(k, v) if fieldIndices.contains(k) => values(fieldIndices(k)) = v
      case _ => ()
    }
    Row.fromSeq(values.toSeq)
  }(rowEncoder)
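For illustration, with a hypothetical fields = Seq("key1", "key2", "key3") and a one-row input, this should produce:

val inputDF = Seq("key1 value1 key2 value2 key3 value3").toDF("value")
// rowDS.show() should then print roughly:
// +------+------+------+
// |  key1|  key2|  key3|
// +------+------+------+
// |value1|value2|value3|
// +------+------+------+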
Would still be interested in other efficient approaches.
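One alternative I've sketched but not benchmarked: rewrite the string so the two separators differ, and then `str_to_map` can do the parsing. This assumes keys and values never contain spaces, ':' or ',', and it still needs the same static key list (`fields`) as above to turn map entries into columns:

import org.apache.spark.sql.functions.{col, expr, regexp_replace}

// Turn "k1 v1 k2 v2" into "k1:v1,k2:v2," so the pair separator (',')
// differs from the key/value separator (':'). The trailing ',' leaves a
// harmless empty entry in the map.
val remapped = inputDF
  .withColumn("pairs", regexp_replace(col("value"), "(\\S+) (\\S+) ?", "$1:$2,"))
  .withColumn("kv", expr("str_to_map(pairs, ',', ':')"))

// Project each known key out of the map into its own column.
val result = fields
  .foldLeft(remapped)((df, f) => df.withColumn(f, col("kv").getItem(f)))
  .drop("value", "pairs", "kv")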