Spark: Split String with single separator into key/value dataframe columns


This seems like it should be relatively straightforward, but after scouring many resources I haven't been able to find an example of how to do it efficiently.

I have a Spark DataFrame where each row is a single string with alternating keys and values separated by the same separator (space). It is formatted like so:

|               value                     |
| ----------------------------------------|
| key1 value1 key2 value2 key3 value3 ... |

My intent is to map this into a DataFrame that looks like this:

|  key1  |  key2  |  key3  | ... |
| ------ | ------ | ------ | --- |
| value1 | value2 | value3 | ... |

The names of the keys are not known ahead of time, nor is the number of pairs. However, I could work with a solution that starts from a static list of the keys we care about, if that makes things easier.

I had hoped str_to_map might work, but it does not when the key/value separator is the same as the pair separator (a quick illustration is below). I could do df.select("value").as[String].flatMap(_.split(" ")) and then presumably massage those tokens into a new DataFrame, but I'm having trouble getting it right. Any ideas? Thank you.
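For reference, the str_to_map attempt looks roughly like this; with the same space character as both the pair delimiter and the key/value delimiter, every token appears to end up as its own key with a null value, so the pairing is lost:

inputDF.selectExpr("str_to_map(value, ' ', ' ') AS m").show(false)
// roughly: {key1 -> null, value1 -> null, key2 -> null, value2 -> null}
// each token becomes its own key and the values disappear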

CodePudding user response:

Doing something like this worked out alright, but did require collecting the keys we care about ahead of time.

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import spark.implicits._

// The static list of keys we care about, and the column position of each.
val fields = Seq(...)
val fieldIndices = fields.zipWithIndex.toMap

// One string column per key.
val structFields = fields.map(f => StructField(f, StringType, nullable = false))
val schema = StructType(structFields)
val rowEncoder = RowEncoder.apply(schema)

val rowDS = inputDF.select($"value".cast(StringType))
  .as[String]
  .map(_.split(" "))
  .map(tokens => {
    // Default every column to "", then walk the tokens two at a time and
    // fill in the value for any key we recognise.
    val values = Array.fill(fields.length)("")
    tokens.grouped(2).foreach {
      case Array(k, v) if fieldIndices.contains(k) => values(fieldIndices(k)) = v
      case _                                       => ()
    }
    Row.fromSeq(values.toSeq)
  }) (rowEncoder)

Would still be interested in other efficient approaches.
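One possibility, if you're on Spark 3.x, is to stay in the DataFrame API and let Spark discover the keys at runtime: split each line into tokens, take the even-indexed tokens as keys and the odd-indexed tokens as values using the higher-order filter function, build a map from the two arrays, explode it, and pivot on the key. A rough sketch (assuming the input column is named value and every line has a complete set of key/value pairs; the id column is only there to keep one output row per input row):

import org.apache.spark.sql.functions._

val tokens = split(col("value"), " ")
val keys   = filter(tokens, (_, i) => i % 2 === 0)   // even positions: keys
val vals   = filter(tokens, (_, i) => i % 2 === 1)   // odd positions: values

val exploded = inputDF
  .withColumn("id", monotonically_increasing_id())
  .select(col("id"), explode(map_from_arrays(keys, vals)))

// pivot without an explicit value list runs an extra job to find the distinct keys
val wide = exploded.groupBy("id").pivot("key").agg(first("value")).drop("id")

Whether this beats the typed map above probably depends on how wide the key space is, since the pivot has to materialise every distinct key as a column.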
