This seems like it should be relatively straightforward, but I haven't been able to find an example of how to do this efficiently after scouring many resources.
I have a Spark DataFrame where each row is a single string of alternating keys and values, all separated by the same character (a space). It is formatted like so:
| value |
| ----------------------------------------|
| key1 value1 key2 value2 key3 value3 ... |
My intent is to map this into a DataFrame that looks like this:
| key1 | key2 | key3 | ... |
| ------ | ------ | ------ | --- |
| value1 | value2 | value3 | ... |
The names of the keys are not known ahead of time, nor is the number of pairs. However, I could make a solution work that starts from a static list of the keys we care about, if that makes it workable.
I had hoped `str_to_map` might work, but it does not when the key/value separator is the same as the pair separator. I could do `df.select("value").as[String].flatMap(_.split(" "))` and then presumably somehow massage that array into a new DataFrame, but I'm having trouble getting it right. Any ideas? Thank you.
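For reference, here is a minimal sketch of the `str_to_map` attempt. As I understand the behavior (not verified on every Spark version), when both separators are a space, every token is parsed as a lone key with a null value:

import org.apache.spark.sql.functions.expr
import spark.implicits._

// With pair and key/value separators both ' ', each token becomes its own
// "pair", so the result is roughly
// Map(key1 -> null, value1 -> null, key2 -> null, value2 -> null).
Seq("key1 value1 key2 value2").toDF("value")
  .select(expr("str_to_map(value, ' ', ' ')").as("kv"))
  .show(false)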
CodePudding user response:
Doing something like this worked out all right, but it did require collecting the keys we care about ahead of time.
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import spark.implicits._

val fields = Seq(...)
val fieldIndices = fields.zipWithIndex.toMap

// One string column per known key.
val structFields = fields.map(f => StructField(f, StringType, nullable = false))
val schema = StructType(structFields)
val rowEncoder = RowEncoder.apply(schema)

val rowDS = inputDF.select($"value".cast(StringType))
  .as[String]
  .map(_.split(" "))
  .map { tokens =>
    // Default every column to "" and fill in the values for keys we know;
    // pairs with unrecognized keys (and any odd trailing token) are dropped.
    val values = Array.fill(fields.length)("")
    tokens.grouped(2).foreach {
      case Array(k, v) if fieldIndices.contains(k) => values(fieldIndices(k)) = v
      case _ => ()
    }
    Row.fromSeq(values.toSeq)
  }(rowEncoder)
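For illustration, with a hypothetical fields = Seq("key1", "key2", "key3") and a one-row input, this should produce:

val inputDF = Seq("key1 value1 key2 value2 key3 value3").toDF("value")
// rowDS.show() should then print roughly:
// +------+------+------+
// |  key1|  key2|  key3|
// +------+------+------+
// |value1|value2|value3|
// +------+------+------+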
Would still be interested in other efficient approaches.
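One alternative I've sketched but not benchmarked: rewrite the string so the two separators differ, and then `str_to_map` can do the parsing. This assumes keys and values never contain spaces, ':' or ',', and it still needs the same static key list (`fields`) as above to turn map entries into columns:

import org.apache.spark.sql.functions.{col, expr, regexp_replace}

// Turn "k1 v1 k2 v2" into "k1:v1,k2:v2," so the pair separator (',')
// differs from the key/value separator (':'). The trailing ',' leaves a
// harmless empty entry in the map.
val remapped = inputDF
  .withColumn("pairs", regexp_replace(col("value"), "(\\S+) (\\S+) ?", "$1:$2,"))
  .withColumn("kv", expr("str_to_map(pairs, ',', ':')"))

// Project each known key out of the map into its own column.
val result = fields
  .foldLeft(remapped)((df, f) => df.withColumn(f, col("kv").getItem(f)))
  .drop("value", "pairs", "kv")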