I have a dataframe where one column contains several information in a 'key=value' format. There are almost a 30 different 'key=value' that can appear in that column will use 4 columns for understanding ( _age, _city, _sal, _tag)
id name properties
0 A {_age=10, _city=A, _sal=1000}
1 B {_age=20, _city=B, _sal=3000, tag=XYZ}
2 C {_city=BC, tag=ABC}
How can I convert this string column into multiple columns? Need to use spark scala dataframe for it.
The expected output is:
id name _age _city _sal tag
0 A 10 A 1000
1 B 20 B 3000 XYZ
2 C BC ABC
CodePudding user response:
Short answer
df
.select(
col("id"),
col("name"),
col("properties.*"),
..
)
CodePudding user response:
Try this :
val s = df.withColumn("dummy", explode(split(regexp_replace($"properties", "\\{|\\}", ""), ",")))
val result= s.drop("properties").withColumn("col1",split($"dummy","=")(0)).withColumn("col1-value",split($"dummy","=")(1)).drop("dummy")
result.groupBy("id","name").pivot("col1").agg(first($"col1-value")).orderBy($"id").show