I have a CSV file that looks like this (it was saved from PySpark output):
name_value
"[quality1 -> good, quality2 -> OK, quality3 -> bad]"
"[quality1 -> good, quality2 -> excellent]"
How can I use PySpark to read this CSV file and convert the name_value column into a map type?
CodePudding user response:
Something like the below should work:
data = {}
line = '[quality1 -> good, quality2 -> OK, quality3 -> bad]'
# drop the surrounding brackets, then split into "key -> value" parts
parts = line[1:-1].split(',')
for part in parts:
    k, v = part.split('->')
    data[k.strip()] = v.strip()
print(data)
Output:
{'quality1': 'good', 'quality2': 'OK', 'quality3': 'bad'}
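If you want to apply that same parsing while reading the file with PySpark (as the question asks), one way is to wrap it in a UDF that returns a MapType column. A minimal sketch, assuming an existing spark session and a hypothetical file path data.csv:

from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType

def parse_name_value(line):
    # "[k1 -> v1, k2 -> v2]" -> {'k1': 'v1', 'k2': 'v2'}
    if line is None:
        return None
    data = {}
    for part in line[1:-1].split(','):
        k, v = part.split('->')
        data[k.strip()] = v.strip()
    return data

parse_udf = F.udf(parse_name_value, MapType(StringType(), StringType()))

df = spark.read.option("header", "true").csv("data.csv")  # hypothetical path
df = df.withColumn("name_value", parse_udf("name_value"))

The built-in-function approach in the next answer avoids the Python UDF overhead, which matters on larger datasets.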
CodePudding user response:
Using a combination of split and regexp_replace cuts the string into key-value pairs. In a second step, each key-value pair is transformed first into a struct and then into a map entry:
from pyspark.sql import functions as F

df = spark.read.option("header", "true").csv(...)

# Drop the brackets, split on the commas (and any following whitespace), then turn
# each "key -> value" string into a (key, value) struct and build the map from those entries.
df1 = df.withColumn("name_value", F.split(F.regexp_replace("name_value", "[\\[\\]]", ""), ",\\s*")) \
    .withColumn("name_value", F.map_from_entries(F.expr("""transform(name_value, e -> (regexp_extract(e, '^(.*) ->', 1), regexp_extract(e, '-> (.*)$', 1)))""")))
df1 now has the schema
root
|-- name_value: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
and contains the same data as the original CSV file.
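As a quick check (a sketch; the exact show() formatting varies with the Spark version), the map keys can then be accessed directly:

df1.printSchema()
df1.show(truncate=False)

# Look up an individual key; rows without that key yield null.
df1.select(F.col("name_value").getItem("quality1").alias("quality1")).show()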