I have a CSV file that looks like this (it was saved from PySpark output):
name_value
"[quality1 -> good, quality2 -> OK, quality3 -> bad]"
"[quality1 -> good, quality2 -> excellent]"
How can I use PySpark to read this CSV file and convert the name_value column into a map type?
CodePudding user response:
Something like the below should work:
data = {}
line = '[quality1 -> good, quality2 -> OK, quality3 -> bad]'
# drop the surrounding brackets, then split into "key -> value" parts
parts = line[1:-1].split(',')
for part in parts:
    k, v = part.split('->')
    data[k.strip()] = v.strip()
print(data)
Output:
{'quality1': 'good', 'quality2': 'OK', 'quality3': 'bad'}
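If you want to apply that same parsing while reading the file with PySpark (as the question asks), one way is to wrap it in a UDF that returns a MapType column. A minimal sketch, assuming an existing spark session and a hypothetical file path data.csv:

from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType

def parse_name_value(line):
    # "[k1 -> v1, k2 -> v2]" -> {'k1': 'v1', 'k2': 'v2'}
    if line is None:
        return None
    data = {}
    for part in line[1:-1].split(','):
        k, v = part.split('->')
        data[k.strip()] = v.strip()
    return data

parse_udf = F.udf(parse_name_value, MapType(StringType(), StringType()))

df = spark.read.option("header", "true").csv("data.csv")  # hypothetical path
df = df.withColumn("name_value", parse_udf("name_value"))

The built-in-function approach in the next answer avoids the Python UDF overhead, which matters on larger datasets.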
CodePudding user response:
Using a combination of split and regexp_replace cuts the string into key-value pairs. In a second step, each key-value pair is transformed first into a struct and then into a map entry:
from pyspark.sql import functions as F

df = spark.read.option("header", "true").csv(...)

# Drop the brackets, split on the commas (and any following whitespace), then turn
# each "key -> value" string into a (key, value) struct and build the map from those entries.
df1 = df.withColumn("name_value", F.split(F.regexp_replace("name_value", "[\\[\\]]", ""), ",\\s*")) \
    .withColumn("name_value", F.map_from_entries(F.expr("""transform(name_value, e -> (regexp_extract(e, '^(.*) ->', 1), regexp_extract(e, '-> (.*)$', 1)))""")))
df1 now has the schema
root
|-- name_value: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
and contains the same data as the original CSV file.
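As a quick check (a sketch; the exact show() formatting varies with the Spark version), the map keys can then be accessed directly:

df1.printSchema()
df1.show(truncate=False)

# Look up an individual key; rows without that key yield null.
df1.select(F.col("name_value").getItem("quality1").alias("quality1")).show()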