I have the following dataframe that is extracted with the following command:
extract = data.select('properties.id', 'flags')
| id | flags |
|-------| ---------------------------|
| v_001 | "{"93":true,"83":true}" |
| v_002 | "{"45":true,"76":true}" |
The desired result I want is:
| id | flags |
|-------| ------|
| v_001 | 93 |
| v_001 | 83 |
| v_002 | 45 |
| v_002 | 76 |
I tried to apply explode
as the following:
extract = data.select('properties.id', explode(col('flags')))
But I encountered the following:
cannot resolve 'explode(flags)' due to data type mismatch: input to function explode should be array or map type, not struct<93:boolean,83:boolean,45:boolean,76:boolean>
This makes sense as the schema of the column is not compatible with the explode
function. How can I adjust the function to get my desired result? Is there a better way to solve this problem?
P.D.: The desired table schema is not the best design but this is out of my scope since this will involve another topic discussion.
CodePudding user response:
As you might already looked, explode
requires ArrayType
and it seems you are only taking the keys from the dict in flags
.
So, you can first convert the flags
to MapType
and use map_keys
to extract all keys into list.
df.withColumn('flags', F.map_keys(F.from_json('flags', MapType(StringType(), BooleanType()))))
This will result in like this
----- --------
| id| flags|
----- --------
|v_001|[93, 83]|
|v_002|[45, 76]|
----- --------
Then you can use explode
on the flags
.
.select('id', F.explode('flags'))
----- ---
| id|col|
----- ---
|v_001| 93|
|v_001| 83|
|v_002| 45|
|v_002| 76|
----- ---
The whole code
df = (df.withColumn('flags', F.map_keys(F.from_json('flags', MapType(StringType(), BooleanType()))))
.select('id', F.explode('flags')))
Update
It is probably better to supply the schema and read as MapType
for the flags but if your json is complex and hard to create the schema, you can convert the struct
into String
once then convert to MapType
.
# Add this line before `from_json`
df = df.select('id', F.to_json('flags').alias('flags'))
# Or you can do in 1 shot.
df = (df.withColumn('flags', F.map_keys(F.from_json(F.to_json('flags'), MapType(StringType(), BooleanType()))))
.select('id', F.explode('flags')))