Home > Software engineering >  How to split a dictionary in a Pyspark dataframe into multiple rows?
How to split a dictionary in a Pyspark dataframe into multiple rows?

Time:12-09

I have the following dataframe that is extracted with the following command:

extract   = data.select('properties.id', 'flags')


|   id  | flags                      |
|-------| ---------------------------|
| v_001 | "{"93":true,"83":true}"    |
| v_002 | "{"45":true,"76":true}"    |

The desired result I want is:

|   id  | flags |
|-------| ------|
| v_001 | 93    |
| v_001 | 83    |
| v_002 | 45    |
| v_002 | 76    |

I tried to apply explode as the following:

extract   = data.select('properties.id', explode(col('flags')))

But I encountered the following:

 cannot resolve 'explode(flags)' due to data type mismatch: input to function explode should be array or map type, not struct<93:boolean,83:boolean,45:boolean,76:boolean>

This makes sense as the schema of the column is not compatible with the explode function. How can I adjust the function to get my desired result? Is there a better way to solve this problem?

P.D.: The desired table schema is not the best design but this is out of my scope since this will involve another topic discussion.

CodePudding user response:

As you might already looked, explode requires ArrayType and it seems you are only taking the keys from the dict in flags.

So, you can first convert the flags to MapType and use map_keys to extract all keys into list.

df.withColumn('flags', F.map_keys(F.from_json('flags', MapType(StringType(), BooleanType()))))

This will result in like this

 ----- -------- 
|   id|   flags|
 ----- -------- 
|v_001|[93, 83]|
|v_002|[45, 76]|
 ----- -------- 

Then you can use explode on the flags.

.select('id', F.explode('flags'))
 ----- --- 
|   id|col|
 ----- --- 
|v_001| 93|
|v_001| 83|
|v_002| 45|
|v_002| 76|
 ----- --- 

The whole code

df = (df.withColumn('flags', F.map_keys(F.from_json('flags', MapType(StringType(), BooleanType()))))
      .select('id', F.explode('flags')))

Update

It is probably better to supply the schema and read as MapType for the flags but if your json is complex and hard to create the schema, you can convert the struct into String once then convert to MapType.

# Add this line before `from_json`
df = df.select('id', F.to_json('flags').alias('flags'))

# Or you can do in 1 shot.
df = (df.withColumn('flags', F.map_keys(F.from_json(F.to_json('flags'), MapType(StringType(), BooleanType()))))
      .select('id', F.explode('flags')))
  • Related