How do I add multiple values from a dictionary to a PySpark DataFrame


I cannot create the full PySpark DataFrame that I need. My current dictionary is in this format:

d = {0:   
{'Key Features': ['Obese', 'Exercise']},  
'Properties': {'Balding': True, 'Tall': False, 'Obese': True, 'Exercise': False}},  
1:  
{'Key Features': [None]},  
'Properties': {'Balding': True, 'Tall': False, 'Obese': False, 'Exercise': True}},  
...}  

I want to create a DataFrame in this format:

+---------+------+-------+----------+---------------------+
|'Balding'|'Tall'|'Obese'|'Exercise'|       'Key Features'|
+---------+------+-------+----------+---------------------+
|     true| false|   true|     false|['Obese', 'Exercise']|
+---------+------+-------+----------+---------------------+
|     true| false|  false|      true|               [None]|
+---------+------+-------+----------+---------------------+

I was able to create a DataFrame for the 'Properties' with this code:

df = spark.createDataFrame([d[i]['Properties'] for i in d])
df.show()

Which outputs this DataFrame:

+---------+------+-------+----------+
|'Balding'|'Tall'|'Obese'|'Exercise'|
+---------+------+-------+----------+
|     true| false|   true|     false|
+---------+------+-------+----------+
|     true| false|  false|      true|
+---------+------+-------+----------+

I have tried to add a column like this and it failed:

df.withColumn('Key Features', array(lit([d[i]['Key Features'] for i in d])))

It fails and does not add the list as a column. I also tried to create the DataFrame like this, and it did not work either:

df = spark.createDataFrame([d[i]['Key Features'] for i in d]).show()  

This outputs the error: Input row doesn't have expected number of values required by the schema. 4 fields are required while 1 values was provided.
How would I go about adding 'Key Features' as a column containing the list from the dictionary, either by including it in the createDataFrame call or by using withColumn?

CodePudding user response:

I think your example input d is a bit malformed since it puts 'Properties' at the same level as 0 and 1, and there are multiple 'Properties' keys at the top level as a result. Given how you index into d, I'm going to assume d looks like this. Let me know if my assumptions are wrong and I will try to correct the answer.

d = {
    0: {
        'Key Features': ['Obese', 'Exercise'],
        'Properties': {'Balding': True, 'Tall': False, 'Obese': True, 'Exercise': False},
    },
    1: {
        'Key Features': [None],
        'Properties': {'Balding': True, 'Tall': False, 'Obese': False, 'Exercise': True},
    },
}

You can create the DataFrame you want like this:

df = spark.createDataFrame(
    [
        # Flatten each entry: merge its 'Properties' dict with its 'Key Features' list
        {"Key Features": v["Key Features"], **v["Properties"]}
        for v in d.values()
    ]
)
df.show()
+-------+--------+-----------------+-----+-----+
|Balding|Exercise|     Key Features|Obese| Tall|
+-------+--------+-----------------+-----+-----+
|   true|   false|[Obese, Exercise]| true|false|
|   true|    true|           [null]|false|false|
+-------+--------+-----------------+-----+-----+
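
Note that when PySpark infers the schema from a list of dicts, it sorts the field names alphabetically, which is why the column order above differs from your desired layout. If the order matters, a select afterwards restores it (a minimal sketch, reusing the df built above):

# Restore the column order from the question
df = df.select("Balding", "Tall", "Obese", "Exercise", "Key Features")
df.show()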
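
If you would rather not rely on schema inference at all (for example, if every 'Key Features' list happened to be [None], Spark could not infer the element type), you can pass an explicit schema. A sketch under that assumption, using the same flattened dicts:

from pyspark.sql.types import (
    ArrayType, BooleanType, StringType, StructField, StructType,
)

# Explicit schema: pins down both the column order and the array element type
schema = StructType([
    StructField("Balding", BooleanType()),
    StructField("Tall", BooleanType()),
    StructField("Obese", BooleanType()),
    StructField("Exercise", BooleanType()),
    StructField("Key Features", ArrayType(StringType())),
])

df = spark.createDataFrame(
    [{**v["Properties"], "Key Features": v["Key Features"]} for v in d.values()],
    schema=schema,
)
df.show()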