I can't build the full PySpark DataFrame that I need. My dictionary currently has this format:
d = {0:
{'Key Features': ['Obese', 'Exercise']},
'Properties': {'Balding': True, 'Tall': False, 'Obese': True, 'Exercise': False}},
1:
{'Key Features': [None]},
'Properties': {'Balding': True, 'Tall': False, 'Obese': False, 'Exercise': True}},
...}
I want to create a dataframe in this format:
+---------+------+-------+----------+---------------------+
|'Balding'|'Tall'|'Obese'|'Exercise'|       'Key Features'|
+---------+------+-------+----------+---------------------+
|     true| false|  false|     false|['Obese', 'Exercise']|
+---------+------+-------+----------+---------------------+
|     true| false|  false|      true|               [None]|
+---------+------+-------+----------+---------------------+
I was able to create a DataFrame for the 'Properties' with this code:
df = spark.createDataFrame([d[i]['Properties'] for i in d]).show()
Which outputs this dataframe:
+---------+------+-------+----------+
|'Balding'|'Tall'|'Obese'|'Exercise'|
+---------+------+-------+----------+
|     true| false|  false|     false|
+---------+------+-------+----------+
|     true| false|  false|      true|
+---------+------+-------+----------+
I tried to add the column like this:
df.withColumn('Key Features', array(lit([d[i]['Key Features'] for i in d])))
It fails and does not add the list as a column. I also tried to create a DataFrame like this, which did not work either:
df = spark.createDataFrame([d[i]['Key Features'] for i in d]).show()
It raises this error:
Input row doesn't have expected number of values required by the schema. 4 fields are required while 1 values was provided.
How can I add 'Key Features' as a column containing the lists from the dictionary, either when calling createDataFrame or afterwards with withColumn?
Answer:
I think your example input d is a bit malformed, since it puts 'Properties' at the same level as 0 and 1, so there are multiple 'Properties' keys at the top level. Given how you index into d, I'm going to assume d looks like this. Let me know if my assumptions are wrong and I will try to correct the answer.
d = {
    0: {
        'Key Features': ['Obese', 'Exercise'],
        'Properties': {'Balding': True, 'Tall': False, 'Obese': True, 'Exercise': False},
    },
    1: {
        'Key Features': [None],
        'Properties': {'Balding': True, 'Tall': False, 'Obese': False, 'Exercise': True},
    },
}
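As a quick illustration of why the flat layout can't be what you actually have, here is a small plain-Python sketch. The name flat is hypothetical: it is my reading of the question's dict with the braces balanced so that 'Properties' sits next to 0 and 1 rather than inside them.

```python
# Hypothetical "flat" reading of the question's dict, with braces balanced
# so that 'Properties' is a top-level key alongside 0 and 1.
flat = {
    0: {'Key Features': ['Obese', 'Exercise']},
    'Properties': {'Balding': True, 'Tall': False, 'Obese': True, 'Exercise': False},
    1: {'Key Features': [None]},
    'Properties': {'Balding': True, 'Tall': False, 'Obese': False, 'Exercise': True},
}

# Duplicate keys in a dict literal are not an error: the last value wins.
print(list(flat))                      # [0, 'Properties', 1]
print(flat['Properties']['Exercise'])  # True (only the second entry survives)

# The question indexes d[i]['Properties'], which the flat layout cannot satisfy.
try:
    flat[0]['Properties']
    lookup_ok = True
except KeyError:
    lookup_ok = False
print(lookup_ok)  # False
```

Since d[i]['Properties'] worked for you, the nested layout is the only reading that fits.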
With that structure, you can create the DataFrame you want by merging 'Key Features' and the 'Properties' entries into one dict per row:
df = spark.createDataFrame(
    [
        {"Key Features": v["Key Features"], **v["Properties"]}
        for v in d.values()
    ]
)
df.show()
+-------+--------+-----------------+-----+-----+
|Balding|Exercise|     Key Features|Obese| Tall|
+-------+--------+-----------------+-----+-----+
|   true|   false|[Obese, Exercise]| true|false|
|   true|    true|           [null]|false|false|
+-------+--------+-----------------+-----+-----+
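The output above lists the columns alphabetically rather than in the order you asked for. If the order matters, one option is to build each row as a tuple in the order you want and pass the column names separately. This is only a sketch under the assumed structure of d from earlier in this answer. The names property_columns, columns and data are mine, and the Spark call is left as a comment because it needs an active SparkSession named spark.

```python
# Assumed structure of d, as in the answer above.
d = {
    0: {
        'Key Features': ['Obese', 'Exercise'],
        'Properties': {'Balding': True, 'Tall': False, 'Obese': True, 'Exercise': False},
    },
    1: {
        'Key Features': [None],
        'Properties': {'Balding': True, 'Tall': False, 'Obese': False, 'Exercise': True},
    },
}

# Column order requested in the question.
property_columns = ["Balding", "Tall", "Obese", "Exercise"]
columns = property_columns + ["Key Features"]

# One tuple per entry of d, with values in the same order as `columns`.
data = [
    tuple(v["Properties"][name] for name in property_columns) + (v["Key Features"],)
    for v in d.values()
]

print(columns)
print(data[0])  # (True, False, True, False, ['Obese', 'Exercise'])

# Assuming an active SparkSession named `spark`, as in the question:
# df = spark.createDataFrame(data, schema=columns)
# df.show()  # kept on its own line: show() returns None, so don't assign its result to df
```

If you already have the DataFrame from the first snippet, df.select(*columns) should reorder the existing columns the same way.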