name | contact | address |
---|---|---|
"max" | [{"email": "[email protected]", "phone": "650-333-3456"}, {"email": "[email protected]", "phone": "238-111-7689"}] | {"city": "Baltimore", "state": "MD"} |
"kyle" | [{"email": "[email protected]", "phone": "425-231-8754"}] | {"city": "Barton", "state": "TN"} |
I am working with a dataframe in Pyspark that has a few columns including the two mentioned above. I need to create columns dynamically based on the contact fields.
When I use the "." operator on contact
as contact.email
I get a list of emails. I need to create separate column for each of the emails.
contact.email0
, contact.email1
, etc.
I found this code online, which partially does what I want, but I don't completely understand it.
employee_data.select(
'name', *[col('contact.email')[i].alias(f'contact.email{i}') for i in range(2)]).show(truncate=False)
The range is static in this case, but my range could be dynamic. How can I get the size of list to loop through it? I tried size(col('contact.email'))
or len(col('contact.email'))
but got an error saying the col('column name')
object is not iterable.
Desired output something like -
name | contact.email0 | contact.email1 |
---|---|---|
max | [email protected] | [email protected] |
kyle | [email protected] | null |
CodePudding user response:
You can get desired output by using pivot
function,
# convert contact struct to array of emails by using transform function
# explode the array
# perform pivot
df.select("name", posexplode_outer(expr("transform(contact, c-> c.email)"))) \
.withColumn("email", concat(lit("contact.email"), col("pos"))) \
.groupBy("name").pivot("email").agg(first("col")) \
.show(truncate=False)
---- ------------------- ---------------
|name|contact.email0 |contact.email1 |
---- ------------------- ---------------
|kyle|[email protected]|null |
|max |[email protected]|[email protected]|
---- ------------------- ---------------
CodePudding user response:
To understand what the solution you found does, we can print the expression in a shell:
>>> [F.col('contact.email')[i].alias(f'contact.email{i}') for i in range(2)]
[Column<'contact.email[0] AS `contact.email0`'>, Column<'contact.email[1] AS `contact.email1`'>]
Basically, it creates two columns, one for the first element of the array contact.email
and one for the second element. That's all there is to it.
SOLUTION 1
Keep this solution. But you need to find the max size of your array first:
max_size = df.select(F.max(F.size("contact"))).first()[0]
df.select('name',
*[F.col('contact')[i]['email'].alias(f'contact.email{i}') for i in range(max_size)])\
.show(truncate=False)
SOLUTION 2
Use posexplode
to generate one row per element of the array a pos column containing the index of the email in the array. Then use a pivot to create the columns you want.
df.select('name', F.posexplode('contact.email').alias('pos', 'email'))\
.withColumn('pos', F.concat(F.lit('contact.email'), 'pos'))\
.groupBy('name')\
.pivot('pos')\
.agg(F.first('email'))\
.show()
Both solutions yield:
---- ------------------- ---------------
|name|contact.email0 |contact.email1 |
---- ------------------- ---------------
|max |[email protected]|[email protected]|
|kyle|[email protected]|null |
---- ------------------- ---------------