How to get the size of a list returned by column in pyspark-CodePudding

name	contact	address
"max"	[{"email": "[email protected]", "phone": "650-333-3456"}, {"email": "[email protected]", "phone": "238-111-7689"}]	{"city": "Baltimore", "state": "MD"}
"kyle"	[{"email": "[email protected]", "phone": "425-231-8754"}]	{"city": "Barton", "state": "TN"}

I am working with a dataframe in Pyspark that has a few columns including the two mentioned above. I need to create columns dynamically based on the contact fields.

When I use the "." operator on contact as contact.email I get a list of emails. I need to create separate column for each of the emails. contact.email0, contact.email1, etc.

I found this code online, which partially does what I want, but I don't completely understand it.

employee_data.select(
    'name', *[col('contact.email')[i].alias(f'contact.email{i}') for i in range(2)]).show(truncate=False)

The range is static in this case, but my range could be dynamic. How can I get the size of list to loop through it? I tried size(col('contact.email')) or len(col('contact.email')) but got an error saying the col('column name') object is not iterable.

Desired output something like -

name	contact.email0	contact.email1
max	[email protected]	[email protected]
kyle	[email protected]	null

CodePudding user response：

You can get desired output by using pivot function,

# convert contact struct to array of emails by using transform function
# explode the array 
# perform pivot
 df.select("name", posexplode_outer(expr("transform(contact, c-> c.email)"))) \
    .withColumn("email", concat(lit("contact.email"), col("pos"))) \
    .groupBy("name").pivot("email").agg(first("col")) \
    .show(truncate=False)

 ---- ------------------- --------------- 
|name|contact.email0     |contact.email1 |
 ---- ------------------- --------------- 
|kyle|[email protected]|null           |
|max |[email protected]|[email protected]|
 ---- ------------------- ---------------

CodePudding user response：

To understand what the solution you found does, we can print the expression in a shell:

>>> [F.col('contact.email')[i].alias(f'contact.email{i}') for i in range(2)]
[Column<'contact.email[0] AS `contact.email0`'>, Column<'contact.email[1] AS `contact.email1`'>]

Basically, it creates two columns, one for the first element of the array contact.email and one for the second element. That's all there is to it.

SOLUTION 1

Keep this solution. But you need to find the max size of your array first:

max_size = df.select(F.max(F.size("contact"))).first()[0]
df.select('name',
    *[F.col('contact')[i]['email'].alias(f'contact.email{i}') for i in range(max_size)])\
  .show(truncate=False)

SOLUTION 2

Use posexplode to generate one row per element of the array a pos column containing the index of the email in the array. Then use a pivot to create the columns you want.

df.select('name', F.posexplode('contact.email').alias('pos', 'email'))\
  .withColumn('pos', F.concat(F.lit('contact.email'), 'pos'))\
  .groupBy('name')\
  .pivot('pos')\
  .agg(F.first('email'))\
  .show()

Both solutions yield:

 ---- ------------------- --------------- 
|name|contact.email0     |contact.email1 |
 ---- ------------------- --------------- 
|max |[email protected]|[email protected]|
|kyle|[email protected]|null           |
 ---- ------------------- ---------------