Home > Software design >  Pyspark - Insert List to Dataframe Cell
Pyspark - Insert List to Dataframe Cell

Time:12-12

I have a dictionary and I want to add the list of keys of the dictionary to every cell of a column in a dataframe. At the moment I haven't been successful with my attempts and I am not sure why.

The dictionary looks like this:

my_dict = {"A":"1","B":"2","C":"3","D":"4"}

I want to add the keys of my_dict to a dataframe column so the end result looks like this:

 ------------ ------------ ------------ 
|       comb1|       comb2|        colA|
 ------------ ------------ ------------ 
|          YY|         XX |[A, B, C, D]|
 ------------ ------------ ------------ 

The goal is for the next step to be an explode so the dataframe looks like this:

 ------------ ------------ ------------ 
|       comb1|       comb2|        colA|
 ------------ ------------ ------------ 
|          YY|         XX |           A|
 ------------ ------------ ------------ 
|          YY|         XX |           B|
 ------------ ------------ ------------ 
|          YY|         XX |           C|
 ------------ ------------ ------------ 
|          YY|         XX |           D|
 ------------ ------------ ------------ 

How can I insert the dictionary keys on every row of a column to then explode ?

CodePudding user response:

You can create some additional constant columns with the keys of your dictionary, then create an array out of them and finally explode that column.

The code it's easier than the explanation:

from pyspark.sql import functions as F

# create temporary constant columns with the keys of the dictionary
for k in my_dict.keys():
    df = df.withColumn(f'_temp_{k}', F.lit(k))


df = (
    df
    # add a column with an array collecting all the keys
    .withColumn('colA', F.array(*[f'_temp_{k}' for k in my_dict.keys()]))
    # drop the temporary columns
    .drop(*[f'_temp_{k}' for k in my_dict.keys()])
    # explode the column with the array
    .withColumn('colA', F.explode(F.col('colA')))
)

The resulting df is:

 ----- ----- ---- 
|comb1|comb2|colA|
 ----- ----- ---- 
|   YY|   XX|   A|
|   YY|   XX|   B|
|   YY|   XX|   C|
|   YY|   XX|   D|
 ----- ----- ---- 
  • Related