Join list column with string column in PySpark

Time:05-23

I have two DataFrames, df_emp and df_dept:

df_emp:

id Name
1   aaa
2   bbb
3   ccc
4   ddd

df_dept:

dept_id dept_name employees
      1        DE    [1, 2]
      2        DA    [3, 4]

The expected result after joining:

dept_name employees employee_names
       DE    [1, 2]     [aaa, bbb]
       DA    [3, 4]     [ccc, ddd]

Any idea how to do this using simple joins or UDFs?

CodePudding user response:

It can be done without a UDF: explode the array, join on the exploded id, then group back.

Input data:

from pyspark.sql import functions as F
df_emp = spark.createDataFrame(
    [(1, 'aaa'),
     (2, 'bbb'),
     (3, 'ccc'),
     (4, 'ddd')],
    ['id', 'Name']
)
df_dept = spark.createDataFrame(
    [(1, 'DE', [1, 2]),
     (2, 'DA', [3, 4])],
    ['dept_id', 'dept_name', 'employees']
)

Script:

df_dept_exploded = df_dept.withColumn('id', F.explode('employees'))
df_joined = df_dept_exploded.join(df_emp, 'id', 'left')
df = (
    df_joined
    .groupBy('dept_name')
    .agg(
        F.collect_list('id').alias('employees'),
        F.collect_list('Name').alias('employee_names')
    )
)
df.show()
# +---------+---------+--------------+
# |dept_name|employees|employee_names|
# +---------+---------+--------------+
# |       DE|   [1, 2]|    [aaa, bbb]|
# |       DA|   [3, 4]|    [ccc, ddd]|
# +---------+---------+--------------+