I have two data frames, df_emp and df_dept.

df_emp:
id  Name
1   aaa
2   bbb
3   ccc
4   ddd
df_dept:
dept_id  dept_name  employees
1        DE         [1, 2]
2        DA         [3, 4]
The expected result after joining:
dept_name  employees  employee_names
DE         [1, 2]     [aaa, bbb]
DA         [3, 4]     [ccc, ddd]
Any idea how to do it using simple joins or UDFs?
CodePudding user response:
It can be done without a UDF: first explode the array, then join and group.
Input data:
from pyspark.sql import functions as F
df_emp = spark.createDataFrame(
    [(1, 'aaa'),
     (2, 'bbb'),
     (3, 'ccc'),
     (4, 'ddd')],
    ['id', 'Name']
)
df_dept = spark.createDataFrame(
    [(1, 'DE', [1, 2]),
     (2, 'DA', [3, 4])],
    ['dept_id', 'dept_name', 'employees']
)
Script:
# One row per (department, employee id) pair
df_dept_exploded = df_dept.withColumn('id', F.explode('employees'))

# Attach each employee's name to the exploded rows
df_joined = df_dept_exploded.join(df_emp, 'id', 'left')

# Collapse back to one row per department
df = (
    df_joined
    .groupBy('dept_name')
    .agg(
        F.collect_list('id').alias('employees'),
        F.collect_list('Name').alias('employee_names')
    )
)
df.show()
# +---------+---------+--------------+
# |dept_name|employees|employee_names|
# +---------+---------+--------------+
# |       DE|   [1, 2]|    [aaa, bbb]|
# |       DA|   [3, 4]|    [ccc, ddd]|
# +---------+---------+--------------+
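One caveat: the ordering of values inside collect_list is not deterministic after the shuffle that groupBy introduces, so the collected arrays can come back in a different order between runs. If the order matters, a minimal sketch of one way to pin it down is to collect (id, Name) structs and sort the array (assumes Spark 2.4+, where array_sort can order an array of structs by their fields):

# Sketch: collect (id, Name) pairs as structs so the two columns
# cannot drift apart, sort the array by id, then split the fields out.
df_ordered = (
    df_joined
    .groupBy('dept_name')
    .agg(F.array_sort(F.collect_list(F.struct('id', 'Name'))).alias('pairs'))
    .select(
        'dept_name',
        F.col('pairs.id').alias('employees'),
        F.col('pairs.Name').alias('employee_names'),
    )
)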
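And since you mentioned simple joins: if you would rather skip the explode step entirely, joining on array membership is another option, sketched below. Note that Spark plans this as a non-equi join, which can be noticeably slower on large inputs than the explode-and-join above.

# Sketch: join each employee to the department whose employees array
# contains their id, then collect the names per department.
df_alt = (
    df_dept
    .join(df_emp, F.expr('array_contains(employees, id)'), 'left')
    .groupBy('dept_name', 'employees')
    .agg(F.collect_list('Name').alias('employee_names'))
)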