I have a Spark dataframe with an ArrayType column:
+--+---------+
|id|neighbors|
+--+---------+
|a |[b,c]    |
|b |[a,d]    |
|c |[a]      |
|d |[b]      |
+--+---------+
For each id, I need to replace every value in the neighbors array with that value's own neighbors list from the same dataframe.
Desired output:
+--+-----------+
|id|neighbors  |
+--+-----------+
|a |[[a,d],[a]]|
|b |[[b,c],[b]]|
|c |[[b,c]]    |
|d |[[a,d]]    |
+--+-----------+
What is the best way to handle this problem? I have a very large amount of data (about 100 million records).
CodePudding user response:
You would need to explode the column 'neighbors' and then join the exploded values back to the original dataframe. Also, since this is a self join, it is recommended to use alias on the dataframes so the two sides can be referenced unambiguously.
Initial df:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('a', ['b', 'c']),
     ('b', ['a', 'd']),
     ('c', ['a']),
     ('d', ['b'])],
    ['id', 'neighbors']
)
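For reference, df.printSchema() confirms the ArrayType column (output shown as comments; the exact nullability flags may differ depending on how the data is actually loaded):
df.printSchema()
# root
#  |-- id: string (nullable = true)
#  |-- neighbors: array (nullable = true)
#  |    |-- element: string (containsNull = true)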
Transformation:
df = (
    df.withColumn('_neighbors', F.explode('neighbors')).alias('df1')
    # join every exploded neighbor value back to its own row
    .join(df.alias('df2'), F.col('df1._neighbors') == F.col('df2.id'))
    # regroup by the original id and collect the matched neighbor lists
    .groupBy('df1.id')
    .agg(F.collect_list('df2.neighbors').alias('neighbors'))
)
df.show()
# +---+-------------+
# | id|    neighbors|
# +---+-------------+
# |  d|     [[a, d]]|
# |  c|     [[b, c]]|
# |  b|[[b, c], [b]]|
# |  a|[[a, d], [a]]|
# +---+-------------+
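Two notes, since the question mentions roughly 100 million records.
First, collect_list gives no ordering guarantee after a shuffle, so the inner lists can come back in any order. If the output must follow the original order of the neighbors array, one option is to carry each element's position through the join and sort on it. A minimal sketch, assuming df still refers to the original input dataframe (not the transformed one) and Spark 2.4+; the names result, pos, _neighbor and tmp are illustrative:
from pyspark.sql import functions as F

result = (
    # posexplode keeps each neighbor's position within the original array
    df.select('id', F.posexplode('neighbors').alias('pos', '_neighbor')).alias('df1')
    .join(df.alias('df2'), F.col('df1._neighbor') == F.col('df2.id'))
    .groupBy('df1.id')
    # collect (pos, neighbors) structs, sort by pos, then drop the positions
    .agg(F.sort_array(F.collect_list(F.struct('df1.pos', 'df2.neighbors'))).alias('tmp'))
    .select('id', F.col('tmp.neighbors').alias('neighbors'))
)
result.show()
Second, at this scale the self join is dominated by the shuffle on the join key, and broadcasting is not an option because both sides are the full dataframe. Leaving adaptive query execution on (spark.sql.adaptive.enabled, enabled by default since Spark 3.2) and sizing spark.sql.shuffle.partitions for the cluster is a reasonable starting point.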