I have a delta table A as shown below.
point | cluster | points_in_cluster |
---|---|---|
37 | 1 | [37,32] |
45 | 2 | [45,67,84] |
67 | 2 | [45,67,84] |
84 | 2 | [45,67,84] |
32 | 1 | [37,32] |
Also I have a table B as shown below.
id | point |
---|---|
101 | 37 |
102 | 67 |
103 | 84 |
I want a query like the following. Here `in` obviously doesn't work against an array column, so what would be the right syntax?
```sql
select b.id, a.point
from A a, B b
where b.point in a.points_in_cluster
```
As a result, I should have a table like the following:
id | point |
---|---|
101 | 37 |
101 | 32 |
102 | 45 |
102 | 67 |
102 | 84 |
103 | 45 |
103 | 67 |
103 | 84 |
Answer:
Based on your data sample, I'd do an equi-join on the `point` column and then an `explode` on `points_in_cluster`:
```python
from pyspark.sql import functions as F

# assuming A is df_A and B is df_B
df_A.join(
    df_B,
    on="point"
).select(
    "id",
    F.explode("points_in_cluster").alias("point")
)
```
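To sanity-check the join-then-explode logic without a Spark session, you can replay it in plain Python on the sample rows. The dict lists below are hypothetical stand-ins for the two tables:

```python
# Stand-ins for tables A and B from the question (sample data only).
A = [
    {"point": 37, "points_in_cluster": [37, 32]},
    {"point": 45, "points_in_cluster": [45, 67, 84]},
    {"point": 67, "points_in_cluster": [45, 67, 84]},
    {"point": 84, "points_in_cluster": [45, 67, 84]},
    {"point": 32, "points_in_cluster": [37, 32]},
]
B = [
    {"id": 101, "point": 37},
    {"id": 102, "point": 67},
    {"id": 103, "point": 84},
]

# Equi-join on "point", then "explode" the array:
# one output row per element of points_in_cluster.
result = [
    (b["id"], p)
    for b in B
    for a in A
    if a["point"] == b["point"]
    for p in a["points_in_cluster"]
]
# result == [(101, 37), (101, 32), (102, 45), (102, 67), (102, 84),
#            (103, 45), (103, 67), (103, 84)]
```

This reproduces the expected output table row for row.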
Otherwise, you can use `array_contains`:
```sql
select b.id, a.point
from A a, B b
where array_contains(a.points_in_cluster, b.point)
```
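In plain Python terms, `array_contains(a.points_in_cluster, b.point)` is just a membership test applied to the cross join of the two tables. Replaying it on the sample rows (hypothetical stand-ins for the real tables) shows it also yields the expected result:

```python
# Stand-ins for tables A and B from the question (sample data only).
A = [
    {"point": 37, "points_in_cluster": [37, 32]},
    {"point": 45, "points_in_cluster": [45, 67, 84]},
    {"point": 67, "points_in_cluster": [45, 67, 84]},
    {"point": 84, "points_in_cluster": [45, 67, 84]},
    {"point": 32, "points_in_cluster": [37, 32]},
]
B = [
    {"id": 101, "point": 37},
    {"id": 102, "point": 67},
    {"id": 103, "point": 84},
]

# Cross join A and B, keep pairs where b.point occurs in
# a.points_in_cluster, and select (b.id, a.point).
result = [
    (b["id"], a["point"])
    for b in B
    for a in A
    if b["point"] in a["points_in_cluster"]
]
# result == [(101, 37), (101, 32), (102, 45), (102, 67), (102, 84),
#            (103, 45), (103, 67), (103, 84)]
```

Note that this variant reads `a.point` directly rather than exploding the array; both approaches agree because every cluster member appears as its own row in table A.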