Here are two columns of my dataframe (df):
A | B |
---|---|
["a"] | [["a"], ["b"]] |
["c"] | [["a"], ["b"]] |
I want to create a column that tells whether the array in column A is contained in the array of arrays in column B, like this:
A | B | C |
---|---|---|
["a"] | [["a"], ["b"]] | True |
["c"] | [["a"], ["b"]] | False |
I tried `df.select("A", "B", F.array_contains(F.col("B"), F.col("A")).alias("C"))`
but got an error:
`org.apache.spark.SparkRuntimeException: The feature is not supported: literal for '' of class java.util.ArrayList`
It seems that arrays of arrays aren't supported by `array_contains` in PySpark. Is there a workaround?
CodePudding user response:
You can use the `exists` and `transform` functions in a SQL `expr`.
df = df.withColumn('C', F.expr('exists(transform(B, x -> x == A), x -> x is true)'))
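To illustrate what the expression computes, here is a plain-Python model of the same logic (the helper names mirror the Spark SQL higher-order functions, purely for illustration): `transform` maps each inner array of B to a boolean, and `exists` then checks whether any of those booleans is true.

```python
def transform(arr, f):
    # Spark SQL's transform: apply f to every element of the array
    return [f(x) for x in arr]

def exists(arr, f):
    # Spark SQL's exists: true if f holds for at least one element
    return any(f(x) for x in arr)

B = [["a"], ["b"]]
# For A = ["a"]: transform yields [True, False], so exists is True
print(exists(transform(B, lambda x: x == ["a"]), lambda x: x is True))  # True
# For A = ["c"]: transform yields [False, False], so exists is False
print(exists(transform(B, lambda x: x == ["c"]), lambda x: x is True))  # False
```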
CodePudding user response:
You can check it using `arrays_overlap` after `flatten` is applied to the column "B".
F.arrays_overlap("A", F.flatten("B"))
Full test:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[(["a"], [["a"], ["b"]]),
(["c"], [["a"], ["b"]])],
['A', 'B'])
df = df.withColumn("C", F.arrays_overlap("A", F.flatten("B")))
df.show()
# +---+----------+-----+
# |  A|         B|    C|
# +---+----------+-----+
# |[a]|[[a], [b]]| true|
# |[c]|[[a], [b]]|false|
# +---+----------+-----+
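One caveat worth noting: `flatten` concatenates the inner arrays of B, and `arrays_overlap` then checks for any shared element, so this compares individual elements rather than whole inner arrays. A plain-Python model of the semantics (helper names mirror the Spark functions, purely for illustration) shows where the two approaches diverge:

```python
from itertools import chain

def flatten(arrays):
    # Spark's flatten: concatenate the inner arrays into one array
    return list(chain.from_iterable(arrays))

def arrays_overlap(a, b):
    # Spark's arrays_overlap: true if the two arrays share any element
    return any(x in b for x in a)

B = [["a"], ["b"]]
print(arrays_overlap(["a"], flatten(B)))       # True
print(arrays_overlap(["c"], flatten(B)))       # False
# Caveat: a multi-element A overlaps as soon as one element matches,
# even though ["a", "c"] itself is not an inner array of B
print(arrays_overlap(["a", "c"], flatten(B)))  # True
```

For single-element arrays like those in the question this matches the `exists`/`transform` answer; for multi-element arrays in A, the results can differ.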