I have 2 columns that has this schema:
root
|-- parent_column: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- item_1: integer (nullable = true)
| | |-- item_2: long (nullable = true)
| | |-- item_3: integer (nullable = true)
| | |-- item_4: boolean (nullable = true)
|-- child_column: struct (nullable = false)
| |-- item_1: integer (nullable = true)
| |-- item_2: long (nullable = true)
| |-- item_3: integer (nullable = true)
| |-- item_4: boolean (nullable = false)
I wanted to check if the child_column
exists in the parent_column
by doing array_contains(F.col('parent_column'), F.col('child_column'))
but I am running into Column is not Iterable
error.
Sample data:
---------------------------------------------- -------------------------------------------- --------------
|parent_column | child_column | data_check |
---------------------------------------------- -------------------------------------------- --------------
|[[1, 2, 3, 4, false]] | [1, 2, 3, 4, false] | true |
|[[1, 2, 3, 4, false]] | [6, 7, 8, 9, false] | false |
|[[1, 2, 3, 4, false]] | [6, 7, 8, 9, false] | false |
|[[1, 2, 3, 4, false]] | [6, 7, 8, 9, false] | false |
|[[1, 2, 3, 4, false]] | [6, 7, 8, 9, false] | false |
|[[1, 2, 3, 4, false]] | [6, 7, 8, 9, false] | false |
|[[1, 2, 3, 4, false]] | [6, 7, 8, 9, false] | false |
|[[1, 2, 3, 4, false]] | [6, 7, 8, 9, false] | false |
|[[1, 2, 3, 4, false]] | [6, 7, 8, 9, false] | false |
|[[1, 2, 3, 4, false]] | [6, 7, 8, 9, false] | false |
|[[1, 2, 3, 4, false]] | [6, 7, 8, 9, false] | false |
|[[1, 2, 3, 4, false]] | [6, 7, 8, 9, false] | false |
|[[1, 2, 3, 4, false]] | [6, 7, 8, 9, false] | false |
|[[1, 2, 3, 4, false]] | [6, 7, 8, 9, false] | false |
|[[1, 2, 3, 4, false]] | [6, 7, 8, 9, false] | false |
|[[1, 2, 3, 4, false]] | [6, 7, 8, 9, false] | false |
|[[1, 2, 3, 4, false]] | [6, 7, 8, 9, false] | false |
|[[1, 2, 3, 4, false]] | [6, 7, 8, 9, false] | false |
|[[1, 2, 3, 4, false]] | [6, 7, 8, 9, false] | false |
|[[1, 2, 3, 4, false]] | [6, 7, 8, 9, false] | false |
---------------------------------------------- -------------------------------------------- --------------
Sample runnable code:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[([(2, 2, 2,)],)],
'parent_column:array<struct<item_1:bigint,item_2:bigint,item_3:bigint>>'
)
df = df.withColumn(
'child_column',
F.expr("transform(parent_column, x -> struct(x.item_1 as item_1, x.item_2 as item_2, x.item_3 as item_3))")
)
# WITH ERRORS
# df = df.withColumn(
# 'contains',
# F.array_contains(F.col('parent_column'), F.col('child_column'))
# )
df.show(truncate=False)
In my mind I am checking if a struct exists in an array of structs. So I am not sure why I am getting this error. Any tips?
CodePudding user response:
Seems like you sample data is off. I fixed it. See child column definition. Not sure if this is your problem with the original query.
>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame(
... [([(2, 2, 2,)],)],
... 'parent_column:array<struct<item_1:bigint,item_2:bigint,item_3:bigint>>'
... )
>>>
>>> df = df.withColumn(
... 'child_column',
... F.expr("transform(parent_column, x -> struct(x.item_1 as item_1, x.item_2 as item_2, x.item_3 as item_3))")
... )
>>> df = df.withColumn(
... 'child_column',
... F.expr("transform(parent_column, x -> struct(x.item_1 as item_1, x.item_2 as item_2, x.item_3 as item_3))")[0])
>>> df.withColumn( 'contains',expr(" array_contains(parent_column, child_column )" )).show()
------------- ------------ --------
|parent_column|child_column|contains|
------------- ------------ --------
| [[2, 2, 2]]| [2, 2, 2]| true|
------------- ------------ --------