Home > Enterprise >  array_contains() on array of structs Column is not Iterable
array_contains() on array of structs Column is not Iterable

Time:07-13

I have 2 columns that has this schema:

root
 |-- parent_column: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- item_1: integer (nullable = true)
 |    |    |-- item_2: long (nullable = true)
 |    |    |-- item_3: integer (nullable = true)
 |    |    |-- item_4: boolean (nullable = true)
 |-- child_column: struct (nullable = false)
 |    |-- item_1: integer (nullable = true)
 |    |-- item_2: long (nullable = true)
 |    |-- item_3: integer (nullable = true)
 |    |-- item_4: boolean (nullable = false)

I wanted to check if the child_column exists in the parent_column by doing array_contains(F.col('parent_column'), F.col('child_column')) but I am running into Column is not Iterable error.

Sample data:

 ---------------------------------------------- -------------------------------------------- -------------- 
|parent_column                                 | child_column                               | data_check   |
 ---------------------------------------------- -------------------------------------------- -------------- 
|[[1, 2, 3, 4, false]]                         | [1, 2, 3, 4, false]                        |    true      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
|[[1, 2, 3, 4, false]]                         | [6, 7, 8, 9, false]                        |   false      |
 ---------------------------------------------- -------------------------------------------- -------------- 

Sample runnable code:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [([(2, 2, 2,)],)],
    'parent_column:array<struct<item_1:bigint,item_2:bigint,item_3:bigint>>'
)

df = df.withColumn(
    'child_column',
    F.expr("transform(parent_column, x -> struct(x.item_1 as item_1, x.item_2 as item_2, x.item_3 as item_3))")
)

# WITH ERRORS
# df = df.withColumn(
#     'contains',
#     F.array_contains(F.col('parent_column'), F.col('child_column'))
# )

df.show(truncate=False)

In my mind I am checking if a struct exists in an array of structs. So I am not sure why I am getting this error. Any tips?

CodePudding user response:

Seems like you sample data is off. I fixed it. See child column definition. Not sure if this is your problem with the original query.

>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame(
...     [([(2, 2, 2,)],)],
...     'parent_column:array<struct<item_1:bigint,item_2:bigint,item_3:bigint>>'
... )
>>> 
>>> df = df.withColumn(
...     'child_column',
...     F.expr("transform(parent_column, x -> struct(x.item_1 as item_1, x.item_2 as item_2, x.item_3 as item_3))")
... )
>>> df = df.withColumn(
...     'child_column',
...     F.expr("transform(parent_column, x -> struct(x.item_1 as item_1, x.item_2 as item_2, x.item_3 as item_3))")[0])
>>> df.withColumn( 'contains',expr(" array_contains(parent_column, child_column )" )).show()
 ------------- ------------ -------- 
|parent_column|child_column|contains|
 ------------- ------------ -------- 
|  [[2, 2, 2]]|   [2, 2, 2]|    true|
 ------------- ------------ -------- 
  • Related