I have a spark dataframe similar to this:
| ... | Rules                      | ... |
|-----|----------------------------|-----|
| ... | [{"id": 1,...}, {"id": 2}] | ... |
| ... | [{"id": 5,...}]            | ... |
Rules is an array column, where each element has an id field.
I want to filter all rows which contain a rule with id < 3. Is it possible to do this without UDFs? My dataframe is very large and UDFs hurt the performance of my query.
CodePudding user response:
Extract the map values from each element of the array, take the max (or the min, depending on what you want) of the resulting array, and check whether it is less than 3. That returns a boolean, so you can then filter with the where clause.
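For reference, a minimal dataframe matching the question can be built like this (a sketch that assumes Rules is an array of maps with string keys, which is how it is displayed below):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: Rules as array<map<string,int>>
df = spark.createDataFrame(
    [([{"id": 1}, {"id": 2}],), ([{"id": 5}],)],
    "Rules array<map<string,int>>",
)
df.show(truncate=False)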
+----------------------+
|Rules                 |
+----------------------+
|[{id -> 1}, {id -> 2}]|
|[{id -> 5}]           |
+----------------------+
If you want to keep only the rows where every id is less than 3, use:
from pyspark.sql.functions import expr

df.where(expr("array_max(transform(Rules, x -> map_values(x)[0])) < 3")).show(truncate=False)
If you want to keep the rows where any id is less than 3 (as asked in the question), use:
df.where(expr("array_min(transform(Rules, x -> map_values(x)[0])) < 3")).show(truncate=False)
+----------------------+
|Rules                 |
+----------------------+
|[{id -> 1}, {id -> 2}]|
+----------------------+
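Note that the map_values step assumes each element of Rules is a map. If Rules is instead an array of structs (which is what the JSON in the question suggests), the field can be referenced directly; a sketch under that assumption:

from pyspark.sql.functions import expr

# Assuming Rules is array<struct<id:int,...>>: keep rows where any id is < 3
df.where(expr("array_min(transform(Rules, x -> x.id)) < 3")).show(truncate=False)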
CodePudding user response:
You can use EXISTS, one of the higher-order functions available in Spark SQL:
# Given dataset
+----------------------+
|Rules                 |
+----------------------+
|[{id -> 1}]           |
|[{id -> 1}, {id -> 2}]|
|[{id -> 5}]           |
+----------------------+
import pyspark.sql.functions as f

# Keep only the rows where no rule has an id >= 3, i.e. every id is < 3
df_filtered_rules = df.where(~f.expr("EXISTS(Rules, rule -> rule.id >= 3)"))
df_filtered_rules.show(truncate=False)
+----------------------+
|Rules                 |
+----------------------+
|[{id -> 1}]           |
|[{id -> 1}, {id -> 2}]|
+----------------------+
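If you are on Spark 3.1 or later, the same filter can also be written with the DataFrame-API exists function instead of an SQL expression string (a sketch under that assumption):

import pyspark.sql.functions as f

# Equivalent to the expr above: drop rows where any rule has id >= 3
df_filtered_rules = df.where(~f.exists("Rules", lambda rule: rule["id"] >= 3))
df_filtered_rules.show(truncate=False)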