Filter all rows based on existence of element in array column


I have a spark dataframe similar to this:

... Rules ...
... [{"id": 1,...}, {"id": 2}] ...
... [{"id": 5,...}] ...

Rules is an array column where each element has an id field.

I want to filter all rows which contain a rule with id < 3. Is it possible to do this without UDFs? My dataframe is very large, and UDFs impair the performance of my query.
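
For reference, here is one way to build a dataframe with this shape (a minimal sketch; the schema is an assumption, with Rules modeled as an array of maps to match the answers below):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: Rules as an array of string -> int maps
df = spark.createDataFrame(
    [([{"id": 1}, {"id": 2}],), ([{"id": 5}],)],
    "Rules array<map<string,int>>",
)
df.show(truncate=False)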

CodePudding user response:

Extract the map values from the array, find the max (or min, depending on what you want) value in the array, and check that it is less than 3. That returns a boolean, which you can then filter on with a where clause.

+----------------------+
|Rules                 |
+----------------------+
|[{id -> 1}, {id -> 2}]|
|[{id -> 5}]           |
+----------------------+
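
To see what the extraction step produces on its own, you can project just the transformed column (a sketch using the same column name):

from pyspark.sql.functions import expr

# Pull the first map value out of each element, yielding an array of ids per row
df.select(expr("transform(Rules, x -> map_values(x)[0])").alias("ids")).show(truncate=False)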

If you want to keep only the rows where every id is less than 3, use:

df.where(expr("array_max(transform(Rules, x-> map_values(x)[0]))<3")).show(truncate=False)

If you want to keep rows where any id is less than 3, use:

df.where(expr("array_min(transform(Rules, x-> map_values(x)[0]))<3")).show(truncate=False)

+----------------------+
|Rules                 |
+----------------------+
|[{id -> 1}, {id -> 2}]|
+----------------------+
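
Note that map_values(x)[0] takes the first value of each map, which assumes id is the only (or first) key. If the maps may carry other keys, looking the value up by its key is a safer variant (a sketch under the same schema assumption):

from pyspark.sql.functions import expr

# Same filter, but fetch the id explicitly by key instead of by position
df.where(expr("array_max(transform(Rules, x -> x['id'])) < 3")).show(truncate=False)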

CodePudding user response:

You can use EXISTS, one of the higher-order functions available in Spark SQL:

# Given dataset
+----------------------+
|Rules                 |
+----------------------+
|[{id -> 1}]           |
|[{id -> 1}, {id -> 2}]|
|[{id -> 5}]           |
+----------------------+

import pyspark.sql.functions as f

df_filtered_rules = df.where(~f.expr("EXISTS(Rules, rule -> rule.id >= 3)"))
df_filtered_rules.show(truncate=False)

+----------------------+
|Rules                 |
+----------------------+
|[{id -> 1}]           |
|[{id -> 1}, {id -> 2}]|
+----------------------+
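
Since Spark 3.1 the same higher-order functions are also exposed on the Column API, so the SQL string can be avoided entirely (a sketch, assuming the same Rules schema):

import pyspark.sql.functions as f

# Keep rows where no rule has id >= 3 (same logic as the EXISTS expression above)
df_filtered_rules = df.where(~f.exists("Rules", lambda rule: rule["id"] >= 3))

# Equivalently, stated positively: keep rows where every rule has id < 3
df_filtered_rules = df.where(f.forall("Rules", lambda rule: rule["id"] < 3))
df_filtered_rules.show(truncate=False)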