I have a spark dataframe similar to this:
| ... | Rules                      | ... |
|-----|----------------------------|-----|
| ... | [{"id": 1,...}, {"id": 2}] | ... |
| ... | [{"id": 5,...}]            | ... |
Rules is an array column, where each element has an id field.
I want to filter all rows which contain a rule with id < 3. Is it possible to do this without UDFs? My dataframe is very large and UDFs hurt the performance of my query.
CodePudding user response:
Extract the map values from each element of the array, take the max (or the min, depending on what you want) of the resulting array, and check whether it is less than 3. That returns a boolean, so you can then filter with the where clause.
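For reference, a minimal dataframe matching the question can be built like this (a sketch that assumes Rules is an array of maps with string keys, which is how it is displayed below):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: Rules as array<map<string,int>>
df = spark.createDataFrame(
    [([{"id": 1}, {"id": 2}],), ([{"id": 5}],)],
    "Rules array<map<string,int>>",
)
df.show(truncate=False)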
+----------------------+
|Rules                 |
+----------------------+
|[{id -> 1}, {id -> 2}]|
|[{id -> 5}]           |
+----------------------+
If you want to keep only the rows where every id is less than 3, use:
from pyspark.sql.functions import expr

df.where(expr("array_max(transform(Rules, x -> map_values(x)[0])) < 3")).show(truncate=False)
If you want to keep the rows where any id is less than 3 (as asked in the question), use:
df.where(expr("array_min(transform(Rules, x -> map_values(x)[0])) < 3")).show(truncate=False)
+----------------------+
|Rules                 |
+----------------------+
|[{id -> 1}, {id -> 2}]|
+----------------------+
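Note that the map_values step assumes each element of Rules is a map. If Rules is instead an array of structs (which is what the JSON in the question suggests), the field can be referenced directly; a sketch under that assumption:

from pyspark.sql.functions import expr

# Assuming Rules is array<struct<id:int,...>>: keep rows where any id is < 3
df.where(expr("array_min(transform(Rules, x -> x.id)) < 3")).show(truncate=False)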
CodePudding user response:
You can use EXISTS, one of the higher-order functions available in Spark SQL:
# Given dataset
+----------------------+
|Rules                 |
+----------------------+
|[{id -> 1}]           |
|[{id -> 1}, {id -> 2}]|
|[{id -> 5}]           |
+----------------------+
import pyspark.sql.functions as f

# Keep only the rows where no rule has an id >= 3, i.e. every id is < 3
df_filtered_rules = df.where(~f.expr("EXISTS(Rules, rule -> rule.id >= 3)"))
df_filtered_rules.show(truncate=False)
+----------------------+
|Rules                 |
+----------------------+
|[{id -> 1}]           |
|[{id -> 1}, {id -> 2}]|
+----------------------+
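If you are on Spark 3.1 or later, the same filter can also be written with the DataFrame-API exists function instead of an SQL expression string (a sketch under that assumption):

import pyspark.sql.functions as f

# Equivalent to the expr above: drop rows where any rule has id >= 3
df_filtered_rules = df.where(~f.exists("Rules", lambda rule: rule["id"] >= 3))
df_filtered_rules.show(truncate=False)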