I have to apply the logical OR operator to a list of conditions in the where function in PySpark. Since the operator for OR in PySpark is |, I cannot use Python's built-in any() function on the list. Does anybody have a suggestion for how to solve this?
Below is a simple example:
# List of conditions
spark_conditions = [cond1, cond2, ..., cond100]
# Apply somehow the '|' operator on `spark_conditions`
# spark_conditions would look like -> [cond1 | cond2 | .... | cond100]
df.select(columns).where(spark_conditions)
I appreciate any help, thank you!
CodePudding user response:
I think this is actually a pandas question, as spark.sql.DataFrame seems at least to behave like a pandas DataFrame. But I don't know Spark. In any case, your 'spark conditions' are actually (I think) boolean series. I'm sure there is a proper way to combine boolean series in pandas, but you can also just reduce them like this:
import pandas as pd
from functools import reduce

df = pd.DataFrame([0, 1, 2, 2, 1, 4], columns=["num"])
filter1 = df["num"] > 3
filter2 = df["num"] == 2
filter3 = df["num"] == 1
filters = (filter1, filter2, filter3)
combined = reduce(lambda x, y: x | y, filters)
df[combined]  # boolean indexing; in PySpark, .where is an alias for .filter
This works like this: reduce() takes the first two things in filters and runs lambda x, y: x | y on them. Then it takes the output of that and passes it as x to lambda x, y: x | y, taking the third entry in filters and passing it as y. It keeps going until it hasn't got anything left to take. So the net effect is to apply a function cumulatively along an iterable. In this case the function just returns the | of its inputs, so it does exactly what you would do manually, but like this:
(filter1 | filter2) | filter3
I suspect there's a more pandasy or sparky way of doing that, but reduce is sometimes worth having. Guido doesn't like it though.
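The same trick carries over to PySpark directly, because Column expressions also support |. As a minimal sketch (my own illustration, assuming the conditions are PySpark Column objects; the id column and sample rows are made up):
from functools import reduce
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (4, "d"), (6, "f")], schema="id int, name string")

# Illustrative stand-ins for cond1 ... cond100 from the question
spark_conditions = [F.col("id") == 1, F.col("id") == 4, F.col("id") == 6]

# Fold the list with |, producing (cond1 | cond2) | cond3 | ...
combined = reduce(lambda x, y: x | y, spark_conditions)
df.where(combined).show()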
CodePudding user response:
2e0byo's answer is pretty much correct. I am adding another way this can be done in PySpark.
If our conditions are strings of SQL conditional expressions (like col_1 == 'ABC101'), we can combine all those strings and feed the combined string as a condition to where() (or filter()).
df = spark.createDataFrame([(1, "a"),
                            (2, "b"),
                            (3, "c"),
                            (4, "d"),
                            (5, "e"),
                            (6, "f"),
                            (7, "g")], schema="id int, name string")
condition1 = "id == 1"
condition2 = "id == 4"
condition3 = "id == 6"
conditions = [condition1, condition2, condition3]
combined_or_condition = " or ".join(conditions) # Combine the conditions: condition1 or condition2 or condition3
df.where(combined_or_condition).show()
" or ".join(conditions)
creates a string by joining all the strings present in conditions
using or
as the delimiter/joiner/combiner. Here, combined_or_condition
becomes id == 1 or id == 4 or id == 6
.
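One caveat I would add (my own note, not part of the answer above): if any of the individual condition strings mixes in its own and/or, it is safer to wrap each string in parentheses before joining, so operator precedence does not change the intended grouping:
# Hypothetical conditions that mix operators; parentheses keep each one grouped
conditions = ["id == 1", "id > 4 and name == 'f'"]
combined_or_condition = " or ".join(f"({c})" for c in conditions)
# combined_or_condition is now "(id == 1) or (id > 4 and name == 'f')"
df.where(combined_or_condition).show()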
CodePudding user response:
I think you can use the summation of the list, since Python booleans behave as 0 and 1:
>>> a = True
>>> b = False
>>> a + b
1
>>> c = False
>>> c + b
0
Then check whether the summation is greater than zero.
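As a plain-Python sketch of that idea (my illustration, not PySpark code): the sum of a list of booleans is greater than zero exactly when at least one of them is True, which is what any() reports:
conditions = [False, True, False]
print(sum(conditions) > 0)  # True
print(any(conditions))      # True, the equivalent built-in check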