I have to apply the logical OR operator to a list of conditions in the where function in PySpark. Since the operator for OR in PySpark is |, I cannot use Python's built-in any() function on the list. Does anybody have a suggestion for how to solve this?
Below is a simple example:
# List of conditions
spark_conditions = [cond1, cond2, ..., cond100]
# Apply somehow the '|' operator on `spark_conditions`
# spark_conditions would look like -> [cond1 | cond2 | .... | cond100]
df.select(columns).where(spark_conditions)
I appreciate any help, thank you!
CodePudding user response:
I think this is actually a pandas question, as spark.sql.DataFrame seems at least to behave like a pandas DataFrame. But I don't know Spark. In any case, your 'spark conditions' are actually (I think) boolean series. I'm sure there is a proper way to combine boolean series in pandas, but you can also just reduce them like this:
import pandas as pd
from functools import reduce

df = pd.DataFrame([0, 1, 2, 2, 1, 4], columns=["num"])
filter1 = df["num"] > 3
filter2 = df["num"] == 2
filter3 = df["num"] == 1
filters = (filter1, filter2, filter3)
combined = reduce(lambda x, y: x | y, filters)
df[combined]  # boolean indexing; in PySpark, .where is an alias for .filter
This works like this: reduce() takes the first two things in filters and runs lambda x, y: x | y on them. Then it takes the output of that and passes it as x to lambda x, y: x | y, taking the third entry in filters and passing it as y. It keeps going until it hasn't got anything left to take. So the net effect is to apply a function cumulatively along an iterable. In this case the function just returns the | of its inputs, so it does exactly what you would do manually, but like this:
(filter1 | filter2) | filter3
I suspect there's a more pandasy or sparky way of doing that, but reduce is sometimes worth having. Guido doesn't like it though.
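The same trick carries over to PySpark directly, because Column expressions also support |. As a minimal sketch (my own illustration, assuming the conditions are PySpark Column objects; the id column and sample rows are made up):
from functools import reduce
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (4, "d"), (6, "f")], schema="id int, name string")

# Illustrative stand-ins for cond1 ... cond100 from the question
spark_conditions = [F.col("id") == 1, F.col("id") == 4, F.col("id") == 6]

# Fold the list with |, producing (cond1 | cond2) | cond3 | ...
combined = reduce(lambda x, y: x | y, spark_conditions)
df.where(combined).show()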
CodePudding user response:
2e0byo's answer is pretty much correct. I am adding another way this can be done in PySpark.
If our conditions are strings of SQL conditional expressions (like col_1 == 'ABC101'), we can combine all those strings and feed the combined string as a condition to where() (or filter()).
df = spark.createDataFrame([(1, "a"),
                            (2, "b"),
                            (3, "c"),
                            (4, "d"),
                            (5, "e"),
                            (6, "f"),
                            (7, "g")], schema="id int, name string")
condition1 = "id == 1"
condition2 = "id == 4"
condition3 = "id == 6"
conditions = [condition1, condition2, condition3]
combined_or_condition = " or ".join(conditions) # Combine the conditions: condition1 or condition2 or condition3
df.where(combined_or_condition).show()
" or ".join(conditions)
creates a string by joining all the strings present in conditions
using or
as the delimiter/joiner/combiner. Here, combined_or_condition
becomes id == 1 or id == 4 or id == 6
.
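One caveat I would add (my own note, not part of the answer above): if any of the individual condition strings mixes in its own and/or, it is safer to wrap each string in parentheses before joining, so operator precedence does not change the intended grouping:
# Hypothetical conditions that mix operators; parentheses keep each one grouped
conditions = ["id == 1", "id > 4 and name == 'f'"]
combined_or_condition = " or ".join(f"({c})" for c in conditions)
# combined_or_condition is now "(id == 1) or (id > 4 and name == 'f')"
df.where(combined_or_condition).show()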
CodePudding user response:
I think you can use the summation of the list, since Python booleans behave as 0 and 1:
>>> a = True
>>> b = False
>>> a + b
1
>>> c = False
>>> c + b
0
Then check whether the summation is greater than zero.
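As a plain-Python sketch of that idea (my illustration, not PySpark code): the sum of a list of booleans is greater than zero exactly when at least one of them is True, which is what any() reports:
conditions = [False, True, False]
print(sum(conditions) > 0)  # True
print(any(conditions))      # True, the equivalent built-in check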