PySpark AND Partition Column Query


How do you perform an AND query in PySpark? I want both conditions to be met for rows to be filtered.

Original dataframe:

df.count()
4105

The first condition on its own filters out no records:

df.filter((df.created_date != 'add')).count()
4105

Therefore I would expect the AND clause here to return 4105, but instead it continues to filter on df.lastUpdatedDate:

df.filter((df.created_date != 'add') & (df.lastUpdatedDate != '2022-12-21')).count()
3861

To me, 3861 looks like the result of an OR clause. How do I address this? lastUpdatedDate is a partition filter according to .explain(), so maybe that has something to do with these results?

...PartitionFilters: [isnotnull(lastUpdatedDate#26), NOT (lastUpdatedDate#26 = 2022-12-21)], PushedFilters: [IsNotNull(created_date), Not(EqualTo(created_date,add))], ReadSchema ...

CodePudding user response:

In PySpark, you can use the where function to filter data from a DataFrame based on a logical expression. If you want to filter on multiple conditions, including partitioning columns, you can use the & operator to combine filter expressions.

For example, suppose you have a DataFrame called df with partitioning columns year, month, and day. You can filter the data to select only rows where year equals 2020 and month equals 1 as follows:

filtered_df = df.where((df.year == 2020) & (df.month == 1))
filtered_df.count()

filter is an alias for where, so the following is equivalent:

filtered_df = df.filter((df.year == 2020) & (df.month == 1))

Note that the parentheses around each comparison are important: in Python, the bitwise operators & and | bind more tightly than comparison operators such as ==, so omitting them changes how the expression is parsed.
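For instance (a minimal sketch, assuming the same hypothetical df with year and month columns), dropping the parentheses changes the parse:

# Works: each comparison is wrapped, so & combines two boolean Columns
df.where((df.year == 2020) & (df.month == 1))

# Fails: & binds tighter than ==, so Python parses this as
# df.year == (2020 & df.month) == 1, a chained comparison that
# tries to convert a Column to bool and raises an error
df.where(df.year == 2020 & df.month == 1)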

You can also use the | operator to combine filter expressions so that a row is kept if at least one of the expressions evaluates to True. For example, to select rows where year equals 2020 or month equals 1, you can use the following code:

filtered_df = df.where((df.year == 2020) | (df.month == 1))
filtered_df.count()
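Negation uses the ~ operator on a parenthesised expression. By De Morgan's law, negating an AND of two conditions is the same as an OR of their negations (a sketch using the same hypothetical columns):

# NOT (year == 2020 AND month == 1) ...
df.where(~((df.year == 2020) & (df.month == 1)))

# ... is equivalent to: year != 2020 OR month != 1
df.where((df.year != 2020) | (df.month != 1))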

CodePudding user response:

Going by our conversation in the comments, your requirement is to filter out rows where (df.created_date != 'add') & (df.lastUpdatedDate != '2022-12-21').

Your confusion seems to stem from the name of the method, filter; think of it as where instead.

where and filter by definition work the same as a SQL WHERE clause, i.e. they retain the rows where the expression returns true and drop the rest.

For example, consider this DataFrame:

+---+-----+-------+
| id| Name|Subject|
+---+-----+-------+
|  1|  Sam|History|
|  2|  Amy|History|
|  3|Harry|  Maths|
|  4| Jake|Physics|
+---+-----+-------+
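For reference, a minimal sketch to reproduce this DataFrame (assuming an existing SparkSession; the sample data is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build the example DataFrame shown above
rightDF = spark.createDataFrame(
    [(1, "Sam", "History"), (2, "Amy", "History"),
     (3, "Harry", "Maths"), (4, "Jake", "Physics")],
    ["id", "Name", "Subject"],
)
rightDF.show()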

The filter below returns a new DataFrame with only the rows where Subject is History, i.e. where the expression returned true (rows where it is false are filtered out):

rightDF.filter(rightDF["Subject"] == "History").show()

OR

rightDF.where(rightDF["Subject"] == "History").show()

Output:

+---+----+-------+
| id|Name|Subject|
+---+----+-------+
|  1| Sam|History|
|  2| Amy|History|
+---+----+-------+

So in your case, you would want to negate the statements to get the results you desire.

i.e. use == (equal to) instead of != (not equal to):

df.filter((df.created_date == 'add') & (df.lastUpdatedDate == '2022-12-21')).count()

This keeps rows where both conditions are True and filters out rows where either of them is False.
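If you want to double-check the AND semantics from the question (a quick sketch, assuming the same df), the filter and its negation should split the DataFrame between them:

cond = (df.created_date != 'add') & (df.lastUpdatedDate != '2022-12-21')

kept = df.filter(cond).count()      # rows where both conditions hold
dropped = df.filter(~cond).count()  # rows where at least one fails

# kept + dropped equals df.count() unless a column contains nulls;
# a comparison against null is neither True nor False in Spark SQL,
# so such rows are excluded from both counts
print(kept, dropped, kept + dropped)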

Let me know if this works, or if you need anything else.
