How to use two columns as a single condition to get results in PySpark


I have:

+-----------+------+
|ColA       |ColB  |
+-----------+------+
|       A   |     B|
|       A   |     D|
|       C   |     U|
|       B   |     B|
|       A   |     B|
+-----------+------+

and I want to get:

+-----------+------+
|ColA       |ColB  |
+-----------+------+
|       A   |     D|
|       C   |     U|
|       B   |     B|
+-----------+------+

I want to "remove" all rows with the combination of "colA == A and colB == B". When I tried this SQL Statement

SELECT * FROM table where (colA != 'A' and colB != 'B')

worked fine.

But when I try to translate it to Spark (or even to pandas), I get an error:

Py4JError: An error occurred while calling o109.and. Trace:...

#spark (raises the Py4JError above)
sparkDF.where((sparkDF['colA'] != 'A' & sparkDF['colB'] != 'B')).show()

#pandas (fails for the same reason)
pandasDF[(pandasDF["colA"]!="A" & pandasDF["colB"]!="B")]

What am I doing wrong here?

CodePudding user response:

You need to add parentheses around each comparison and use | for bitwise OR. In Python, & binds more tightly than !=, so without parentheses the expression is parsed as colA != ('A' & colB) != 'B', which triggers the Py4JError. The condition also needs OR rather than AND: by De Morgan's law, NOT (colA == 'A' AND colB == 'B') is equivalent to (colA != 'A') OR (colB != 'B'):

pandasDF[(pandasDF["colA"]!="A") | (pandasDF["colB"]!="B")]

sparkDF.where((sparkDF['colA'] != 'A') | (sparkDF['colB'] != 'B')).show() 
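
An equivalent way to write the same filter is to negate the combined condition with ~, which states the intent ("drop the A/B combination") directly. The sketch below is self-contained and runnable; the local SparkSession setup and the sample data are assumptions added here for reproducibility, not part of the original question.

import pandas as pd
from pyspark.sql import SparkSession

# Assumed setup: a local SparkSession and the question's sample rows
spark = SparkSession.builder.master("local[*]").getOrCreate()
rows = [("A", "B"), ("A", "D"), ("C", "U"), ("B", "B"), ("A", "B")]

pandasDF = pd.DataFrame(rows, columns=["colA", "colB"])
sparkDF = spark.createDataFrame(rows, ["colA", "colB"])

# ~ negates the combined mask: keep every row except colA == 'A' AND colB == 'B'.
# By De Morgan's law this equals (colA != 'A') | (colB != 'B').
pandasDF[~((pandasDF["colA"] == "A") & (pandasDF["colB"] == "B"))]
sparkDF.where(~((sparkDF["colA"] == "A") & (sparkDF["colB"] == "B"))).show()

In Spark you can also keep the SQL syntax as a string expression: sparkDF.filter("NOT (colA = 'A' AND colB = 'B')").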