I am using Spark SQL and Python for this task. I have data like this:
| date     | col1 | col2 |
| 01011245 | ABE  | ATL  |
| 01020600 | ABE  | DTW  |
| 01021245 | DTW  | ATL  |
| 01020605 | ABE  | ATL  |
I want to filter the data so that only rows whose col2 value appears somewhere in col1 (not necessarily in the same row) are kept.
For example, for the small data set above, the result should be:
| date     | col1 | col2 |
| 01020600 | ABE  | DTW  |
I think maybe a for loop with a condition on each row value should work, but I didn't succeed with it.
CodePudding user response:
Very, VERY rarely do you need to use a loop with any form of SQL database - it is set-based.
You can use a sub-query in the WHERE clause to filter, e.g.
SELECT `date`, col1, col2
FROM demo
WHERE col2 IN (SELECT DISTINCT col1 FROM demo)
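To run this from Python you first register the DataFrame as a temporary view; a minimal PySpark sketch, where the DataFrame name df and the view name demo are my assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data copied from the question
df = spark.createDataFrame(
    [("01011245", "ABE", "ATL"),
     ("01020600", "ABE", "DTW"),
     ("01021245", "DTW", "ATL"),
     ("01020605", "ABE", "ATL")],
    ["date", "col1", "col2"],
)

# Make the DataFrame visible to Spark SQL as `demo`
df.createOrReplaceTempView("demo")

spark.sql("""
    SELECT `date`, col1, col2
    FROM demo
    WHERE col2 IN (SELECT DISTINCT col1 FROM demo)
""").show()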
or you can use a self-join (with DISTINCT, since a row can match more than one row on the other side), e.g.
SELECT DISTINCT t1.`date`, t1.col1, t1.col2
FROM demo t1
INNER JOIN demo t2 ON t1.col2 = t2.col1
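The same self-join can be expressed with the DataFrame API instead of SQL; a sketch under the same assumptions (the df defined above):

from pyspark.sql import functions as F

# Join each row's col2 against every col1, keep only the left-hand
# columns, and de-duplicate in case a col2 value matches several rows
result = (df.alias("t1")
            .join(df.alias("t2"), F.col("t1.col2") == F.col("t2.col1"), "inner")
            .select("t1.date", "t1.col1", "t1.col2")
            .dropDuplicates())
result.show()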
CodePudding user response:
It would be more efficient to use EXISTS instead of IN, as the EXISTS keyword only evaluates to TRUE or FALSE, while the IN keyword compares against all values in the corresponding subquery column. If you use the IN operator, the SQL engine scans all records fetched by the inner query. On the other hand, with EXISTS the SQL engine can stop scanning as soon as it finds a match.
The query with EXISTS looks like:
SELECT `date`, col1, col2
FROM demo d
WHERE EXISTS (
    SELECT 1 FROM demo t
    WHERE t.col1 = d.col2
)
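In the DataFrame API, the idiomatic equivalent of EXISTS is a left semi join: it keeps a left-hand row as soon as one match is found and never duplicates it. A sketch, again assuming the df from above:

# Values that appear anywhere in col1
candidates = df.select(df["col1"].alias("c1_value"))

# left_semi returns only the columns of df, keeping each row whose col2
# has at least one match in c1_value -- the same semantics as EXISTS
result = df.join(candidates, df["col2"] == candidates["c1_value"], "left_semi")
result.show()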