I am using Spark SQL and Python for this task. I have data like this:
| date     | col1 | col2 |
| 01011245 | ABE  | ATL  |
| 01020600 | ABE  | DTW  |
| 01021245 | DTW  | ATL  |
| 01020605 | ABE  | ATL  |
I want to filter the data so that only rows whose col2 value appears somewhere in col1 (not necessarily in the same row) are kept.
For example, for the small data set above, the result should be:
| date     | col1 | col2 |
| 01020600 | ABE  | DTW  |
I think maybe a for loop with a condition on each row value should work, but I didn't succeed with it.
CodePudding user response:
Very, VERY rarely do you need to use a loop with any form of SQL database - it is set-based.
You can use a sub-query in the WHERE clause to filter, e.g.
SELECT `date`, col1, col2
FROM demo
WHERE col2 IN (SELECT DISTINCT col1 FROM demo)
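To run this from Python you first register the DataFrame as a temporary view; a minimal PySpark sketch, where the DataFrame name df and the view name demo are my assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data copied from the question
df = spark.createDataFrame(
    [("01011245", "ABE", "ATL"),
     ("01020600", "ABE", "DTW"),
     ("01021245", "DTW", "ATL"),
     ("01020605", "ABE", "ATL")],
    ["date", "col1", "col2"],
)

# Make the DataFrame visible to Spark SQL as `demo`
df.createOrReplaceTempView("demo")

spark.sql("""
    SELECT `date`, col1, col2
    FROM demo
    WHERE col2 IN (SELECT DISTINCT col1 FROM demo)
""").show()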
or you can use a self-join (with DISTINCT, since a row can match more than one row on the other side), e.g.
SELECT DISTINCT t1.`date`, t1.col1, t1.col2
FROM demo t1
INNER JOIN demo t2 ON t1.col2 = t2.col1
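The same self-join can be expressed with the DataFrame API instead of SQL; a sketch under the same assumptions (the df defined above):

from pyspark.sql import functions as F

# Join each row's col2 against every col1, keep only the left-hand
# columns, and de-duplicate in case a col2 value matches several rows
result = (df.alias("t1")
            .join(df.alias("t2"), F.col("t1.col2") == F.col("t2.col1"), "inner")
            .select("t1.date", "t1.col1", "t1.col2")
            .dropDuplicates())
result.show()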
CodePudding user response:
It would be more efficient to use EXISTS instead of IN, as the EXISTS keyword only evaluates to TRUE or FALSE, while the IN keyword compares against all values in the corresponding subquery column. If you use the IN operator, the SQL engine scans all records fetched by the inner query. On the other hand, with EXISTS the SQL engine can stop scanning as soon as it finds a match.
The query with EXISTS looks like:
SELECT `date`, col1, col2
FROM demo d
WHERE EXISTS (
    SELECT 1 FROM demo t
    WHERE t.col1 = d.col2
)
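In the DataFrame API, the idiomatic equivalent of EXISTS is a left semi join: it keeps a left-hand row as soon as one match is found and never duplicates it. A sketch, again assuming the df from above:

# Values that appear anywhere in col1
candidates = df.select(df["col1"].alias("c1_value"))

# left_semi returns only the columns of df, keeping each row whose col2
# has at least one match in c1_value -- the same semantics as EXISTS
result = df.join(candidates, df["col2"] == candidates["c1_value"], "left_semi")
result.show()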