I am lost and trying to figure out the reason I am getting different outputs for the same logic. I want to find out all calls (variants and non-variants) for each call set, and omits any call with a non-PASS filter.
The query is given at the link:
#standardSQL
SELECT
call.name AS call_name,
COUNT(1) AS number_of_calls
FROM
`bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, v.call
WHERE
NOT EXISTS (SELECT 1 FROM UNNEST(call.FILTER) AS call_filter WHERE call_filter != 'PASS')
GROUP BY
call_name
ORDER BY
call_name
Row | call_name | number_of_calls |
---|---|---|
1 | NA12877 | 29795946 |
2 | NA12878 | 26118774 |
3 | NA12889 | 29044992 |
4 | NA12890 | 28717437 |
5 | NA12891 | 31395995 |
6 | NA12892 | 25349974 |
This returns the number of rows that have filter as PASS.
But, when I try a similar logic I get a different result.
#standardSQL
SELECT
call.name AS call_name,
COUNT(1) AS number_of_calls
FROM
`bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, v.call
WHERE
EXISTS (SELECT 1 FROM UNNEST(call.FILTER) AS call_filter WHERE call_filter = 'PASS')
GROUP BY
call_name
ORDER BY
call_name
Row | call_name | number_of_calls |
---|---|---|
1 | NA12877 | 4488086 |
2 | NA12878 | 4503443 |
3 | NA12889 | 4423974 |
4 | NA12890 | 4529950 |
5 | NA12891 | 4425316 |
6 | NA12892 | 4497085 |
Why is this? Is my logic wrong? Can someone explain why empty array is a "PASS" in this case? Thank you for taking the time to help me!
CodePudding user response:
where NOT EXISTS (SELECT 1 FROM UNNEST(call.FILTER) AS call_filter WHERE call_filter != 'PASS')
and
where EXISTS (SELECT 1 FROM UNNEST(call.FILTER) AS call_filter WHERE call_filter = 'PASS')
Are not the same. First one just ignores columns with all filters not pass, means resuts will be "all passed" "some passed".
And second one just returns "all passed" results.
CodePudding user response:
Thank you @shawnt00 for providing the solution.
I will demonstrate with a few smaller examples that the extra counts are indeed empty arrays.
Query 1
with smalltable as (
select ["PASS"] as filter, 'NA12877' as name union all
select ["PASS"], 'NA12877' union all
select ["PASS"], 'NA12879' union all
select ["PASS",'RefCall'], 'NA12877' union all
select ["RefCall"], 'NA12877' union all
select ["RefCall"], 'NA12877' union all
select ["RefCall"], 'NA12879' union all
select ["RefCall"], 'NA12879' union all
select ["RefCall"], 'NA12879' union all
select ["RefCall"], 'NA12879', union all
select [], 'NA12879'
)
select name, count(1) as count
from smalltable
where not exists (select 1 from unnest(filter) as f where f != 'PASS')
group by name
Output:
Row | name | count |
---|---|---|
1 | NA12877 | 2 |
2 | NA12879 | 2 |
As we can see the query considers [] as 'PASS'. I am not sure why?
Query 2
with smalltable as (
select ["PASS"] as filter, 'NA12877' as name union all
select ["PASS"], 'NA12877' union all
select ["PASS"], 'NA12879' union all
select ["PASS",'RefCall'], 'NA12877' union all
select ["RefCall"], 'NA12877' union all
select ["RefCall"], 'NA12877' union all
select ["RefCall"], 'NA12879' union all
select ["RefCall"], 'NA12879' union all
select ["RefCall"], 'NA12879' union all
select ["RefCall"], 'NA12879', union all
select [], 'NA12879'
)
select name, count(1) as count
from smalltable
where exists (select 1 from unnest(filter) as f where f = 'PASS')
group by name
Output:
Row | name | count |
---|---|---|
1 | NA12877 | 3 |
2 | NA12879 | 1 |
This as expected does not count the empty array.