Home > Back-end >  comparing two sources columns values in bigquery
comparing two sources columns values in bigquery

Time:10-27

I have a data set as following. To keep it simple I am showing only one column i.e qty but there are multiples for example (dates, prices etc). if any of the value within a single row is a mismatch then that row should show up as output along with the values from both sources, for example:

source 1

id  qty
1   100
2   null
3   0
4   50

source 2

id_ qty_
1   100
2   80
3   100

expected output:

id  qty       id_   qty_   
2   null      2     80
3   0         3     100
4   50        null  null

What i am trying to achieve is to fetch all the rows from source 2 that doesn't match with the source 1. Source 1 is taken as main source/table.

In the end the output should show only row 2,3,4 with both values so that i can build a boolean saying:

if (qty != qty_ , false, true) as is_qty_matching

An example with multiple values:

source 1

id  qty  price 
1   100   10
2   null  0
3   0     0
4   50    0
5   100   30

source 2

id_ qty_ price 
1   100  10
2   80   0
3   100  0
5   100  33

expected output:

    id  qty   price     id_     qty_   price_
    2   null    0         2     80     0
    3   0       0         3     100    0
    4   50      0         null  null   0
    5   100    30         5     100    33

In this case now id 5 has price mismatch

so my two new columns (is_price_matching) would show false for id 5 while true for other id', while the column (is_qty_matching) will show true for id 5 and false for others.

Similarly i will have multiple other columns and therefore multiple other boolean fields for each of the mismatch entry for example is_date_matching, is_time_matching etc.

what's important is that:

  • The values can be null in either of source for any of the column, this might mess up the boolean maybe
  • only the rows that has even one single mismatch value should show up in the final output, if there is no mismatch in any of the column values it should not show up

I have tried the following query:

with main as (
select 1 as id , 100 as qty , 50 as price
union all
select 2 as id , 0 as qty , 100 as price
union all
select 3 as id , 0 as qty ,80 as price
union all 
select 4 as id , 50 as qty , 90 as price
union all
select 5 as id , 20 as qty , 100 as price
union all
select 6 as id , 20 as qty , 100 as price
),
main2 as (
select 1 as id_, 100 as qty_ , 50 as price_
union all
select 2 as id_, 80 as qty_ , 100 as price_
union all
select 3 as id_, 100 as qty_ , 80 as price_
union all
select 5 as id_, 20 as qty_ , 100 as price_
union all
select 6 as id_, 20 as qty_ , 40 as price_
)
select 
main.*,
main2.*
from main 
left join main2 
on (main.id = main2.id_) 
WHERE coalesce(qty,0) != coalesce(qty_,0)
or coalesce(main.price,0) != coalesce(main2.price_,0)

it seems to work but i was wondering if there is a better solution ? plus if i add the following line of code

if(qty != qty_ , true , false) as is_qty_mismatch

it will return incorrect output where the values are null

CodePudding user response:

There are 2 things that your query doesn't account for.

  1. The second table having ID that's not present in 1st table
  2. You can't compare null to other values (it gives you null).

To fix the best problem changing join from left to full should work (row 31), for the second wrapping your logic with ifnull function. Also if the output for if functions are boolean values, than you can skip the function and keep just logical statement (rows 28 and 29)

with main as (
  select 1 as id , 100 as qty , 50 as price
  union all
  select 2 as id , 0 as qty , 100 as price
  union all
  select 3 as id , 0 as qty ,80 as price
  union all 
  select 4 as id , 50 as qty , 90 as price
  union all
  select 5 as id , 20 as qty , 100 as price
  union all
  select 6 as id , 20 as qty , 100 as price
),
main2 as (
  select 1 as id_, 100 as qty_ , 50 as price_
  union all
  select 2 as id_, 80 as qty_ , 100 as price_
  union all
  select 3 as id_, 100 as qty_ , 80 as price_
  union all
  select 5 as id_, 20 as qty_ , 100 as price_
  union all
  select 6 as id_, 20 as qty_ , 40 as price_
)
select 
  main.*,
  main2.*, 
  ifnull(qty != qty_ , true) as is_qty_mismatch, 
  ifnull(price != price_ , true) as is_qty_mismatch, 
from main 
full join main2 
  on (main.id = main2.id_) 
where coalesce(qty,0) != coalesce(qty_,0)
  or coalesce(main.price,0) != coalesce(main2.price_,0)

CodePudding user response:

Below might be an another option,

SELECT * FROM (
  SELECT * FROM (SELECT *, '1' src FROM main EXCEPT DISTINCT SELECT *, '1' src FROM main2)
   UNION ALL
  SELECT * FROM (SELECT *, '2' src FROM main2 EXCEPT DISTINCT SELECT *, '2' src FROM main) 
) PIVOT (ANY_VALUE(STRUCT(qty, price)) main FOR src IN ('1', '2'));

enter image description here

The EXCEPT operator returns rows from the left input query that are not present in the right input query.

Key is how to remove duplicated rows in both tables, you can consider below approach as well using an analytic function.

SELECT * FROM (
  SELECT * FROM (
    SELECT *, '1' src FROM main UNION ALL SELECT *, '2' FROM main2
  ) QUALIFY COUNT(1) OVER (PARTITION BY id, qty, price) = 1
) PIVOT (ANY_VALUE(STRUCT(qty, price)) main FOR src IN ('1', '2'));
-- output will be same as above
  • Related