Bigquery: Joining 2 tables one having repeated records and one with count ()-CodePudding

I want to join tables after unnest arrays in Table:1 but the records duplicated after the join because of the unnest.

Table:1

| a | d.b | d.c |
-----------------
| 1 |  5  |  2  |
-   -------------
|   |  3  |  1  |
-----------------
| 2 |  2  |  1  |

Table:2

| a |  c  |  f  |
-----------------
| 1 |  12 | 13  |
-----------------
| 2 | 14  | 15  |

I want to join table 1 and 2 on a but I need also to have the output of:

| a | d.b | d.c |  f  |  h  |  Sum(count(a))
---------------------------------------------
| 1 |  5  |  2  | 13  | 12  |
-   -------------     -     -       1
|   |  3  |  1  |     |     |
---------------------------------------------
| 2 |  2  |  1  | 15  | 14  |       1

a can be repeated in table 2 for that I need to count(a) then select the sum after join.

My problem is when I'm joining I need the nested and repeated record to be the same as in the first table but when use aggregation to get the sum I can't group by struct or arrays so I UNNEST the records first then use ARRAY_AGG function but also there was an issue in the sum.

SELECT
 t1.a,
 t2.f,
 t2.h,
 ARRAY_AGG(DISTINCT(t1.db)) as db,
 ARRAY_AGG(DISTINCT(t1.dc)) as dc,
 SUM(t2.total) AS total
FROM (
  SELECT
    a,
    d.b as db,
    d.c as dc
  FROM
    `table1`,
    UNNEST(d) AS d,
  ) AS t1
LEFT JOIN (
  SELECT
    a,
    f,
    h,
    COUNT(*) AS total,
  FROM
    `table2`
  GROUP BY
    a,f,h) AS t2
ON
  t1.a = t2.a
GROUP BY
  1,
  2,
  3

Note: the error is in the total number after the sum it is much higher than expected all other data are correct.

CodePudding user response：

I guess your table 2 contains is not unique for column a.

Lets assume that the table 2 looks like this:

a	c	f
1	12	13
2	14	15
1	100	101

There are two rows where a is 1. Since b and f are different, the grouping does not solve this ( GROUP BY a,f,h) AS t2) and counts(*) as total is one for each row.

a	c	f	total
1	12	13	1
2	14	15	1
1	100	101	1

In the next step you join this table to your table 1. The rows of table1 with value 1 in column a are duplicated, because table2 has two entries. This lead to the fact that the sum is too high.

Instead of unnesting the tables, I recommend following approach:

-- Creating of sample data as given:
with tbl_A as (select  1 a, [struct(5 as b,2 as c),struct(3,1)] d union all select 2,[struct(2,1)] union all select null,[struct(50,51)]),
tbl_B as (select 1 as a,12 b, 13 f union all select 2,14,15 union all select 1,100,101 union all select null,500,501)

-- Query:

select *
from tbl_A A
left join
(Select a,array_agg(struct(b,f)) as B, count(1) as counts from tbl_B group by 1) B
on ifnull(A.a,-9)=ifnull(B.a,-9)