Big query query is too complex after pivot-CodePudding

Assume I have the following table table and a list of interests (cat, dog, music, soccer, coding)

|   userId | user_interest  |  label   |
| -------- | -------------- |----------|
| 12345    | cat            |    1     |
| 12345    | dog            |    1     | 
| 6789     | music          |    1     |
| 6789     | soccer         |    1     |

I want to transform the user interest into a binary array (i.e. binarization), and the resulting table will be something like

|   userId | labels         |  
| -------- | -------------- |
| 12345    | [1,1,0,0,0]    |   
| 6789     | [0,0,1,1,0]    |

I am able to do it with PIVOT and ARRAY, e.g.

WITH user_interest_pivot AS (
      SELECT
        *
      FROM (
        SELECT userId, user_interest, label FROM table
          ) AS T
      PIVOT
      (
           MAX(label) FOR user_interestc IN  ('cat', 'dog', 'music', 'soccer', 'coding') 
 
      ) AS P
)

  SELECT
    userId,
    ARRAY[IFNULL(cat,0), IFNULL(dog,0), IFNULL(music,0), IFNULL(soccer,0), IFNULL(coding,0)] AS labels,
  FROM user_interea_pivot

HOWEVER, in reality I have a very long list of interests, and the above method in bigquery seems to not work due to

Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too comple

Please help to let me know if there is anything I can do to deal with this situation. Thanks!

CodePudding user response：

Still it's likely to face resource problem depending on your real data, but it is worth trying the following approach without PIVOT.

Create interests table with additional index column first

 ---------- ----- ----------------- 
| interest | idx | total_interests |
 ---------- ----- ----------------- 
| cat      |   0 |               5 |
| dog      |   1 |               5 |
| music    |   2 |               5 |
| soccer   |   3 |               5 |
| coding   |   4 |               5 |
 ---------- ----- -----------------

find idx of each user interest and aggreage them like below. (assuming that user intererest is sparse over overall interests)

    SELECT userId, ARRAY_AGG(idx) user_interests
      FROM sample_table t JOIN interests i ON t.user_interest = i.interest
     GROUP BY 1

Lastly, create labels vector using a sparse user interest array and dimension of interest space (i.e. total_interests) like below

       ARRAY(SELECT IF(ui IS NULL, 0, 1)
               FROM UNNEST(GENERATE_ARRAY(0, total_interests - 1)) i
               LEFT JOIN t.user_interests ui ON i = ui
              ORDER BY i
       ) AS labels

Query

CREATE TEMP TABLE sample_table AS
SELECT '12345' AS userId, 'cat' AS user_interest, 1 AS label UNION ALL
SELECT '12345' AS userId, 'dog' AS user_interest, 1 AS label UNION ALL
SELECT '6789' AS userId, 'music' AS user_interest, 1 AS label UNION ALL
SELECT '6789' AS userId, 'soccer' AS user_interest, 1 AS label;

CREATE TEMP TABLE interests AS 
  SELECT *, COUNT(1) OVER () AS total_interests 
    FROM UNNEST(['cat', 'dog', 'music', 'soccer', 'coding']) interest 
    WITH OFFSET idx
;

SELECT userId,
       ARRAY(SELECT IF(ui IS NULL, 0, 1)
               FROM UNNEST(GENERATE_ARRAY(0, total_interests - 1)) i
               LEFT JOIN t.user_interests ui ON i = ui
              ORDER BY i
       ) AS labels
  FROM (
    SELECT userId, total_interests, ARRAY_AGG(idx) user_interests
      FROM sample_table t JOIN interests i ON t.user_interest = i.interest
     GROUP BY 1, 2
  ) t;

Query results

CodePudding user response：

I think below approach will "survive" any [reasonable] data

create temp function base10to2(x float64) returns string 
language js as r'return x.toString(2);';
with your_table as (
  select '12345' as userid, 'cat' as user_interest, 1 as label union all
  select '12345' as userid, 'dog' as user_interest, 1 as label union all
  select '6789' as userid, 'music' as user_interest, 1 as label union all
  select '6789' as userid, 'soccer' as user_interest, 1 as label
), interests as ( 
  select *, pow(2, offset) weight, max(offset   1) over() as len
    from unnest(['cat', 'dog', 'music', 'soccer', 'coding']) user_interest 
    with offset
)
select userid, 
  split(rpad(reverse(base10to2(sum(weight))), any_value(len), '0'), '') labels, 
from your_table
join interests 
using(user_interest)
group by userid

with output