Home > Mobile >  How to pivot records to multi-hot encoding in BigQuery?
How to pivot records to multi-hot encoding in BigQuery?

Time:07-23

I have a records table:

CREATE TABLE `records` 
(
    `ID1` INT64,
    `ID2` DATE,
    `EVENT` INT64,
);

This table has some records like:

INSERT `records` (ID1, ID2, EVENT)
VALUES (1, PARSE_DATE('%Y-%m-%d', '2022-01-10'), 111),
       (1, PARSE_DATE('%Y-%m-%d', '2022-01-10'), 111),
       (1, PARSE_DATE('%Y-%m-%d', '2022-01-10'), 222),
       (1, PARSE_DATE('%Y-%m-%d', '2022-03-15'), 333),
       (1, PARSE_DATE('%Y-%m-%d', '2022-03-15'), 111),
       (1, PARSE_DATE('%Y-%m-%d', '2022-07-22'), 444),
       (2, PARSE_DATE('%Y-%m-%d', '2021-02-17'), 999),
       (2, PARSE_DATE('%Y-%m-%d', '2021-02-17'), 888),
       (2, PARSE_DATE('%Y-%m-%d', '2021-02-17'), 888),
       (2, PARSE_DATE('%Y-%m-%d', '2021-02-17'), 555),
       (2, PARSE_DATE('%Y-%m-%d', '2021-02-17'), 666),
       (2, PARSE_DATE('%Y-%m-%d', '2022-07-22'), 111)   

And I have a separate dictionary (or vocabulary) table:

CREATE TABLE `t_vocabs` 
(
    `EVENT` INT64,
    `DESCRIPTION` STRING,
);

Table t_vocabs contains all unique values for EVENT:

INSERT `t_vocabs` (EVENT, DESCRIPTION)
VALUES (111, 'desc1'),
       (222, 'desc2'),
       (333, 'desc3'),
       (444, 'desc4'),
       (555, 'desc5'),
       (666, 'desc6'),
       (777, 'desc7'),
       (888, 'desc8'),
       (999, 'desc9'),
       (123, 'desc10'),
       (456, 'desc111')

Therefore, records would look like this:

enter image description here

And the vocabulary table would look like this:

enter image description here

I need to convert records to multi-hot encoding in which for each ID1 and ID2 I have one record, with the number of columns equal to the unique number of events in t_vocabs and the values equal to 1 when that relevant event exists in records and 0 otherwise.

For this example, it'd be like:

enter image description here

Note, I have tried

SELECT 
    * 
FROM
    `records`
PIVOT
    (COUNT(*) FOR EVENT IN (111, 222, 333, 444, 555, 666, 777, 888, 999, 123, 456))

And this does the job, but the issue is that the t_vocabs is a big table and I can't list all EVENT values manually in this query. When I try to select events from another table (e.g. following query) it doesn't like the select:

SELECT * 
FROM `records`
PIVOT(COUNT(*) FOR EVENT IN (SELECT EVENT FROM `t_vocabs`))

CodePudding user response:

Use below

execute immediate (select '''
select * from records 
pivot (count(*) for event in (''' || string_agg('' || event, ',' order by event) || "))"
from t_vocabs
)      

if applied to sample data in your question - output is

enter image description here

  • Related