How to flatten an Array Struct to columns in Google BigQuery-CodePudding

Consider the table sample below where data is of type array<struct<key:string,value:string>>, with the repeated keys: 'Date', 'Country' and 'Brand':

source	data.key	data.value
first_file	Date	2022-12-14
	Country	Germany
	Brand	Mercedes
	Date	2022-12-15
	Country	Germany
	Brand	BMW
second_file	Date	2022-12-13
	Country	Sweden
	Brand	Volvo
	Date	2022-12-10
	Country	France
	Brand	Renault

By 'repeated' keys I mean that every data.key entry always consists of those keys (Date, Country, Brand). In this example, they get repeated twice per row entry, but in the real table they might get repeated even more times per unique entry. My desired result is:

source	date	country	brand
first_file	2022-12-14	Germany	Mercedes
first_file	2022-12-15	Germany	BMW
second_file	2022-12-13	Sweden	Volvo
second_file	2022-12-10	France	Renault

Any help on how I can reach that result?

If it helps, I've managed to turn the sample table into the format below in case you'd like to try a solution to this table instead:

source	date.key	date.value	country.key	country.value	brand.key	brand.value
first_file	Date	2022-12-14	Country	Germany	Brand	Mercedes
first_file	Date	2022-12-15	Country	Germany	Brand	BMW
second_file	Date	2022-12-13	Country	Sweden	Brand	Volvo
second_file	Date	2022-12-10	Country	France	Brand	Renault

Thanks!

CodePudding user response：

WITH
  tmp AS (
  SELECT
    source,
    key,
    value
  FROM
    UNNEST(ARRAY<STRUCT<source string, data ARRAY<STRUCT<key string, value string>>>>[ 
      ("first_file", [("Date","2022-12-14"), ("Country","Germany"),("Brand","Mercedes")]),
      ("second_file", [("Date","2022-12-13"), ("Country","Sweden"),("Brand","Volvo")])
      ]),
    UNNEST(data) ) -- unnest data first
SELECT
  source, date[SAFE_OFFSET(0)] as date, country[SAFE_OFFSET(0)] as country, brand[SAFE_OFFSET(0)] as brand, 
FROM
  tmp PIVOT (
    ARRAY_AGG(value IGNORE NULLS) FOR key IN ("Date", "Country", "Brand")) -- pivot table

CodePudding user response：

You can use the following query :

with sources AS 
(
  select 
    'first_file' as source,
    [
      struct('Date' as key, '2022-12-14' as value),
      struct('Country' as key, 'Germany' as value),
      struct('Brand' as key, 'Mercedes' as value)
    ] as data
  UNION ALL
  select 
    'second_file' as source,
    [
      struct('Date' as key, '2022-12-13' as value),
      struct('Country' as key, 'Sweden' as value),
      struct('Brand' as key, 'Volvo' as value)
    ] as data
)

select
  source,
  (SELECT value FROM UNNEST(data) WHERE key = 'Date') AS Date,
  (SELECT value FROM UNNEST(data) WHERE key = 'Country') AS Country,
  (SELECT value FROM UNNEST(data) WHERE key = 'Brand') AS Brand,
from sources;

You can also use an udf to centralize the logic :

CREATE TEMP FUNCTION getValue(k STRING, arr ANY TYPE) AS
((SELECT value FROM UNNEST(arr) WHERE key = k));

with sources AS 
(
  select 
    'first_file' as source,
    [
      struct('Date' as key, '2022-12-14' as value),
      struct('Country' as key, 'Germany' as value),
      struct('Brand' as key, 'Mercedes' as value)
    ] as data
  UNION ALL
  select 
    'second_file' as source,
    [
      struct('Date' as key, '2022-12-13' as value),
      struct('Country' as key, 'Sweden' as value),
      struct('Brand' as key, 'Volvo' as value)
    ] as data
)
SELECT 
  source,
  getValue('Date', data) AS Date,
  getValue('Country', data) AS Country,
  getValue('Brand', data) AS Brand
FROM sources;

The result is :