Big Query String Manipulation using SubQuery-CodePudding

I would appreciate a push in the right direction with how this might be achieved using GCP Big Query, please.

I have a column in my table of type string, inside this string there are a repeating sequence of characters and I need to extract and process each of them. To illustrate, lets say the column name is 'instruments'. A possible value for instruments could be:

'band=false;inst=basoon,inst=cello;inst=guitar;cases=false,permits=false'

In which case I need to extract 'basoon', 'cello' and 'guitar'.

I'm more or less a SQL newbie, sorry. So far I have:

SELECT 
    bandId,
    REGEXP_EXTRACT(instruments, r'inst=.*?\;') AS INSTS
   
FROM `inventory.band.mytable`;

This extracts the instruments substring ('inst=basoon,inst=cello;inst=guitar;') and gives me an output column 'INSTS' but now I think I need to split the values in that column on the comma and do some further processing. This is where I'm stuck as I cannot see how to structure additional queries or processing blocks.

How can I reference the INSTS in order to do subsequent processing? Documentation suggests I should be buildin subqueries using WITH but I can't seem to get anything going. Could some kind soul give me a push in the right direction, please?

CodePudding user response：

BigQuery has a function SPLIT() that does the same as SPLIT_PART() in other databases.

Assuming that you don't alternate between the comma and the semicolon for separating your «key»=«value» pairs, and only use the semicolon, first you split your instruments string into as many parts that contain inst=. To do that, you use an in-line table of consecutive integers to CROSS JOIN with, so that you can SPLIT(instruments,';',i) with an increasing integer value for i. You will get strings in the format inst=%, of which you want the part after the equal sign. You get that part by applying another SPLIT(), this time with the equal sign as the delimiter, and for the second split part:

WITH indata(bandid,instruments) AS (                                                                                                                                                         
-- some input, don't use in real query ...
-- I assume that you don't alternate between comma and semicolon for the delimiter, and stick to semicolon
  SELECT
  1,'band=false;inst=basoon;inst=cello;inst=guitar;cases=false;permits=false'
  UNION ALL
  SELECT
  2,'band=true;inst=drum;inst=cello;inst=bass;inst=flute;cases=false;permits=true'
  UNION ALL
  SELECT
  3,'band=false;inst=12string;inst=banjo;inst=triangle;inst=tuba;cases=false;permits=true'
)
-- real query starts here, replace following comma with "WITH" ...
,
-- need a series of consecutive integers ...
i(i) AS (
            SELECT 1 
  UNION ALL SELECT 2 
  UNION ALL SELECT 3 
  UNION ALL SELECT 4 
  UNION ALL SELECT 5 
  UNION ALL SELECT 6 
)
SELECT
  bandid
, i
, SPLIT(SPLIT(instruments,';',i),'=',2) AS instrument
FROM indata CROSS JOIN i
WHERE SPLIT(instruments,';',i) like 'inst=%'
ORDER BY 1
-- out  bandid | i | instrument 
-- out -------- --- ------------
-- out       1 | 2 | basoon
-- out       1 | 3 | cello
-- out       1 | 4 | guitar
-- out       2 | 2 | drum
-- out       2 | 3 | cello
-- out       2 | 4 | bass
-- out       2 | 5 | flute
-- out       3 | 2 | 12string
-- out       3 | 3 | banjo
-- out       3 | 4 | triangle
-- out       3 | 5 | tuba

CodePudding user response：

Consider below few options (just to demonstrate different technics here)

Option 1

select bandId, 
  ( select string_agg(split(kv, '=')[offset(1)])
    from unnest(split(instruments, ';')) kv
    where split(kv, '=')[offset(0)] = 'inst'
  ) as insts
from `inventory.band.mytable`

Option 2 (for obvious reason this one would be my choice)

select bandId, 
  array_to_string(regexp_extract_all(instruments, r'inst=([^;$] )'), ',') instrs 
from `inventory.band.mytable`

If applied to sample data in your question - output in both cases is