Home > Enterprise >  Extract from string in BigQuery using regexp_extract
Extract from string in BigQuery using regexp_extract

Time:12-15

I have a long string in BigQuery where that I need to extract out some data.

Part of the string looks like this:

... source: "agent" resolved_query: "hi" score: 0.61254 parameters ...

I want to extract out data such as agent, hi, and 0.61254.

I'm trying to use regexp_extract but I can't get the regexp to work correctly:

select
regexp_extract([col],r'score: [0-9]*\.[0-9] ') as score,
regexp_extract([col],r'source: [^"]*') as source
from [table]

What should the regexp be to just get agent or 0.61254 without the field name and no quotation marks?

Thank you in advance.

CodePudding user response:

I love non-trivial approaches - below one of such -

select * except(col) from (
  select col, split(kv, ': ')[offset(0)] key,
    trim(split(kv, ': ')[offset(1)], '"') value,
  from your_table,
  unnest(regexp_extract_all(col, r'\w : "?[\w.] "?')) kv
)
pivot (min(value) for key in ('source', 'resolved_query', 'score'))     

if applied to sample data as in your question

with your_table as (
  select '...  source: "agent"  resolved_query: "hi"  score: 0.61254  parameters ... ' col union all
  select '...  source: "agent2"  resolved_query: "hello"  score: 0.12345  parameters ... ' col 
)         

the output is

enter image description here

As you might noticed, the benefit of such approach is obvious - if you have more fields/attributes to extract - you do not need to clone the lines of code for each of attribute - you just add yet another value in last line's list - the whole code is always the same

CodePudding user response:

You can use

select
regexp_extract([col],r'score:\s*(\d*\.?\d )') as score,
regexp_extract([col],r'resolved_query:\s*"([^"]*)"') as resolved_query,
regexp_extract([col],r'source:\s*"([^"]*)"') as source
from [table]

Here,

  • score:\s*(\d*\.?\d ) matches score: string, then any zero or more whitespaces, and then there is a capturing group with ID=1 that captures zero or more digits, an optional . and then one or more digits
  • resolved_query:\s*"([^"]*)" matches a resolved_query: string, zero or more whitespaces, ", then captures into Group 1 any zero or more chars other than " and then matches a " char
  • source:\s*"([^"]*)" matches a source: string, zero or more whitespaces, ", then captures into Group 1 any zero or more chars other than " and then matches a " char.
  • Related