I have a long string in BigQuery where that I need to extract out some data.
Part of the string looks like this:
... source: "agent" resolved_query: "hi" score: 0.61254 parameters ...
I want to extract out data such as agent
, hi
, and 0.61254
.
I'm trying to use regexp_extract but I can't get the regexp to work correctly:
select
regexp_extract([col],r'score: [0-9]*\.[0-9] ') as score,
regexp_extract([col],r'source: [^"]*') as source
from [table]
What should the regexp be to just get agent
or 0.61254
without the field name and no quotation marks?
Thank you in advance.
CodePudding user response:
I love non-trivial approaches - below one of such -
select * except(col) from (
select col, split(kv, ': ')[offset(0)] key,
trim(split(kv, ': ')[offset(1)], '"') value,
from your_table,
unnest(regexp_extract_all(col, r'\w : "?[\w.] "?')) kv
)
pivot (min(value) for key in ('source', 'resolved_query', 'score'))
if applied to sample data as in your question
with your_table as (
select '... source: "agent" resolved_query: "hi" score: 0.61254 parameters ... ' col union all
select '... source: "agent2" resolved_query: "hello" score: 0.12345 parameters ... ' col
)
the output is
As you might noticed, the benefit of such approach is obvious - if you have more fields/attributes to extract - you do not need to clone the lines of code for each of attribute - you just add yet another value in last line's list - the whole code is always the same
CodePudding user response:
You can use
select
regexp_extract([col],r'score:\s*(\d*\.?\d )') as score,
regexp_extract([col],r'resolved_query:\s*"([^"]*)"') as resolved_query,
regexp_extract([col],r'source:\s*"([^"]*)"') as source
from [table]
Here,
score:\s*(\d*\.?\d )
matchesscore:
string, then any zero or more whitespaces, and then there is a capturing group with ID=1 that captures zero or more digits, an optional.
and then one or more digitsresolved_query:\s*"([^"]*)"
matches aresolved_query:
string, zero or more whitespaces,"
, then captures into Group 1 any zero or more chars other than"
and then matches a"
charsource:\s*"([^"]*)"
matches asource:
string, zero or more whitespaces,"
, then captures into Group 1 any zero or more chars other than"
and then matches a"
char.