Home > Software engineering >  Regex on Presto
Regex on Presto

Time:06-29

I have data like this: a35={"abc":"D1,9,12, 23, 24, 25, 26"} on a column.

What I want to do is, using Presto, extract only the id (on this case a35) on a new column if and only if I have "abc":"D1 on the beginning of the json.

So far I am trying to make this regex work for extracting the id, but no success, nothing returns from my query...

    --sample data
WITH dataset(id_str) AS (
  SELECT ('a35={"abc":"D1,9,12, 23, 24, 25, 26"}')
)

--query
SELECT regexp_extract_all(id_str, '"\b(?<id>\w{3})\=\{\"abc\"\:\"D1\,"')
FROM dataset;

Any ideas?

CodePudding user response:

A completely overkill method for this use case, but it might be useful for more advanced applications:

WITH dataset(id_str) AS (
    SELECT ('a35={"abc":"D1,9,12, 23, 24, 25, 26"}')
)
SELECT map_keys(
        map_filter(
            CAST(
                   json_parse(regexp_replace(id_str, '(\d|\w*)=(.*)', '{"$1":$2}') -- Converts string to valid JSON.
                ) as MAP<VARCHAR, JSON>) -- Converts JSON to MAP of JSON
            , (k,v) -> starts_with(CAST(json_extract(v, '$.abc') AS VARCHAR), 'D1')) -- Filters only the desired values
        ) as keys -- extracts the keys.
FROM dataset

Output:

keys
{a35}

CodePudding user response:

Actually the answer was correct, I just had two quotes more than I should on the regex. The correct answer is:

        --sample data
WITH dataset(id_str) AS (
  SELECT ('a35={"abc":"D1,9,12, 23, 24, 25, 26"}')
)

--query
SELECT regexp_extract_all(id_str, '\b(?<id>\w{3})\=\{\"abc\"\:\"D1\,')
FROM dataset;

For reference, I've validated the regex using: https://regex101.com/

  • Related