Home > other >  How to validate with regex the format of an itinerary field in bigquery to extract the departure and
How to validate with regex the format of an itinerary field in bigquery to extract the departure and

Time:12-29

Good morning,

I was working on a regex to validate and extract some iata codes from an itinerary to create a departure and arrival column

I have for example this input:

123424-LYS-OPO-PAR
LYS-OPO
OPO-PAR
OPO-PAR-LYS
LYS-OPO-PAR-...-BKK
12312312-312-231-231

To get this:

departure arrival
LYS PAR
LYS OPO
OPO PAR
OPO LYS
LYS BKK
Nothing Nothing

To get this in BQ. I was trying:

select
    case
        when regexp_contains(iata.itinerary, r'^\W-|[A-Z]{3}-[A-Z]{3}')
            then split(regexp_substr(iata.itinerary), r'^\W|[A-Z]{3}-[A-Z]{3}'), '-')[offset(0)] 
        else 'Nothing'
    end as departure,
    case
        when regexp_contains(iata.itinerary, r'^\W-|[A-Z]{3}-[A-Z]{3}')
            then ARRAY_REVERSE(split(regexp_substr(upper(iata.itinerary), r'^\W|[A-Z]{3}-[A-Z]{3}'), '-'))[safe_offset(0)]
        else 'Nothing'
    end as arrival 
from iata_data iata

But I dont know how to deal with more than X number of iata codes in BQ

Thanks in advance

CodePudding user response:

You might consider below.

WITH iata_data AS (
  SELECT '123424-LYS-OPO-PAR' itinerary UNION ALL
  SELECT 'LYS-OPO' UNION ALL
  SELECT 'OPO-PAR' UNION ALL
  SELECT 'OPO-PAR-LYS' UNION ALL
  SELECT 'LYS-OPO-PAR-...-BKK' UNION ALL
  SELECT '12312312-312-231-231'
)
SELECT COALESCE(REGEXP_EXTRACT_ALL(iata.itinerary, '[A-Z]{3}')[SAFE_OFFSET(0)], 'Nothing') departure,
       COALESCE(ARRAY_REVERSE(REGEXP_EXTRACT_ALL(iata.itinerary, '[A-Z]{3}'))[SAFE_OFFSET(0)], 'Nothing') arrival
  FROM iata_data iata;

or,

SELECT COALESCE(codes[SAFE_OFFSET(0)], 'Nothing') departure,
       COALESCE(codes[SAFE_OFFSET(ARRAY_LENGTH(codes) - 1)], 'Nothing') arrival
  FROM iata_data iata, UNNEST([STRUCT(REGEXP_EXTRACT_ALL(iata.itinerary, '[A-Z]{3}') AS codes)]);

Query results

enter image description here

  • Related