Regexp to parse string into two separate fields based on the last characters in Oracle SQL-CodePudding

I'm working on the ETL for a warehouse management system and I've come across a warehouse location map that's all over the place in terms of naming conventions so my usual method of parsing the warehouse location into warehouse location and warehouse location level isn't working. I think REGEXP can be used here, but I don't know it that well.

By looking at the last 5 characters, you can tell if it needs to be parsed. Here's the logic

*If First 3 of the last 5 are numbers and Last 2 are letters. Return anything left of last 2 as "ETL Location". Return last 2 as "Level ID".

If First 2 of the last 5 are numbers and 3rd character of last 5 is letter and Last 2 charcters are numbers. Return anything left of last 3 as "ETL Location". Return last 3 as "Level ID"

If first 4 of last 5 are numbers and last character is letter. Return anything left of last character as "ETL Location". Return last 1 as "Level ID"*

This is the desired result.

Given Location	ETL Location	Level
A103AB	A103	AB
A103A01	A103	A01
A103A02	A103	A02
A103456	A103456	null
A103A	A103	A
A103B	A103	B
ABCDEFG	ABCDEFG	null

Can Regexp be used here?

I've done the following, but I'm still getting incorrect results since there's a lot of different naming conventions. If I do it this way and the warehouse makes new locations, logic like this will probably get broken at some point even if it works currently so I'm thinking identifying what is a "Level" and going from there is a better approach here.

case
    when loc_tp_code in ('RACK','PICK','RKL2','PLNW')
    and length(loc_code) = '5'
    THEN substr(loc_code,1,4)
    when loc_tp_code in ('RACK','PICK','RKL2','PLNW')
    and length(loc_code) = '7'
    THEN substr(loc_code,1,6)
    when loc_tp_code in ('RACK','PICK','RKL2','PLNW')
    and length(loc_code) = '6'
    THEN substr(loc_code,1,5)
    when loc_tp_code in ('RACK','PICK','RKL2','PLNW')
    and length(loc_code) = '8'
    and loc_code like 'Z%'
    THEN substr(loc_code,1,6)
    when loc_tp_code in ('RACK','PICK','RKL2','PLNW')
    and length(loc_code) = '8'
    THEN substr(loc_code,1,5)
    ELSE loc_code
END AS "Location",
    case
    when loc_tp_code in ('RACK','PICK','RKL2','PLNW')
    and length(loc_code) = '5'
    THEN substr(loc_code,5,1)
    when loc_tp_code in ('RACK','PICK','RKL2','PLNW')
    and length(loc_code) = '7'
    THEN substr(loc_code,7,1)
    when loc_tp_code in ('RACK','PICK','RKL2','PLNW')
    and length(loc_code) = '6'
    THEN substr(loc_code,6,1)
    when loc_tp_code in ('RACK','PICK','RKL2','PLNW')
    and length(loc_code) = '8'
    and loc_code like 'Z%'
    THEN substr(loc_code,7,2)
    when loc_tp_code in ('RACK','PICK','RKL2','PLNW')
    and length(loc_code) = '8'
    THEN substr(loc_code,6,3)
    ELSE null
END AS "Level ID"

CodePudding user response：

Something like 1 CASE for "ETL location" and 1 for "LEVEL":

    case substr(location,-5)
        when regexp_substr( substr(location,-5), '\d{3}[A-Z]{2}' ) then 
            substr(location, 1, length(location)-5) || substr(location, length(location)-4,3)
        when regexp_substr( substr(location,-5), '\d{2}[A-Z]\d{2}' ) then 
            substr(location, 1, length(location)-5) || substr(location, length(location)-4,3)
        when
    ...
    end as etl_location,
    case substr(location,-5)
        when regexp_substr( substr(location,-5), '\d{3}[A-Z]{2}' ) then 
            substr(location, -2)
        when regexp_substr( substr(location,-5), '\d{2}[A-Z]\d{2}' ) then 
            substr(location, -3)
        when
...
    end as lvl

CodePudding user response：

Based on your test data, I think you can simplify your rule. Here I describe the given location as a regular expression, grouping the components (with parenthesis), then returning them based on the rule of etl-location starts with a letter, then is followed by either all letters to the end OR one or more digits to the end OR one or more digits followed by a letter and whatever else until the end (made optional to hande the cases of NULL warehouse level). In the case of an etl_location, the first group is returned and for the level the 4th group is returned.

You have to study your data to see if this indeed matches your naming rules.

with tbl(data) as (
  select 'A103AB' from dual union all
  select 'A103A01' from dual union all
  select 'A103A02' from dual union all
  select 'A103456' from dual union all
  select 'A103A' from dual union all
  select 'A103B' from dual union all
  select 'ABCDEFG' from dual
)
select data given_loc, 
       regexp_substr(data, '(([A-Z])([A-Z] $|\d $|\d ))([A-Z].*)?', 1, 1, null, 1) ETL_loc,
       regexp_substr(data, '(([A-Z])([A-Z] $|\d $|\d ))([A-Z].*)?', 1, 1, null, 4) wh_level
from tbl;


GIVEN_LOC ETL_LOC WH_LEVEL
--------- ------- --------
A103AB    A103    AB      
A103A01   A103    A01     
A103A02   A103    A02     
A103456   A103456         
A103A     A103    A       
A103B     A103    B       
ABCDEFG   ABCDEFG         

7 rows selected.