Alternative to nesting regex replace in PostgreSQL functions?-CodePudding

Right now, I have a view with a mess of common, conditional string replacement and substitutions for an open text field - in this example, regional classification.

(Please ignore the accuracy of geography, I'm just working with historical standard assignments. Also, I know I could speed things up with REPLACE or even just cleaning the RegEx statements for lookback - I'm just asking about the variable/nesting here.)

    CREATE OR REPLACE FUNCTION public.region_cleanup(record_region text)
     RETURNS text
     LANGUAGE sql
     STRICT
    AS $function$
    SELECT  REGEXP_REPLACE(
            REGEXP_REPLACE(
            REGEXP_REPLACE(
            REGEXP_REPLACE(
            REGEXP_REPLACE(
            REGEXP_REPLACE(record_region,'(NORTH AMERICA\s\-\sUSA\s\-\sUSA)','USA')
            ,'Rest\sof\sthe\sWorld\s\-\s','')
            ,'NORTH\sAMERICA\s\-\sCANADA','NORTH AMERICA - Canada')
            ,'\&amp\;','&')
            ,'Georgia\s\-\sGeorgia','MIDDLE EAST - Georgia')
            ,'EUROPE - Turkey','MIDDLE EAST - Turkey')

A sample output using this function would look like this in my dataset, pulling out records impacted (some are already in the correct format):

record_region_input	record_region_output
NORTH AMERICA - USA - USA - NORTHEAST - Massachusetts - Boston Metro	USA - NORTHEAST - Massachusetts - Boston Metro
NORTH AMERICA - USA - USA - MIDATLANTIC - Virginia	USA - MIDATLANTIC - Virginia
Rest of the World - ASIA - Thailand	ASIA - Thailand
Rest of the World - EUROPE - Portugal	EUROPE - Portugal
Rest of the World - ASIA - China - Shanghai Metro	ASIA - China - Shanghai Metro
Georgia - Georgia	MIDDLE EAST - Georgia

This is... fine. Regex is needed since there's tons of variability on what may come before or after these strings, and I have a proper validation list elsewhere. This is just a bulk scrub of common historical naming issues.

The problem is where I get hundreds of these kind of "known substitutions" (100 ) for things like company naming or cross-department standards. Having dozens and dozens of REGEXP_REPLACE( nested statements makes editing/adding/dropping anything a maddening game of counting.

I'm trying to clean data within Postgres exclusively, since my current pipeline doesn't always allow for standardization prior to upload. I know how I'd tackle this cleanly outside of pure SQL, but in a 'vanilla' PostgreSQL instance (v12 ) is there a better method for transforming strings for a view?

Updated with a sample input/output table using the example function.

CodePudding user response：

If when you will split a string of data into additional regions then maybe replacing regions will be easy for you. For example:

with tb as (
    select 1 as id, 'NORTH AMERICA - USA - USA - NORTHEAST - Massachusetts - Boston Metro' as record_region_input
    union all 
    select 2 as id, 'NORTH AMERICA - USA - USA - MIDATLANTIC - Virginia'
    union all 
    select 3 as id, 'Rest of the World - ASIA - China - Shanghai Metro' 
)
select * from (
    select distinct tb.id, unnest(string_to_array(record_region_input, ' - ')) as region from tb 
    order by tb.id 
) a1 where a1.region not in ('NORTH AMERICA', 'Rest of the World');

-- Result: 
1   Boston Metro
1   Massachusetts
1   NORTHEAST
1   USA
2   MIDATLANTIC
2   USA
2   Virginia
3   ASIA
3   China
3   Shanghai Metro

After then, for example, for duplicating regions you can use distinct, for unnecessary regions you can use NOT in, and you can use like '%ASIA%' to get all regions which contain ASIA and etc. After all processes, you can merge the corrected string again. Example:

with tb as (
    select 1 as id, 'NORTH AMERICA - USA - USA - NORTHEAST - Massachusetts - Boston Metro' as record_region_input
    union all 
    select 2 as id, 'NORTH AMERICA - USA - USA - MIDATLANTIC - Virginia'
    union all 
    select 3 as id, 'Rest of the World - ASIA - China - Shanghai Metro' 
)
select a1.id, string_agg(a1.region, ' - ')  from (
    select distinct tb.id, unnest(string_to_array(record_region_input, ' - ')) as region from tb 
    order by tb.id 
) a1 where a1.region not in ('NORTH AMERICA', 'Rest of the World')
group by a1.id 

-- Return: 
1   Boston Metro - Massachusetts - NORTHEAST - USA
2   MIDATLANTIC - USA - Virginia
3   ASIA - China - Shanghai Metro

This is a simple idea, maybe this idea helps you to replace regions.