Right now, I have a view with a mess of common, conditional string replacement and substitutions for an open text field - in this example, regional classification.
(Please ignore the accuracy of geography, I'm just working with historical standard assignments. Also, I know I could speed things up with REPLACE or even just cleaning the RegEx statements for lookback - I'm just asking about the variable/nesting here.)
CREATE OR REPLACE FUNCTION public.region_cleanup(record_region text)
RETURNS text
LANGUAGE sql
STRICT
AS $function$
SELECT REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(record_region,'(NORTH AMERICA\s\-\sUSA\s\-\sUSA)','USA')
,'Rest\sof\sthe\sWorld\s\-\s','')
,'NORTH\sAMERICA\s\-\sCANADA','NORTH AMERICA - Canada')
,'\&\;','&')
,'Georgia\s\-\sGeorgia','MIDDLE EAST - Georgia')
,'EUROPE - Turkey','MIDDLE EAST - Turkey')
A sample output using this function would look like this in my dataset, pulling out records impacted (some are already in the correct format):
record_region_input | record_region_output |
---|---|
NORTH AMERICA - USA - USA - NORTHEAST - Massachusetts - Boston Metro | USA - NORTHEAST - Massachusetts - Boston Metro |
NORTH AMERICA - USA - USA - MIDATLANTIC - Virginia | USA - MIDATLANTIC - Virginia |
Rest of the World - ASIA - Thailand | ASIA - Thailand |
Rest of the World - EUROPE - Portugal | EUROPE - Portugal |
Rest of the World - ASIA - China - Shanghai Metro | ASIA - China - Shanghai Metro |
Georgia - Georgia | MIDDLE EAST - Georgia |
This is... fine. Regex is needed since there's tons of variability on what may come before or after these strings, and I have a proper validation list elsewhere. This is just a bulk scrub of common historical naming issues.
The problem is where I get hundreds of these kind of "known substitutions" (100 ) for things like company naming or cross-department standards. Having dozens and dozens of REGEXP_REPLACE(
nested statements makes editing/adding/dropping anything a maddening game of counting.
I'm trying to clean data within Postgres exclusively, since my current pipeline doesn't always allow for standardization prior to upload. I know how I'd tackle this cleanly outside of pure SQL, but in a 'vanilla' PostgreSQL instance (v12 ) is there a better method for transforming strings for a view?
Updated with a sample input/output table using the example function.
CodePudding user response:
If when you will split a string of data into additional regions then maybe replacing regions will be easy for you. For example:
with tb as (
select 1 as id, 'NORTH AMERICA - USA - USA - NORTHEAST - Massachusetts - Boston Metro' as record_region_input
union all
select 2 as id, 'NORTH AMERICA - USA - USA - MIDATLANTIC - Virginia'
union all
select 3 as id, 'Rest of the World - ASIA - China - Shanghai Metro'
)
select * from (
select distinct tb.id, unnest(string_to_array(record_region_input, ' - ')) as region from tb
order by tb.id
) a1 where a1.region not in ('NORTH AMERICA', 'Rest of the World');
-- Result:
1 Boston Metro
1 Massachusetts
1 NORTHEAST
1 USA
2 MIDATLANTIC
2 USA
2 Virginia
3 ASIA
3 China
3 Shanghai Metro
After then, for example, for duplicating regions you can use distinct, for unnecessary regions you can use NOT in, and you can use like '%ASIA%' to get all regions which contain ASIA
and etc. After all processes, you can merge the corrected string again. Example:
with tb as (
select 1 as id, 'NORTH AMERICA - USA - USA - NORTHEAST - Massachusetts - Boston Metro' as record_region_input
union all
select 2 as id, 'NORTH AMERICA - USA - USA - MIDATLANTIC - Virginia'
union all
select 3 as id, 'Rest of the World - ASIA - China - Shanghai Metro'
)
select a1.id, string_agg(a1.region, ' - ') from (
select distinct tb.id, unnest(string_to_array(record_region_input, ' - ')) as region from tb
order by tb.id
) a1 where a1.region not in ('NORTH AMERICA', 'Rest of the World')
group by a1.id
-- Return:
1 Boston Metro - Massachusetts - NORTHEAST - USA
2 MIDATLANTIC - USA - Virginia
3 ASIA - China - Shanghai Metro
This is a simple idea, maybe this idea helps you to replace regions.