Home > Software design >  Extract substrings based on multiple conditions in postgresql
Extract substrings based on multiple conditions in postgresql

Time:08-05

I have following column in a postgres table.

col1                                start    end
p.[A138T;S160G;D221_E222delinsGK]   138      138
p.[A138T;S160G;D221_E222delinsGK]   160      160
p.[A138T;S160G;D221_E222delinsGK]   221      222

I am looking for a way to split the ; separated values in multiple rows. The expected output is:

col1                                start    end
p.A138T                             138      138
p.S160G                             160      160
p.D221_E222delinsGK                 221      222

I am using following query but it does not works for the third row.

select
       case 
            when start = "end" and col1 like 'p.[%' then 'p.'||(regexp_match(col1, '([A-Z]'||start||'[A-Z])'))[1] 
            when start != "end" and col1 like 'p.[%' then 'p.'||(regexp_match(col1, '[A-Z\d _delins] '||start||'[A-Z\d _delins] '))[1] 
            else col1,
                   
            start,
            end
from table

Any help is highly appreciated!

CodePudding user response:

I prefer substring() to extract strings based on a regex, so I don't have to deal with the array that regexp_match() returns.

select case 
          when col1 like 'p.%' and start = "end" 
            then 'p.'||substring(col1 from concat('[A-Z]', start, '[A-Z]'))
          when col1 like 'p.%' and start <> "end" 
            then 'p.'||substring(col1 from concat('[A-Z]', start, '_[A-Z]', "end", 'delins[^\]] '))
          else col1
       end as col1, 
       start,
       "end"
from the_table;

As far as I can tell, the regex for the second alternative adds the _delins incorrectly to the regex marking the "start" value. And if that is a constant value, it should not be part of a regex "range" because [delins]' would also match the string sindel. Not sure if you intended that.

Another option would be to simply take everything after the "end" marker and exclude the closing ] by using

substring(col1 from concat('[A-Z]', start, '_[A-Z]', "end", '[^\]] '))
  • Related