Home > other >  Split records with complex delimiter
Split records with complex delimiter

Time:03-29

I have an incoming record with a complex column delimiter and need to tokenize the record. One of the delimiter characters can be part of the data.

I am looking for a regex expression. Required to use on Teradata 16.1 with the function "REGEXP_SUBSTR".

There can max of 5 columns to tokenize. Planing to use case statements in Teradata to tokenize the columns. I guess regular expression for one token will do the trick.

Case#1: Column delimiter is ' - ' 
Sample data: On-e - tw o - thr$ee
Required output : [On-e, tw o, thr$ee]

My attempt : ([\S]*)\s{1}\-{1}\s{1}

Case#2 : Column delimiter is '::' 
Sample data : On:e::tw:o::thr$ee 
Required output : [On:e, tw:o, thr$ee]

Case#3 : Column delimiter is ':;' 
Sample data : On:e:;tw;o:;thr$ee
Required output : [On:e, tw;o, thr$ee]

The above 3 cases are independent and do not occur together ie., 3 different solutions are required

CodePudding user response:

If you absolutely must use RegEx for this, you could do it like in the examples shown below using capture groups.



Generic example:

/(?<data>. ?)($delimiter|$)/gm

(?<data>. ?) named capture group data, matching:
. any character
? occuring between one and unlimited times

followed by

($delimiter|$) another capture group, matching:
$delimiter - replace this with regex matching your delimiter string
| or
$ end of string


Picking up your examples:

Case #1:

Column delimiter is ' - '

/(?<data>. ?)(\s-\s|$)/gm

(https://regex101.com/r/qMYxAY/1)


Case #2:

Column delimiter is '::'

/(?<data>. ?)(\:\:|$)/gm

https://regex101.com/r/IzaAoA/1


Case #3:

Column delimiter is ':;'

(?<data>. ?)(\:\;|$)

https://regex101.com/r/g1MUb6/1

CodePudding user response:

Normally you would use STRTOK to split a string on a delimiter. But strtok can't handle a multi-character delimiter. One moderately over-complicated approach is to replace the multiple characters of the delimiter with a single character and split on that. For example:

select
strtok(oreplace(<your column>,' - ', '|'),'|',1) as one,
strtok(oreplace(somecol,' - ', '|'),'|',2) as two,
strtok(oreplace(somecol,' - ', '|'),'|',3) as three,
strtok(oreplace(<your column>,' - ', '|'),'|',4) as four,
strtok(oreplace(<your column>,' - ', '|'),'|',5) as five

If there are only three occurrences, like in your samples, it just returns null for the other two.

  • Related