I have an incoming record with a complex column delimiter and need to tokenize the record. One of the delimiter characters can be part of the data.
I am looking for a regex expression. Required to use on Teradata 16.1 with the function "REGEXP_SUBSTR".
There can max of 5 columns to tokenize. Planing to use case statements in Teradata to tokenize the columns. I guess regular expression for one token will do the trick.
Case#1: Column delimiter is ' - '
Sample data: On-e - tw o - thr$ee
Required output : [On-e, tw o, thr$ee]
My attempt : ([\S]*)\s{1}\-{1}\s{1}
Case#2 : Column delimiter is '::'
Sample data : On:e::tw:o::thr$ee
Required output : [On:e, tw:o, thr$ee]
Case#3 : Column delimiter is ':;'
Sample data : On:e:;tw;o:;thr$ee
Required output : [On:e, tw;o, thr$ee]
The above 3 cases are independent and do not occur together ie., 3 different solutions are required
CodePudding user response:
If you absolutely must use RegEx for this, you could do it like in the examples shown below using capture groups.
Generic example:
/(?<data>. ?)($delimiter|$)/gm
(?<data>. ?)
named capture group data, matching:
.
any character
?
occuring between one and unlimited times
followed by
($delimiter|$)
another capture group, matching:
$delimiter
- replace this with regex matching your delimiter string
|
or
$
end of string
Picking up your examples:
Case #1:
Column delimiter is ' - '
/(?<data>. ?)(\s-\s|$)/gm
(https://regex101.com/r/qMYxAY/1)
Case #2:
Column delimiter is '::'
/(?<data>. ?)(\:\:|$)/gm
https://regex101.com/r/IzaAoA/1
Case #3:
Column delimiter is ':;'
(?<data>. ?)(\:\;|$)
https://regex101.com/r/g1MUb6/1
CodePudding user response:
Normally you would use STRTOK
to split a string on a delimiter. But strtok can't handle a multi-character delimiter. One moderately over-complicated approach is to replace the multiple characters of the delimiter with a single character and split on that. For example:
select
strtok(oreplace(<your column>,' - ', '|'),'|',1) as one,
strtok(oreplace(somecol,' - ', '|'),'|',2) as two,
strtok(oreplace(somecol,' - ', '|'),'|',3) as three,
strtok(oreplace(<your column>,' - ', '|'),'|',4) as four,
strtok(oreplace(<your column>,' - ', '|'),'|',5) as five
If there are only three occurrences, like in your samples, it just returns null for the other two.