Split by delimiter which is contained in a record-CodePudding

I have a column which I am splitting in Snowflake.

The format is as follows:

I have been using split_to_table(A, ',') inside of my query but as you can probably tell this uncorrectly also splits the Scooter > Sprinting, Jogging and Walking record.

Perhaps having the delimiter only work if there is no spaced on either side of it? As I cannot see a different condition that could work.

I have been researching online but haven't found a suitable work around yet, is there anyone that encountered a similar problem in the past?

Thanks

CodePudding user response：

This is a custom rule for the split to table, so we can use a UDTF to apply a custom rule:

create or replace function split_to_table2(STR string, DELIM string, ROW_MUST_CONTAIN string)
returns table (VALUE string)
language javascript
strict immutable
as
$$
{
    initialize: function (argumentInfo, context) {
    },
    processRow: function (row, rowWriter, context) {
        var buffer = "";
        var i;
        const s = row.STR.split(row.DELIM);
        for(i=0; i<s.length-1; i  ) {
            buffer  = s[i];
            if(s[i 1].includes(row.ROW_MUST_CONTAIN)) {
                rowWriter.writeRow({VALUE: buffer});
                buffer = "";
            } else {
                buffer  = row.DELIM
            }
        }
        rowWriter.writeRow({VALUE: s[i]})
    },
}
$$;

select VALUE from 
table(split_to_table2('Car > Bike,Bike > Scooter,Scooter > Sprinting, Jogging and Walking,Walking > Flying', ',', '>'))
;

Output:

VALUE
Car > Bike
Bike > Scooter
Scooter > Sprinting, Jogging and Walking
Walking > Flying

This UDTF adds one more parameter than the two in the build in table function split_to_table. The third parameter, ROW_MUST_CONTAIN is the string a row must contain. It splits the string on DELIM, but if it does not have the ROW_MUST_CONTAIN string, it concatenates the strings to form a complete string for a row. In this case we just specify , for the delimiter and > for ROW_MUST_CONTAIN.

CodePudding user response：

Yes, you are right. The only pattern, which I can see, is the one with the whitespace after the comma.

It's a small workaround but we can make use of this pattern. In below code I am replacing such commas, where we do have whitespaces afterwards. Then I am applying split to table function and I am converting the previous replacement back.

It's not super pretty and would crash if your string contains "my_replacement" or any other new pattern, but its working for me:

select replace(t.value, 'my_replacement', ', ') 
from table(
           split_to_table(replace('Car > Bike,Bike > Scooter,Scooter > Sprinting, Jogging and Walking,Walking > Flying', ', ', 'my_replacement'),',')) t

CodePudding user response：

We can get a little clever with regexp_replace by replacing the actual delimiters with something else before the table split. I am using double pipes '||' but you can change that to something else. The '\|\|\\1' trick is called back-referencing that allows us to include the captured group (\\1) as part of replacement (\|\|)

set str='car>bike,bike>car,truck, and jeep,horse>cat,truck>car,truck, and jeep';

select $str, *
from table(split_to_table(regexp_replace($str,',([^>,] >)','\|\|\\1'),'||'))