Home > Enterprise >  REGEXP_REPLACE Strings Starting and Ending with Specific Substrings in Snowflake
REGEXP_REPLACE Strings Starting and Ending with Specific Substrings in Snowflake

Time:11-03

I am trying to create a column in a view in Snowflake that replaces any string between strings that I care about with nothing.

This is essentially for the purpose of stripping html formatting out of text. As an example:

<ul>
<li>Text I care about 1
<li>Text I care about 2</li>
<li>Text I care about 3</li>
</ul>

Would should end up like this:


Text I care about 1
Text I care about 2
Text I care about 3

Based on the patterns I am seeing, I think that if I can eliminate any string starting with &lt, and ending with >, I should be able to achieve the result I am looking for.

In testing on different sites it seems like expression REGEXP_REPLACE(originaltext, '&lt. ?>','') should, work, but when attempting in Snowflake it seems to be cutting off the last 'Text I care about' in some cases, and in other cases just is not showing any results at all. I am not sure if there is a syntax difference or something else off in the version of regex snowflake is using, but any advice would be appreciated.

CodePudding user response:

Your regular expression works, but it requires lookarounds.

set sample1 = '<ul>';
set sample2 = '<li>Text I care about 1';
set sample3 = '<li>Text I care about 2</li>';
set sample4 = '<li>Text I care about 3</li>';
set sample5 = '</ul>';

select regexp_replace2($SAMPLE1,'&lt. ?>','');  
select regexp_replace2($SAMPLE2,'&lt. ?>','');
select regexp_replace2($SAMPLE3,'&lt. ?>','');
select regexp_replace2($SAMPLE4,'&lt. ?>','');
select regexp_replace2($SAMPLE5,'&lt. ?>','');

I wrote a UDF library that supports regular expression lookarounds. It attempts to approximate the built-in Snowflake regular expression functions while supporting lookarounds. The names of the UDFs are the same as the built-in regular expression functions with the suffix "2" as shown in the SQL sample.

https://github.com/GregPavlik/SnowflakeUDFs/tree/main/RegularExpressions

CodePudding user response:

Not elegant, but if you know all the special encodings you want to remove, maybe you can just list them like that?

select REGEXP_REPLACE('<li>Text I care about 3</li>', '(<)|(li>)|(/li>)','')

CodePudding user response:

Your challenge is the fact that you are using a LAZY quantifier (. ?) and Snowflake doesn't supports it as according to our docs:

Patterns support the full POSIX ERE (Extended Regular Expression) syntax. For details, see the POSIX basic and extendedsection (in Wikipedia).

The Wikipedia link shows that LAZY is NOT covered by the ERE standard, but is it an extension.

In your case you could maybe use a REGEXP_SUBSTR, like this:

SELECT REGEXP_SUBSTR('<li>Text I care about 1</li>', '(\\w \\s) \\d');

with output like:

Text I care about 1

but this requires a specific pattern on your data.

  • Related