Home > other >  Snowflake regexp_like breaks when string column contains end of line characters
Snowflake regexp_like breaks when string column contains end of line characters

Time:09-08

We've been pulling our hair out debugging Snowflakes regexp_like/rlike function on a string column. It started silently failing and always returning a False when the string contained a line break (\n).

Example:

select regexp_like('looking for a keyword', '.*keyword.*', 'i')
-> True
select regexp_like('looking for a 
keyword', '.*keyword.*', 'i')
-> False

Both expressions, however, should evaluate to True.

Some observations:

  1. The ilike function doesn't seem to suffer from this issue.
select 'looking for a \n keyword' ilike '%keyword%'
-> True

ilike and regexp_like are differently implemented which would explain why one works but not the other.

  1. The (programmatic) workaround is to strip all linebreak characters.
select regexp_like(replace('looking for a 
keyword','\n',''), '.*keyword.*', 'i')
->True

We obtained the string entries in our table from websites parsed with beautiful soup. While beautifulsoup removes HTML <br> line breaks, it doesn't seem to remove the line breaks encoded through \n and \r. In retrospective, I see that regexp could be tripped up by a newline character.

Is this expected behavior and specific to Snowflake?

CodePudding user response:

It is possible to provide parameter s:

select regexp_like('looking for a 
keyword', '.*keyword.*', 'is')
-- TRUE

Specifying the Parameters for the Regular Expression:

s - Enables the POSIX wildcard character . to match \n. By default, wildcard character matching is disabled.

The default string is simply c, which specifies:

POSIX wildcard character . does not match \n newline characters.

  • Related