Home > Back-end >  How do I detect an emoji in a Snowflake VARCHAR?
How do I detect an emoji in a Snowflake VARCHAR?

Time:11-12

I have a 1B row table of chat messages with a max-sized VARCHAR column for the raw message text. How do I -

  1. Select just the rows that contain 1 or more emoji characters
  2. Efficiently filter out rows without emoji (if needed to improve performance)

CodePudding user response:

Combining the knowledge shared by Lukasz and Greg, plus discovering an additional hidden character, I can present this solution:

  • 1 JS UDF that detects the presence of emojis.
  • 1 JS UDF that detects if the string is only emojis.
create or replace function has_emoji(X text)
returns boolean
language javascript
as $$
return /\p{Extended_Pictographic}/u.test(X);
$$;

create or replace function only_emoji(X text)
returns boolean
language javascript
as $$
return /^[\p{Extended_Pictographic}\p{Emoji_Component}] $/u.test(X);
$$;

Sample use:

with data as (
  select $1 t
  from values('❄️'),('❄️ is the data ☁️'),('no emoji')
)

select *, has_emoji(t), only_emoji(t)
from data
;

enter image description here

--

Some research:

  • '❄️'.replace(/\p{Extended_Pictographic}/ug, '') returns ''
  • but '❄️'.replace(/\p{Extended_Pictographic}/ug, '') is not equal to ''
  • Turns out there are hidden characters that still need to be replaced, and matched by Emoji_Component

enter image description here

CodePudding user response:

Using Snowflake's enter image description here

Related: How to detect emoji using javascript

  • Related