Home > OS >  BigQuery regex to find string that contains chinese characters
BigQuery regex to find string that contains chinese characters

Time:01-11

I want to find string that contains any chinese characters.

I have the following query in PostgreSQL which works as expected.

with tmp as (
    select '中文zz' as word
    union all
    select '中文' as word
    union all
    select 'english' as word
    union all
    select 'にほんご' as word
    union all
    select 'eng–lish' as word
)
select word,
word ~* '[\x4e00-\x9fff\x3400-\x4dbf]'
from tmp 

Results:

中文zz       true
中文         true
english     false
にほんご     false
eng–lish    false

However, if I convert this SQL in BigQuery, it does not produce the same result.

with tmp as (
    select '中文zz' as word
    union all
    select '中文' as word
    union all
    select 'english' as word
    union all
    select 'にほんご' as word
    union all
    select 'eng–lish' as word
)
select word,
regexp_contains(word, r'[\x4e00-\x9fff\x3400-\x4dbf]')
from tmp

Results:

中文zz      true
中文        false
english     true
にほんご     false
eng–lish    true

CodePudding user response:

You can use the following regex with BigQuery :

with tmp as (
    select '中文zz' as word
    union all
    select '中文' as word
    union all
    select 'english' as word
    union all
    select 'にほんご' as word
    union all
    select 'eng–lish' as word
)
select word,
regexp_contains(word, '''[\u4E00-\u9FA5]''')
from tmp

The result is :

enter image description here

  • Related