Home > Net >  count number of times a regex pattern occurs in hive
count number of times a regex pattern occurs in hive

Time:11-23

I have a string variable stored in hive as follows

stringvar
AA1,BB3,CD4
AA12,XJ5

I would like to count (and filter on) how many times the regex pattern \w\w\d occurs. In the example, in the first row there are obviously three such examples. How can I do that without resorting to lateral views and explosions of stringvar (too expensive)?

Thanks!

CodePudding user response:

You can split string by pattern and calculate size of result array - 1.

Demo:

select size(split('AA1,BB3,CD4','\\w\\w\\d'))-1  --returns 3
select size(split('AA12,XJ5','\\w\\w\\d'))-1     --returns 2
select size(split('AAxx,XJx','\\w\\w\\d'))-1     --returns 0
select size(split('','\\w\\w\\d'))-1             --returns 0

If column is null-able than special care should be taken. For example like this (depends on what you need to be returned in case of NULL):

select case when col is null then 0
            else size(split(col,'\\w\\w\\d'))-1 
       end

Or simply convert NULL to empty string using NVL function:

select size(split(NVL(col,''),'\\w\\w\\d'))-1 

The solution above is the most flexible one, you can count the number of occurrences and use it for complex filtering/join/etc.

In case you just need to filter records with fixed number of pattern occurrences or at least fixed number and do not need to know exact count then simple RLIKE without splitting is the cheapest method.

For example check for at least 2 repeats:

select 'AA1,BB3,CD4' rlike('\\w\\w\\d ,\\w\\w\\d ') --returns true, can be used in WHERE
  • Related