I want to find duplicate values in a string. Is there any way to do this using Spark SQL? Below is the query I've written.
spark.sql("""select case when lower(value) like '%,code,%' or lower(value) like '%,code%' or lower(value) like '%code,%' then 'Y' else 'N' end as value from input""").show(false)
Input:
val sample = Seq("code,code").toDF("value")
The issue is: what if the value 'code' appears more than once? In that case my requirement is to return the flag 'N'. Is there any way to do that using Spark SQL? Please share your suggestions. TIA
CodePudding user response:
You can establish the number of occurrences using length and replace. The examples below use MySQL session variables to demonstrate the idea; the same expression works in Spark SQL (a sketch follows the examples).
set @a = 'code,aaa';
set @b = 'code,code';
set @c = 'aaa,bbb,ccc';
select @a, length(@a) - length(replace(@a,'code','')) lengthdiff,
case when length(@a) - length(replace(@a,'code','')) = 4 then 'y'
when length(@a) - length(replace(@a,'code','')) > 4 then 'n'
else 'n'
end occurrences
;
+----------+------------+-------------+
| @a       | lengthdiff | occurrences |
+----------+------------+-------------+
| code,aaa |          4 | y           |
+----------+------------+-------------+
1 row in set (0.001 sec)
select @b, length(@b) - length(replace(@b,'code','')) lengthdiff,
case when length(@b) - length(replace(@b,'code','')) = 4 then 'y'
when length(@b) - length(replace(@b,'code','')) > 4 then 'n'
else 'n'
end occurrences
;
+-----------+------------+-------------+
| @b        | lengthdiff | occurrences |
+-----------+------------+-------------+
| code,code |          8 | n           |
+-----------+------------+-------------+
1 row in set (0.000 sec)
select @c, length(@c) - length(replace(@c,'code','')) lengthdiff,
case when length(@c) - length(replace(@c,'code','')) = 4 then 'y'
when length(@c) - length(replace(@c,'code','')) > 4 then 'n'
else 'n'
end occurrences
;
+-------------+------------+-------------+
| @c          | lengthdiff | occurrences |
+-------------+------------+-------------+
| aaa,bbb,ccc |          0 | n           |
+-------------+------------+-------------+
1 row in set (0.001 sec)
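For completeness, here is a minimal sketch of the same length/replace approach in Spark SQL. It assumes the data is registered as a temp view named input with a string column value, as in the question; length and replace are built-in Spark SQL functions, and the threshold 4 is simply the length of the literal 'code'.

// Hypothetical sample data covering zero, one, and two occurrences of 'code'
val sample = Seq("code,aaa", "code,code", "aaa,bbb,ccc").toDF("value")
sample.createOrReplaceTempView("input")

spark.sql("""
  select value,
         -- length difference = 4 means 'code' appears exactly once
         case when length(lower(value)) - length(replace(lower(value), 'code', '')) = 4 then 'Y'
              else 'N'
         end as flag
  from input
""").show(false)

This returns 'Y' only when 'code' appears exactly once (length difference = 4) and 'N' when it is missing or repeated, which matches the requirement in the question.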