I want to find duplicate values in a string. Is there any way to do this using Spark SQL? Below is the query I've written.
spark.sql("""select case when lower(value) like '%,code,%' or lower(value) like '%,code%' or lower(value) like '%code,%' then 'Y' else 'N' end as value from input""").show(false)
Input:
val sample = Seq("code,code").toDF("value")
The issue is: what if the value 'code' appears more than once? In that case my requirement is to return the flag 'N'. Is there any way to do that using Spark SQL? Please share your suggestions. TIA
CodePudding user response:
You can establish the number of occurrences using length and replace. The examples below use MySQL session variables to demonstrate the idea; the same expression works in Spark SQL (a sketch follows the examples).
set @a = 'code,aaa';
set @b = 'code,code';
set @c = 'aaa,bbb,ccc';
select @a, length(@a) - length(replace(@a,'code','')) lengthdiff,
case when length(@a) - length(replace(@a,'code','')) = 4 then 'y'
when length(@a) - length(replace(@a,'code','')) > 4 then 'n'
else 'n'
end occurrences
;
+----------+------------+-------------+
| @a       | lengthdiff | occurrences |
+----------+------------+-------------+
| code,aaa |          4 | y           |
+----------+------------+-------------+
1 row in set (0.001 sec)
select @b, length(@b) - length(replace(@b,'code','')) lengthdiff,
case when length(@b) - length(replace(@b,'code','')) = 4 then 'y'
when length(@b) - length(replace(@b,'code','')) > 4 then 'n'
else 'n'
end occurrences
;
+-----------+------------+-------------+
| @b        | lengthdiff | occurrences |
+-----------+------------+-------------+
| code,code |          8 | n           |
+-----------+------------+-------------+
1 row in set (0.000 sec)
select @c, length(@c) - length(replace(@c,'code','')) lengthdiff,
case when length(@c) - length(replace(@c,'code','')) = 4 then 'y'
when length(@c) - length(replace(@c,'code','')) > 4 then 'n'
else 'n'
end occurrences
;
+-------------+------------+-------------+
| @c          | lengthdiff | occurrences |
+-------------+------------+-------------+
| aaa,bbb,ccc |          0 | n           |
+-------------+------------+-------------+
1 row in set (0.001 sec)
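For completeness, here is a minimal sketch of the same length/replace approach in Spark SQL. It assumes the data is registered as a temp view named input with a string column value, as in the question; length and replace are built-in Spark SQL functions, and the threshold 4 is simply the length of the literal 'code'.

// Hypothetical sample data covering zero, one, and two occurrences of 'code'
val sample = Seq("code,aaa", "code,code", "aaa,bbb,ccc").toDF("value")
sample.createOrReplaceTempView("input")

spark.sql("""
  select value,
         -- length difference = 4 means 'code' appears exactly once
         case when length(lower(value)) - length(replace(lower(value), 'code', '')) = 4 then 'Y'
              else 'N'
         end as flag
  from input
""").show(false)

This returns 'Y' only when 'code' appears exactly once (length difference = 4) and 'N' when it is missing or repeated, which matches the requirement in the question.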