I have a use case where I need to concatenate a '}' onto a string using Spark SQL. A sample dataset is shown below:
+-------------------------------------+-----+
|col_1                                |col_2|
+-------------------------------------+-----+
|{"key_1" : "val_1","key_2" : "val_2"}|abcd |
+-------------------------------------+-----+
root
|-- col_1: string (nullable = true)
|-- col_2: string (nullable = true)
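For reference, here is a minimal sketch to reproduce this dataset (assuming an active SparkSession named spark; the temp view name df matches the query below):

import spark.implicits._

// Sample data matching the table above.
val df = Seq(
  ("""{"key_1" : "val_1","key_2" : "val_2"}""", "abcd")
).toDF("col_1", "col_2")

// Register it so it can be queried as `df` in Spark SQL.
df.createOrReplaceTempView("df")
df.printSchema()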
I want to check the length of col_1 and, based on that, add the value of col_2 into the JSON-formatted string in col_1. I have written the following Spark SQL query:
select *,
  case
    when length(col_1) = 2 then
      concat(substring(col_1, 0, length(col_1) - 1), '"col_2":"', cast(col_2 as STRING), '"}')
    else
      concat(substring(col_1, 0, length(col_1) - 1), ',"col_2":"', cast(col_2 as STRING), '"}')
  end as mod_col_1
from df
The query fails to parse when it encounters the '}' character. Is there any way to add/escape this character in the query, or some other way to generate the desired string?
Expected output when col_1 = "{}":
+-----+-----+
|col_1|col_2|
+-----+-----+
|{}   |abcd |
+-----+-----+
output:
+-----+-----+------------------+
|col_1|col_2|mod_col_1         |
+-----+-----+------------------+
|{}   |abcd |{'col_2' : 'abcd'}|
+-----+-----+------------------+
when col_1 = {"key_1" : "val_1","key_2" : "val_2"}:
+-------------------------------------+-----+
|col_1                                |col_2|
+-------------------------------------+-----+
|{"key_1" : "val_1","key_2" : "val_2"}|abcd |
+-------------------------------------+-----+
output:
+-------------------------------------+-----+----------------------------------------------------+
|col_1                                |col_2|mod_col_1                                           |
+-------------------------------------+-----+----------------------------------------------------+
|{"key_1" : "val_1","key_2" : "val_2"}|abcd |{"key_1" : "val_1","key_2" : "val_2","col_2":"abcd"}|
+-------------------------------------+-----+----------------------------------------------------+
Happy to share more details if required.
CodePudding user response:
You can try the regexp_replace() function. Check this:
spark.sql("""
with t1 as (
  select '{"key_1" : "val_1","key_2" : "val_2"}' col_1, 'abcd' col_2
  union all
  select '{}', 'defg'
)
select *,
  case
    when col_1 = '{}' then "{ 'col_2' : '" || col_2 || "'}"
    else regexp_replace(col_1, "[}]", ",") || "'col_2' : '" || col_2 || "'}"
  end as x
from t1
""").show(50, false)
Output:
+-------------------------------------+-----+------------------------------------------------------+
|col_1                                |col_2|x                                                     |
+-------------------------------------+-----+------------------------------------------------------+
|{"key_1" : "val_1","key_2" : "val_2"}|abcd |{"key_1" : "val_1","key_2" : "val_2",'col_2' : 'abcd'}|
|{}                                   |defg |{ 'col_2' : 'defg'}                                   |
+-------------------------------------+-----+------------------------------------------------------+
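The trick here is that regexp_replace(col_1, "[}]", ",") turns the closing brace into a separator, so the new key/value pair can simply be appended and the object re-closed with '}'. For the empty object '{}' that replacement would leave a dangling comma ('{,'), which is why that case gets its own branch.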
Update: to get the appended pair in double quotes, swap the quoting so the double quotes sit inside single-quoted literals:
spark.sql("""
with t1 as (
  select '{"key_1" : "val_1","key_2" : "val_2"}' col_1, 'abcd' col_2
  union all
  select '{}', 'defg'
)
select *,
  case
    when col_1 = '{}' then '{ "col_2" : "' || col_2 || '"}'
    else regexp_replace(col_1, "[}]", ",") || '"col_2" : "' || col_2 || '"}'
  end as x
from t1
""").show(50, false)
+-------------------------------------+-----+------------------------------------------------------+
|col_1                                |col_2|x                                                     |
+-------------------------------------+-----+------------------------------------------------------+
|{"key_1" : "val_1","key_2" : "val_2"}|abcd |{"key_1" : "val_1","key_2" : "val_2","col_2" : "abcd"}|
|{}                                   |defg |{ "col_2" : "defg"}                                   |
+-------------------------------------+-----+------------------------------------------------------+
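If you would rather stay in the DataFrame API than raw SQL, the same logic can be sketched with org.apache.spark.sql.functions (this assumes the df DataFrame from the question; the function names are standard Spark ones):

import org.apache.spark.sql.functions.{col, concat, lit, regexp_replace, when}

// Same idea as the SQL above: turn the closing '}' into a ',',
// append the new key/value pair, then close the object again.
// The when() branch avoids a dangling comma for the empty object '{}'.
val result = df.withColumn(
  "mod_col_1",
  when(col("col_1") === "{}",
       concat(lit("{\"col_2\" : \""), col("col_2"), lit("\"}")))
    .otherwise(
       concat(regexp_replace(col("col_1"), "[}]", ","),
              lit("\"col_2\" : \""), col("col_2"), lit("\"}"))))

result.show(false)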