I have this column inside myTable:
myColumn |
---|
[red, green] |
[green, green, red] |
I need to modify it so that red is replaced with 1 and green with 2:
myColumn |
---|
[1, 2] |
[2, 2, 1] |
In short, is there a way to apply a CASE clause to each element of the array, row-wise?
The closest I've gotten so far:
select replace(replace(to_json(myColumn), 'red', 1), 'green', 2)
On the other hand, if the column held plain strings, I could simply use:
select (
  case
    when myColumn = 'red' then 1
    when myColumn = 'green' then 2
  end
) from myTable;
CodePudding user response:
Assuming that the DataFrame has been registered as a temporary view named tmp, the following SQL statement produces the result.
sql = """
select
collect_list(
case col
when 'red' then 1
when 'green' then 2
end)
myColumn
from
(select mid,explode(myColumn) col
from
(select monotonically_increasing_id() mid,myColumn
from tmp)
)
group by mid
"""
df = spark.sql(sql)
df.show(truncate=False)
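For reference, the view this answer expects could be created from the question's sample data like this (a minimal sketch using the Scala API that the answers below use; createOrReplaceTempView works the same way from PySpark):
import spark.implicits._

// Sketch only: build the sample array column and expose it under the view name "tmp"
val df = Seq(Seq("red", "green"), Seq("green", "green", "red")).toDF("myColumn")
df.createOrReplaceTempView("tmp")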
CodePudding user response:
Let's create some sample data and a map that contains the substitutions you want to make:
val df = Seq((1, Seq("red", "green")),
(2, Seq("green", "green", "red")))
.toDF("id", "myColumn")
val values = Map("red" -> "1", "green" -> "2")
The most straightforward way would be to define a UDF that does exactly what you want:
// Array columns are handed to a Scala UDF as a Seq, so declare Seq[String]
val replace = udf((x: Seq[String]) =>
  x.map(value => values.getOrElse(value, value)))
df.withColumn("myColumn", replace('myColumn)).show
+---+---------+
| id| myColumn|
+---+---------+
|  1|   [1, 2]|
|  2|[2, 2, 1]|
+---+---------+
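If you need integer elements rather than the strings "1" and "2", one option (a sketch building on the UDF above, not something the answer itself shows) is to cast the replaced column:
// Hypothetical follow-up: cast the array<string> result to array<int>
df.withColumn("myColumn", replace('myColumn).cast("array<int>")).show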
Without UDFs, you could turn the array into a string with concat_ws, using separators that do not appear in your array, and then use string functions to make the edits:
val sep = ","
// Build one regexp_replace per (key, value) pair, each matching ",key," exactly
val replace = values
  .foldLeft(col("myColumn")){ case (column, (key, value)) =>
    regexp_replace(column, sep + key + sep, sep + value + sep)
  }
// Wrap every element in separators, substitute, strip the outer commas, re-split
df.withColumn("myColumn", concat(lit(sep), concat_ws(sep + sep, 'myColumn), lit(sep)))
  .withColumn("myColumn", regexp_replace(replace, "(^,)|(,$)", ""))
  .withColumn("myColumn", split('myColumn, sep + sep))
  .show
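Since concat_ws keeps the element order, this should display the same result as the UDF version:
+---+---------+
| id| myColumn|
+---+---------+
|  1|   [1, 2]|
|  2|[2, 2, 1]|
+---+---------+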
CodePudding user response:
In pure Spark SQL, you could convert your array into a string with concat_ws, make the substitutions with regexp_replace, and then recreate the array with split.
select split(
         regexp_replace(
           regexp_replace(
             concat_ws(',', myColumn),
             'red', '1'),
           'green', '2'),
         ',') myColumn
from df
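As in the first answer, this assumes a DataFrame (like the sample one from the previous answer) exposed as a view whose name matches the FROM clause. A minimal sketch of running it, where sqlText is a name used only for illustration and holds the SELECT above:
// Sketch only: register the sample DataFrame under the name used in the query
df.createOrReplaceTempView("df")
spark.sql(sqlText).show(false)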