I have this column inside myTable:
myColumn |
---|
[red, green] |
[green, green, red] |
I need to modify it so that red is replaced with 1 and green with 2:
myColumn |
---|
[1, 2] |
[2, 2, 1] |
In short, is there a way to apply a CASE clause to each element of the array, row-wise?
The closest I've gotten so far:
select replace(replace(to_json(myColumn), 'red', 1), 'green', 2)
On the other hand, if the column held plain strings, I could simply use:
select (
  case
    when myColumn = 'red' then 1
    when myColumn = 'green' then 2
  end
) from myTable;
CodePudding user response:
Assuming that the DataFrame has been registered as a temporary view named tmp, the following SQL statement produces the result.
sql = """
select
collect_list(
case col
when 'red' then 1
when 'green' then 2
end)
myColumn
from
(select mid,explode(myColumn) col
from
(select monotonically_increasing_id() mid,myColumn
from tmp)
)
group by mid
"""
df = spark.sql(sql)
df.show(truncate=False)
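For reference, the view this answer expects could be created from the question's sample data like this (a minimal sketch using the Scala API that the answers below use; createOrReplaceTempView works the same way from PySpark):
import spark.implicits._

// Sketch only: build the sample array column and expose it under the view name "tmp"
val df = Seq(Seq("red", "green"), Seq("green", "green", "red")).toDF("myColumn")
df.createOrReplaceTempView("tmp")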
CodePudding user response:
Let's create some sample data and a map that contains the substitutions you want to make:
val df = Seq((1, Seq("red", "green")),
(2, Seq("green", "green", "red")))
.toDF("id", "myColumn")
val values = Map("red" -> "1", "green" -> "2")
The most straightforward way would be to define a UDF that does exactly what you want:
// Array columns are handed to a Scala UDF as a Seq, so declare Seq[String]
val replace = udf((x: Seq[String]) =>
  x.map(value => values.getOrElse(value, value)))
df.withColumn("myColumn", replace('myColumn)).show
+---+---------+
| id| myColumn|
+---+---------+
|  1|   [1, 2]|
|  2|[2, 2, 1]|
+---+---------+
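If you need integer elements rather than the strings "1" and "2", one option (a sketch building on the UDF above, not something the answer itself shows) is to cast the replaced column:
// Hypothetical follow-up: cast the array<string> result to array<int>
df.withColumn("myColumn", replace('myColumn).cast("array<int>")).show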
Without UDFs, you could turn the array into a string with concat_ws, using separators that do not appear in your array, and then use string functions to make the edits:
val sep = ","
// Build one regexp_replace per (key, value) pair, each matching ",key," exactly
val replace = values
  .foldLeft(col("myColumn")){ case (column, (key, value)) =>
    regexp_replace(column, sep + key + sep, sep + value + sep)
  }
// Wrap every element in separators, substitute, strip the outer commas, re-split
df.withColumn("myColumn", concat(lit(sep), concat_ws(sep + sep, 'myColumn), lit(sep)))
  .withColumn("myColumn", regexp_replace(replace, "(^,)|(,$)", ""))
  .withColumn("myColumn", split('myColumn, sep + sep))
  .show
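Since concat_ws keeps the element order, this should display the same result as the UDF version:
+---+---------+
| id| myColumn|
+---+---------+
|  1|   [1, 2]|
|  2|[2, 2, 1]|
+---+---------+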
CodePudding user response:
In pure Spark SQL, you could convert your array into a string with concat_ws, make the substitutions with regexp_replace, and then recreate the array with split.
select split(
         regexp_replace(
           regexp_replace(
             concat_ws(',', myColumn),
             'red', '1'),
           'green', '2'),
         ',') myColumn
from df
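As in the first answer, this assumes a DataFrame (like the sample one from the previous answer) exposed as a view whose name matches the FROM clause. A minimal sketch of running it, where sqlText is a name used only for illustration and holds the SELECT above:
// Sketch only: register the sample DataFrame under the name used in the query
df.createOrReplaceTempView("df")
spark.sql(sqlText).show(false)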