I want to use the replaceFirst()
function in spark scala sql.
or
Is it possible to use the replaceFirst()
function in spark scala dataframe?
Is this possible without using a UDF?
The function I want to do is:
println("abcdefgbchijkl".replaceFirst("bc","**BC**"))
// a**BC**defgbchijkl
However, the Column Type of DataFrame cannot be applied with Function:
var test0 = Seq("abcdefgbchijkl").toDF("col0")
test0
.select(col("col0").replaceFirst("bc","**BC**"))
.show(false)
/*
<console>:230: error: value replaceFirst is not a member of org.apache.spark.sql.Column
.select(col("col0").replaceFirst("bc","**BC**"))
*/
Also, I don't know how to use it in SQL form:
%sql
-- How to use replaceFirst()
select replaceFirst()
CodePudding user response:
Replacing the first occurrence isn't something I can see supported out of the box by Spark, but it is possible by combining a few functions:
import org.apache.spark.sql.functions.{array_join, col, split}
val test0 = Seq("abcdefgbchijkl").toDF("col0") // replaced `var` with `val`
val stringToReplace = "bc"
val replacement = "**BC**"
test0
// create a temporary column, splitting the string by the first occurrence of `bc`
.withColumn("temp", split(col("col0"), stringToReplace, 2))
// recombine the strings before and after `bc` with the desired replacement
.withColumn("col0", array_join(col("temp"), replacement))
// we no longer need this `temp` column
.drop(col("temp"))
For (spark) SQL:
-- recombine the strings before and after `bc` with the desired replacement
SELECT tempr[0] || "**BC**" || tempr[1] AS col0
FROM (
-- create a temporary column, splitting the string by the first occurrence of `bc`
SELECT split(col0, "bc", 2) AS tempr
FROM (
SELECT 'abcdefgbchijkl' AS col0
)
)