I'm trying to modify a column from my dataFrame by removing the suffix from all the rows under that column and I need it in Scala. The values from the column have different lengths and also the suffix is different. For example, I have the following values:
09E9894DB868B70EC3B55AFB49975390-0_0_0_0_0
0978C74C69E8D559A62F860EA36ADF5E-28_3_1
0C12FA1DAFA8BCD95E34EE70E0D71D10-0_3_1
0D075AA40CFC244E4B0846FA53681B4D_0_1_0_1
22AEA8C8D403643111B781FE31B047E3-0_1_0_0
I need to remove everything after the "_" so that I can get the following values:
09E9894DB868B70EC3B55AFB49975390-0
0978C74C69E8D559A62F860EA36ADF5E-28
0C12FA1DAFA8BCD95E34EE70E0D71D10-0
0D075AA40CFC244E4B0846FA53681B4D
22AEA8C8D403643111B781FE31B047E3-0
CodePudding user response:
You can use user defined functions to perform the operation:
val beforeUnderscore: String => String = original => original.takeWhile(_ != '_')
spark.udf.register("before_underscore", beforeUnderscore)
// and then in your query expression
df.select(expr("before_underscore(COLUMN_NAME)"))
CodePudding user response:
As @werner pointed out in his comment, substring_index
provides a simple solution to this. It is not necessary to wrap this in a call to selectExpr
.
Whereas @AminMal has provided a working solution using a UDF, if a native Spark function can be used then this is preferable for performance.[1]
val df = List(
"09E9894DB868B70EC3B55AFB49975390-0_0_0_0_0",
"0978C74C69E8D559A62F860EA36ADF5E-28_3_1",
"0C12FA1DAFA8BCD95E34EE70E0D71D10-0_3_1",
"0D075AA40CFC244E4B0846FA53681B4D_0_1_0_1",
"22AEA8C8D403643111B781FE31B047E3-0_1_0_0"
).toDF("col0")
import org.apache.spark.sql.functions.{col, substring_index}
df
.withColumn("col0", substring_index(col("col0"), "_", 1))
.show(false)
gives:
-----------------------------------
|col0 |
-----------------------------------
|09E9894DB868B70EC3B55AFB49975390-0 |
|0978C74C69E8D559A62F860EA36ADF5E-28|
|0C12FA1DAFA8BCD95E34EE70E0D71D10-0 |
|0D075AA40CFC244E4B0846FA53681B4D |
|22AEA8C8D403643111B781FE31B047E3-0 |
-----------------------------------
[1] Is there a performance penalty when composing spark UDFs