How to remove all characters that start with "_" from a spark string column-CodePudding

I'm trying to modify a column from my dataFrame by removing the suffix from all the rows under that column and I need it in Scala. The values from the column have different lengths and also the suffix is different. For example, I have the following values:

09E9894DB868B70EC3B55AFB49975390-0_0_0_0_0
0978C74C69E8D559A62F860EA36ADF5E-28_3_1
0C12FA1DAFA8BCD95E34EE70E0D71D10-0_3_1
0D075AA40CFC244E4B0846FA53681B4D_0_1_0_1
22AEA8C8D403643111B781FE31B047E3-0_1_0_0

I need to remove everything after the "_" so that I can get the following values:

09E9894DB868B70EC3B55AFB49975390-0
0978C74C69E8D559A62F860EA36ADF5E-28
0C12FA1DAFA8BCD95E34EE70E0D71D10-0
0D075AA40CFC244E4B0846FA53681B4D
22AEA8C8D403643111B781FE31B047E3-0

CodePudding user response：

You can use user defined functions to perform the operation:

val beforeUnderscore: String => String = original => original.takeWhile(_ != '_')

spark.udf.register("before_underscore", beforeUnderscore)

// and then in your query expression
df.select(expr("before_underscore(COLUMN_NAME)"))

CodePudding user response：

As @werner pointed out in his comment, substring_index provides a simple solution to this. It is not necessary to wrap this in a call to selectExpr.

Whereas @AminMal has provided a working solution using a UDF, if a native Spark function can be used then this is preferable for performance.[1]

val df = List(
    "09E9894DB868B70EC3B55AFB49975390-0_0_0_0_0",
    "0978C74C69E8D559A62F860EA36ADF5E-28_3_1",
    "0C12FA1DAFA8BCD95E34EE70E0D71D10-0_3_1",
    "0D075AA40CFC244E4B0846FA53681B4D_0_1_0_1",
    "22AEA8C8D403643111B781FE31B047E3-0_1_0_0"
).toDF("col0")

import org.apache.spark.sql.functions.{col, substring_index}

df
  .withColumn("col0", substring_index(col("col0"), "_", 1))
  .show(false)

gives:

 ----------------------------------- 
|col0                               |
 ----------------------------------- 
|09E9894DB868B70EC3B55AFB49975390-0 |
|0978C74C69E8D559A62F860EA36ADF5E-28|
|0C12FA1DAFA8BCD95E34EE70E0D71D10-0 |
|0D075AA40CFC244E4B0846FA53681B4D   |
|22AEA8C8D403643111B781FE31B047E3-0 |
 -----------------------------------

[1] Is there a performance penalty when composing spark UDFs