Spark Scala - how to calculate the md5 of multiple columns based on a list of columns

I have a list that contains column names. I need to concatenate these columns, compute an md5 hash of the result, and append it to the dataframe as a new column.

Example

I have a table with the columns fname, lname, and address. The resulting dataframe should contain fname, lname, address, and md5(concat_ws(",", fname, lname)).

My list would contain fname and lname.

Code

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val businessCols = List("fname", "lname")
val df = spark.table(s"${databaseName}.${databaseName}")
val new_df = df.
                withColumn("concatenated_cols",concat_ws(",",$"businessCols": _*)).
                withColumn("md5_hash", md5($"concatenated_cols"))

Error

found   : org.apache.spark.sql.ColumnName
 required: Seq[org.apache.spark.sql.Column]
                withColumn("concatenated_cols",concat_ws(",",$"businessCols": _*)).

CodePudding user response：

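import java.security.MessageDigest
import spark.implicits._
import org.apache.spark.sql.functions.{concat_ws, udf}

case class Test(fname: String, lname: String, address: String)

// Hex-encoded MD5 digest of a string (UTF-8 bytes)
def md5(s: String): String = {
  MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8")).map("%02x".format(_)).mkString
}
// Wrap the helper in a UDF so it runs once per row; calling it on a Column's
// toString would hash the expression text on the driver, not the row data.
val md5Udf = udf(md5 _)
val df1 = spark.createDataset(Seq(Test("1", "1", "2")))
df1.withColumn("md5(concat_ws(\",\",fname, lname))", md5Udf(concat_ws(",", $"fname", $"lname")))
  .show(false)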

CodePudding user response：

I was able to fix this by building a list of Column objects from the column names:

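import org.apache.spark.sql.Column
// Build a List[Column] from the column names so it can be passed as varargs: concat_ws(",", cols: _*)
val cols = businessCols.foldLeft(List[Column]()){(acc,l) => acc ++ List(col(l))}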
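
For reference, here is a minimal end-to-end sketch of the same idea. The compile error in the question arises because $"businessCols" builds a single column literally named businessCols, whereas concat_ws expects a sequence of Column values after the separator. Mapping each name through col gives that sequence more directly than the foldLeft. The sample row and the local SparkSession are assumptions for illustration only; in spark-shell the spark value already exists, and df would come from the real table.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws, md5}

// Local session for illustration only; in spark-shell `spark` is already defined.
val spark = SparkSession.builder().master("local[1]").appName("md5-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for the real table.
val df = Seq(("John", "Doe", "1 Main St")).toDF("fname", "lname", "address")

val businessCols = List("fname", "lname")

// Turn each column name into a Column, then expand the list as varargs.
val newDf = df.withColumn("md5_hash", md5(concat_ws(",", businessCols.map(col): _*)))

newDf.show(false)
```

Note that concat_ws skips null values, so rows that differ only in which column is null can end up with the same hash. If that matters, one option is to wrap each column in coalesce with a placeholder value before concatenating.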