How to create a unique hash for each row having a null value in a column in Spark Scala?


We have the following dataframe in Spark Scala:

firstname  middlename  lastname  dob         gender  salary
Michael    Rose        null      2000-05-19  M       4000
Michael    null        Rose      2000-05-19  M       4000
null       Michael     Rose      2000-05-19  M       4000

We want to create a unique row_hash for each row's data in another dataframe, so we apply the following transformation:

import org.apache.spark.sql.functions.{col, hash}

val df2 = df.withColumn("row_hash", hash(df.columns.map(col): _*))

And we get the following:

firstname  middlename  lastname  dob         gender  salary  row_hash
Michael    Rose        null      2000-05-19  M       4000    -613328421
Michael    null        Rose      2000-05-19  M       4000    -613328421
null       Michael     Rose      2000-05-19  M       4000    -613328421

Since Spark's hash function skips null inputs (a null leaves the running hash value unchanged), each of these rows reduces to the same sequence of non-null values and therefore gets the same hash. I want to treat each of these rows as different and get a unique row_hash for each. How can I achieve that?

CodePudding user response:

First, add a unique id, e.g.:

import org.apache.spark.sql.functions

val dfy = dfx.withColumn("seqVal", functions.monotonically_increasing_id())

Then apply the hash (and drop that extra column).
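A minimal end-to-end sketch of this approach on the question's df (note that monotonically_increasing_id depends on partition layout, so the resulting hash is not reproducible across runs):

import org.apache.spark.sql.functions.{col, hash, monotonically_increasing_id}

// Add a per-row unique id, include it in the hash input, then drop it
// again so the output schema matches the input.
val withId = df.withColumn("seqVal", monotonically_increasing_id())
val hashed = withId
  .withColumn("row_hash", hash(withId.columns.map(col): _*))
  .drop("seqVal")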

Alternatively, replace each null with a value that is unlikely to occur in any of the columns under consideration, either dynamically or statically, and then apply the hash. That said, the first option works regardless of what the columns contain.
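A minimal sketch of the static variant, assuming an arbitrary "__NULL__" sentinel that cannot occur in the data (the cast to string simply gives coalesce a common type across columns):

import org.apache.spark.sql.functions.{coalesce, col, hash, lit}

// Replace each null with a fixed sentinel before hashing, so permuted
// rows no longer collapse to the same sequence of non-null values.
val dfStatic = df.withColumn(
  "row_hash",
  hash(df.columns.map(c => coalesce(col(c).cast("string"), lit("__NULL__"))): _*)
)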

CodePudding user response:

Here is what I meant by my comment:

scala> df.show(false)
+----+----+----+---+
|c1  |c2  |c3  |c4 |
+----+----+----+---+
|C1  |C2  |null|C4 |
|C1  |null|C2  |C4 |
|null|C1  |C2  |C4 |
+----+----+----+---+

scala> val df2 = df.withColumn("row_hash",hash(df.columns.map(c => when(col(c).isNull,lit(c)).otherwise(col(c))):_*))

scala> df2.show(false)
+----+----+----+---+----------+
|c1  |c2  |c3  |c4 |row_hash  |
+----+----+----+---+----------+
|C1  |C2  |null|C4 |1490822089|
|C1  |null|C2  |C4 |-538395727|
|null|C1  |C2  |C4 |591026130 |
+----+----+----+---+----------+
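
This works because each null is replaced by the name of its own column, so a null in c2 contributes a different value to the hash than a null in c3, and the permuted rows no longer collide. The one caveat is that a row whose column literally contains that column's name as a value would hash the same as a row with a null there.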