We have the below DataFrame in Spark Scala:
| firstname | middlename | lastname | dob | gender | salary |
|---|---|---|---|---|---|
| Michael | Rose | null | 2000-05-19 | M | 4000 |
| Michael | null | Rose | 2000-05-19 | M | 4000 |
| null | Michael | Rose | 2000-05-19 | M | 4000 |
Here, we want to create a unique row_hash for each row's data in another DataFrame, so we apply the following transformation:
import org.apache.spark.sql.functions.{col, hash}

val df2 = df.withColumn("row_hash", hash(df.columns.map(col): _*))
And we get the following:
| firstname | middlename | lastname | dob | gender | salary | row_hash |
|---|---|---|---|---|---|---|
| Michael | Rose | null | 2000-05-19 | M | 4000 | -613328421 |
| Michael | null | Rose | 2000-05-19 | M | 4000 | -613328421 |
| null | Michael | Rose | 2000-05-19 | M | 4000 | -613328421 |
I want to treat each of these rows as distinct and get a unique row_hash for each of them. How can I achieve that?
CodePudding user response:
The rows collide because Spark's hash leaves its running value unchanged for null inputs, so all three rows effectively hash the same sequence of non-null values. First add a unique id, e.g.:

import org.apache.spark.sql.functions._
val dfy = dfx.withColumn("seqVal", monotonically_increasing_id())

Then apply the hash (and drop that extra column afterwards).
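A minimal end-to-end sketch of that first option, assuming the df from the question (dfWithId and dfHashed are illustrative names):

import org.apache.spark.sql.functions._

// Add a per-row id so otherwise-identical hash inputs become distinct,
// hash all columns including the id, then drop the helper column.
val dfWithId = df.withColumn("seqVal", monotonically_increasing_id())
val dfHashed = dfWithId
  .withColumn("row_hash", hash(dfWithId.columns.map(col): _*))
  .drop("seqVal")

One caveat: monotonically_increasing_id() is not stable across runs or repartitions, so the same logical row can receive a different row_hash each time it is computed.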
Alternatively, replace the null values with a value unlikely to occur in any of the columns under consideration, chosen dynamically or statically, then apply the hash. That said, the first option is the more general, blanket solution.
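A minimal sketch of the static variant, assuming a sentinel string "__NULL__" that never occurs in the data (the next answer shows a dynamic variant that uses the column name as the sentinel):

import org.apache.spark.sql.functions._

// Cast every column to string and substitute the sentinel for nulls,
// so a null's position contributes to the hash input.
val dfHashed = df.withColumn(
  "row_hash",
  hash(df.columns.map(c => coalesce(col(c).cast("string"), lit("__NULL__"))): _*)
)

Note that casting to string changes the hash values relative to hashing the native types, which is fine as long as it is applied consistently wherever the hashes are compared.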
CodePudding user response:
Here is what I meant by my comment: replace each null with its column name, so the position of the null contributes to the hash input:
scala> df.show()
+----+----+----+---+
|  c1|  c2|  c3| c4|
+----+----+----+---+
|  C1|  C2|null| C4|
|  C1|null|  C2| C4|
|null|  C1|  C2| C4|
+----+----+----+---+
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val df2 = df.withColumn("row_hash", hash(df.columns.map(c => when(col(c).isNull, lit(c)).otherwise(col(c))): _*))
scala> df2.show()
+----+----+----+---+----------+
|  c1|  c2|  c3| c4|  row_hash|
+----+----+----+---+----------+
|  C1|  C2|null| C4|1490822089|
|  C1|null|  C2| C4|-538395727|
|null|  C1|  C2| C4| 591026130|
+----+----+----+---+----------+