How to pass table column to rand function using spark.sql?


I have a table with two columns, A and B, in the Hive metastore. I have to generate the same random number whenever the value pair of A and B is identical to that of other records.

Example: for the value pairs 2.0 & 3.0 or 1.0 & 5.0, I use the HASH function on those records to get a hash_code. I want to pass this hash_code to the random function so that every matched record gets the same specific value. I can simply pass seed=123 to the rand function, but I am not able to pass a table column to it.


Edit 1:

The function rand(123), given the same constant seed, produces identical results on every run.
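For example (a sketch of what the screenshot showed: with a constant seed, the same column is reproduced on every run):

import pyspark.sql.functions as F

# both calls print the same three values because the seed is constant
spark.range(3).withColumn("rand", F.rand(123)).show()
spark.range(3).withColumn("rand", F.rand(123)).show()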


Passing hash to rand:

spark.sql("select *,rand(hash(A,B)) from table1").show()

This produces the following error:

AnalysisException: Input argument to rand must be an integer, long, or null constant.

How can I pass hash_code to the rand function using spark.sql?

CodePudding user response:

As also pointed out in the comments, rand accepts only one parameter, the seed, which must be a constant, not a column (and a column is what hash(A, B) returns).

If your purpose is producing a key from the columns A and B, then you shouldn't call any randomizing function at all. Just use the hash value directly.
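For example, a minimal sketch of that idea (assuming the table1 from your question):

import pyspark.sql.functions as F

# deterministic: identical (A, B) pairs always map to the same key
spark.table("table1").withColumn("key", F.hash("A", "B")).show()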

CodePudding user response:

Following on my comments, what you're looking for is simply not possible (at least for now) in Spark, for two main reasons:

  • The function rand only accepts constant parameters (the seed).
  • It is also non-deterministic: even with a constant seed, rand returns a different value on each row, so rand(hash(A,B)) would not give the same result for the same inputs A and B anyway:
import pyspark.sql.functions as F

# rand is seeded with the same constant 123, yet yields a different value on every row
spark.range(3).withColumn("rand", F.rand(123)).show()

#+---+-------------------+
#| id|               rand|
#+---+-------------------+
#|  0|0.24244888714603952|
#|  1| 0.4745014193615499|
#|  2|0.03951602781768582|
#+---+-------------------+

That said, if your intent is to get a fractional value between 0 and 1 from the hash of A and B, you could use this trick: divide the hash by 10^length_of_hash, i.e. a 1 followed by as many zeros as the hash has digits (note that hash can be negative, in which case the result lands in (-1, 0)):

spark.sql("""
SELECT  A, 
        B,  
        hash(A,B) / rpad('1', length(hash(A,B)) + 1, '0') AS Id  
FROM    table1
""").show()

#+---+---+------------+
#|  A|  B|          Id|
#+---+---+------------+
#|2.0|3.0|0.1475353518|
#|1.0|5.0| 0.649463331|
#|2.0|3.0|0.1475353518|
#|1.0|5.0| 0.649463331|
#+---+---+------------+
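For completeness, a sketch of the same trick with the DataFrame API; the df below is a hypothetical stand-in for table1, and F.expr simply embeds the SQL expression above:

import pyspark.sql.functions as F

# hypothetical stand-in for table1
df = spark.createDataFrame(
    [(2.0, 3.0), (1.0, 5.0), (2.0, 3.0), (1.0, 5.0)], ["A", "B"]
)

# identical (A, B) pairs hash to the same value, hence get the same Id
df.withColumn(
    "Id", F.expr("hash(A, B) / rpad('1', length(hash(A, B)) + 1, '0')")
).show()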