How to pass table column to rand function using spark.sql?


I have a table with two columns, A and B, in the Hive metastore. I have to generate the same random number whenever the value pair of A and B is identical to that of other records.

Example: for the value pairs 2.0 & 3.0 or 1.0 & 5.0, I use the HASH function on those records to get a hash_code. I want to pass this hash_code to the random function so that every matched record gets the same specific value. I can simply pass seed=123 to the rand function, but I am not able to pass a table column to it.


Edit 1:

The function rand(123), given the same constant seed, produces identical results on every run.
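For example (a sketch of what the screenshot showed: with a constant seed, the same column is reproduced on every run):

import pyspark.sql.functions as F

# both calls print the same three values because the seed is constant
spark.range(3).withColumn("rand", F.rand(123)).show()
spark.range(3).withColumn("rand", F.rand(123)).show()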


Passing hash to rand:

spark.sql("select *,rand(hash(A,B)) from table1").show()

This produces the following error:

AnalysisException: Input argument to rand must be an integer, long, or null constant.

How can I pass hash_code to the rand function using spark.sql?

CodePudding user response:

As also pointed out in the comments, rand accepts only one parameter, the seed, which must be a constant, not a column (and a column is what hash(A, B) returns).

If your purpose is producing a key from the columns A and B, then you shouldn't call any randomizing function at all. Just use the hash value directly.
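For example, a minimal sketch of that idea (assuming the table1 from your question):

import pyspark.sql.functions as F

# deterministic: identical (A, B) pairs always map to the same key
spark.table("table1").withColumn("key", F.hash("A", "B")).show()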

CodePudding user response:

Following on my comments, what you're looking for is simply not possible (at least for now) in Spark, for two main reasons:

  • The function rand only accepts constant parameters (the seed).
  • It is also non-deterministic: even with a constant seed, rand returns a different value on each row, so rand(hash(A,B)) would not give the same result for the same inputs A and B anyway:
import pyspark.sql.functions as F

# rand is seeded with the same constant 123, yet yields a different value on every row
spark.range(3).withColumn("rand", F.rand(123)).show()

#+---+-------------------+
#| id|               rand|
#+---+-------------------+
#|  0|0.24244888714603952|
#|  1| 0.4745014193615499|
#|  2|0.03951602781768582|
#+---+-------------------+

That said, if your intent is to get a fractional value between 0 and 1 from the hash of A and B, you could use this trick: divide the hash by 10^length_of_hash, i.e. a 1 followed by as many zeros as the hash has digits (note that hash can be negative, in which case the result lands in (-1, 0)):

spark.sql("""
SELECT  A, 
        B,  
        hash(A,B) / rpad('1', length(hash(A,B)) + 1, '0') AS Id  
FROM    table1
""").show()

#+---+---+------------+
#|  A|  B|          Id|
#+---+---+------------+
#|2.0|3.0|0.1475353518|
#|1.0|5.0| 0.649463331|
#|2.0|3.0|0.1475353518|
#|1.0|5.0| 0.649463331|
#+---+---+------------+
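For completeness, a sketch of the same trick with the DataFrame API; the df below is a hypothetical stand-in for table1, and F.expr simply embeds the SQL expression above:

import pyspark.sql.functions as F

# hypothetical stand-in for table1
df = spark.createDataFrame(
    [(2.0, 3.0), (1.0, 5.0), (2.0, 3.0), (1.0, 5.0)], ["A", "B"]
)

# identical (A, B) pairs hash to the same value, hence get the same Id
df.withColumn(
    "Id", F.expr("hash(A, B) / rpad('1', length(hash(A, B)) + 1, '0')")
).show()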