I have a column of type binary. The values are 4 bytes long, and I would like to interpret them as an Int. An example DataFrame looks like this:
val df = Seq(
(Array(0x00.toByte, 0x00.toByte, 0x02.toByte, 0xe6.toByte))
).toDF("binary_value")
Here the 4 bytes can be interpreted as a U32 to form the number 742. Using a UDF, the value can be decoded like this:
val bytesToInt = udf((x: Array[Byte]) => BigInt(x).toInt)
df.withColumn("numerical_value", bytesToInt('binary_value))
It works, but at the cost of using a UDF and the corresponding serialization/deserialization overhead. I was hoping to do something like 'binary_value.cast("array<byte>") and take it from there, or even 'binary_value.cast("int"), but Spark doesn't allow it.
Is there a way to interpret the binary column to another data type using Spark native functions?
CodePudding user response:
One way could be to convert to hex (using hex) and then to decimal (using conv).
conv(hex($"binary_value"), 16, 10)
df.withColumn("numerical_value", conv(hex($"binary_value"), 16, 10)).show()
// +-------------+---------------+
// | binary_value|numerical_value|
// +-------------+---------------+
// |[00 00 02 E6]|            742|
// +-------------+---------------+
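Note that conv returns a string column, so if you need a numeric type downstream you can chain a cast onto it. A minimal sketch under that assumption, using the same df and column names as above:

import org.apache.spark.sql.functions.{conv, hex}

// conv(..., 16, 10) yields a StringType column, so cast it for a numeric result.
// Casting to long (rather than int) keeps U32 values above Int.MaxValue intact.
df.withColumn("numerical_value", conv(hex($"binary_value"), 16, 10).cast("long"))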