Decode binary data in Spark with native column functions


I have a column of type binary. The values are 4 bytes long, and I would like to interpret them as an Int. An example DataFrame looks like this:

import spark.implicits._

val df = Seq(
  Array(0x00.toByte, 0x00.toByte, 0x02.toByte, 0xe6.toByte)
).toDF("binary_value")

The 4 bytes in this example can be interpreted as an unsigned 32-bit integer, forming the number 742. Using a UDF, the value can be decoded like this:

import org.apache.spark.sql.functions.udf

// Interpret the byte array as a big-endian, two's-complement integer
val bytesToInt = udf((x: Array[Byte]) => BigInt(x).toInt)

df.withColumn("numerical_value", bytesToInt('binary_value))

It works, but at the cost of using a UDF and the corresponding serialization/deserialization overhead. I was hoping to do something like 'binary_value.cast("array<byte>") and take it from there, or even 'binary_value.cast("int"), but Spark doesn't allow it.

Is there a way to interpret the binary column as another data type using Spark's native functions?

CodePudding user response:

One way could be to convert the binary value to hex (using hex) and then to decimal (using conv).

conv(hex($"binary_value"), 16, 10)
df.withColumn("numerical_value", conv(hex($"binary_value"), 16, 10)).show()
//  ------------- --------------- 
// | binary_value|numerical_value|
//  ------------- --------------- 
// |[00 00 02 E6]|            742|
//  ------------- --------------- 
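
Note that conv returns a string column, and because hex produces the unsigned hexadecimal representation of the bytes, this approach treats the 4 bytes as unsigned, whereas the BigInt UDF gives a signed (two's-complement) result. If a numeric column is needed, the result can be cast; a minimal sketch, reusing the column names from the question:

import org.apache.spark.sql.functions.{conv, hex}

// conv returns a StringType column; cast it to get a numeric column.
// Using long so unsigned 32-bit values above Int.MaxValue still fit.
df.withColumn(
  "numerical_value",
  conv(hex($"binary_value"), 16, 10).cast("long")
).show()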