Decode binary data in Spark with native column functions


I have a column of type binary. The values are 4 bytes long, and I would like to interpret them as an Int. An example DataFrame looks like this:

import spark.implicits._

val df = Seq(
  Array(0x00.toByte, 0x00.toByte, 0x02.toByte, 0xe6.toByte)
).toDF("binary_value")

The 4 bytes in this example can be interpreted as an unsigned 32-bit integer, forming the number 742. Using a UDF, the value can be decoded like this:

import org.apache.spark.sql.functions.udf

// Interpret the byte array as a big-endian, two's-complement integer
val bytesToInt = udf((x: Array[Byte]) => BigInt(x).toInt)

df.withColumn("numerical_value", bytesToInt('binary_value))

It works, but at the cost of using a UDF and the corresponding serialization/deserialization overhead. I was hoping to do something like 'binary_value.cast("array<byte>") and take it from there, or even 'binary_value.cast("int"), but Spark doesn't allow it.

Is there a way to interpret the binary column as another data type using Spark's native functions?

CodePudding user response:

One way could be to convert the binary value to hex (using hex) and then to decimal (using conv).

conv(hex($"binary_value"), 16, 10)
df.withColumn("numerical_value", conv(hex($"binary_value"), 16, 10)).show()
//  ------------- --------------- 
// | binary_value|numerical_value|
//  ------------- --------------- 
// |[00 00 02 E6]|            742|
//  ------------- --------------- 
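
Note that conv returns a string column, and because hex produces the unsigned hexadecimal representation of the bytes, this approach treats the 4 bytes as unsigned, whereas the BigInt UDF gives a signed (two's-complement) result. If a numeric column is needed, the result can be cast; a minimal sketch, reusing the column names from the question:

import org.apache.spark.sql.functions.{conv, hex}

// conv returns a StringType column; cast it to get a numeric column.
// Using long so unsigned 32-bit values above Int.MaxValue still fit.
df.withColumn(
  "numerical_value",
  conv(hex($"binary_value"), 16, 10).cast("long")
).show()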