Hash of nested (struct) data

Time:07-26

Suppose we have the following data:

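from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# In the pyspark shell, `spark` already exists; getOrCreate() reuses it.
spark = SparkSession.builder.getOrCreate()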

data = [
    (("James", None, "Smith"), "OH", "M"),
    (("Anna", "Rose", ""), "NY", "F"),
    (("Julia", "", "Williams"), "OH", "F"),
    (("Maria", "Anne", "Jones"), "NY", "M"),
    (("Jen", "Mary", "Brown"), "NY", "M"),
    (("Mike", "Mary", "Williams"), "OH", "M"),
]
schema = StructType([
    StructField('name', StructType([
        StructField('firstname', StringType(), True),
        StructField('middlename', StringType(), True),
        StructField('lastname', StringType(), True),
    ])),
    StructField('state', StringType(), True),
    StructField('gender', StringType(), True),
])

df = spark.createDataFrame(data = data, schema = schema)

with the following schema:

root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- state: string (nullable = true)
 |-- gender: string (nullable = true)

So the name column looks like this:

+----------------------+
|name                  |
+----------------------+
|[James,, Smith]       |
|[Anna, Rose, ]        |
|[Julia, , Williams]   |
|[Maria, Anne, Jones]  |
|[Jen, Mary, Brown]    |
|[Mike, Mary, Williams]|
+----------------------+

Is there an easy way to get the hash value of each of the rows in the name column? Or does hashing only work for unnested data?

Answer:

String-based hash functions such as md5 cannot take a struct column directly, so first serialize the struct to a string. to_json does the job. After that, apply a hash function like md5 to the resulting string.

F.md5(F.to_json('name'))

Using your example df:

from pyspark.sql import functions as F

df = df.withColumn('md5', F.md5(F.to_json('name')))
df.show(truncate=0)
# +----------------------+-----+------+--------------------------------+
# |name                  |state|gender|md5                             |
# +----------------------+-----+------+--------------------------------+
# |{James, null, Smith}  |OH   |M     |ad4f22b4a03070026957a65b3b8e5bf9|
# |{Anna, Rose, }        |NY   |F     |c8dcb8f6f52c2e382c33bd92819cd500|
# |{Julia, , Williams}   |OH   |F     |63a7c53d21f53e37b3724312b14a8e97|
# |{Maria, Anne, Jones}  |NY   |M     |a0f2d3962be4941828a2b6f4a02d0ac5|
# |{Jen, Mary, Brown}    |NY   |M     |cae64ee19dd2a0c9745a20e759a527e9|
# |{Mike, Mary, Williams}|OH   |M     |5e882c033be16bd679f450889e97be6d|
# +----------------------+-----+------+--------------------------------+
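
To see what the serialize-then-hash approach does to a single row, here is a rough plain-Python sketch of the same idea. The helper name struct_md5 is made up for illustration, and the sketch assumes that to_json emits compact JSON and leaves out null fields, which is Spark's default as far as I know. It is not guaranteed to reproduce the exact digests in the table above.

```python
import hashlib
import json


def struct_md5(struct: dict) -> str:
    """Hypothetical helper: serialize a dict to compact JSON, then md5 it.

    Assumption: fields whose value is None are dropped before serializing,
    mirroring what to_json is assumed to do by default.
    """
    non_null = {k: v for k, v in struct.items() if v is not None}
    payload = json.dumps(non_null, separators=(",", ":"), ensure_ascii=False)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


# Two rows from the example data, written as plain dicts.
james = {"firstname": "James", "middlename": None, "lastname": "Smith"}
julia = {"firstname": "Julia", "middlename": "", "lastname": "Williams"}

print(struct_md5(james))
print(struct_md5(julia))
```

Because the field names are part of the serialized string, a null middlename and an empty-string middlename produce different JSON and therefore different hashes.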