Suppose we have the following data:
from pyspark.sql.types import StructType, StructField, StringType
data = [
(("James",None,"Smith"),"OH","M"),
(("Anna","Rose",""),"NY","F"),
(("Julia","","Williams"),"OH","F"),
(("Maria","Anne","Jones"),"NY","M"),
(("Jen","Mary","Brown"),"NY","M"),
(("Mike","Mary","Williams"),"OH","M")
]
schema = StructType([
StructField('name', StructType([
StructField('firstname', StringType(), True),
StructField('middlename', StringType(), True),
StructField('lastname', StringType(), True)
])),
StructField('state', StringType(), True),
StructField('gender', StringType(), True)
])
df = spark.createDataFrame(data = data, schema = schema)
with the following schema:
root
|-- name: struct (nullable = true)
| |-- firstname: string (nullable = true)
| |-- middlename: string (nullable = true)
| |-- lastname: string (nullable = true)
|-- state: string (nullable = true)
|-- gender: string (nullable = true)
So the name
column looks like this:
----------------------
|name |
----------------------
|[James,, Smith] |
|[Anna, Rose, ] |
|[Julia, , Williams] |
|[Maria, Anne, Jones] |
|[Jen, Mary, Brown] |
|[Mike, Mary, Williams]|
----------------------
Is there an easy way to get the hash value of each of the rows in the name
column? Or does hashing only work for unnested data?
CodePudding user response:
In order to create a hash from the struct type column, you first need to convert the struct to e.g. string. to_json
does the job. After that you can use a hash function like md5
.
F.md5(F.to_json('name'))
Using your example df:
df = df.withColumn('md5', F.md5(F.to_json('name')))
df.show(truncate=0)
# ---------------------- ----- ------ --------------------------------
# |name |state|gender|md5 |
# ---------------------- ----- ------ --------------------------------
# |{James, null, Smith} |OH |M |ad4f22b4a03070026957a65b3b8e5bf9|
# |{Anna, Rose, } |NY |F |c8dcb8f6f52c2e382c33bd92819cd500|
# |{Julia, , Williams} |OH |F |63a7c53d21f53e37b3724312b14a8e97|
# |{Maria, Anne, Jones} |NY |M |a0f2d3962be4941828a2b6f4a02d0ac5|
# |{Jen, Mary, Brown} |NY |M |cae64ee19dd2a0c9745a20e759a527e9|
# |{Mike, Mary, Williams}|OH |M |5e882c033be16bd679f450889e97be6d|
# ---------------------- ----- ------ --------------------------------