I have a PySpark DataFrame, distributed across the cluster, as follows:
Name  ID
A     1
B     2
C     3
I want to modify the 'ID' column so that each value becomes a Python dictionary, with the column name as the key and the existing value as the value, like this:
Name  TRACEID
A     {ID:1}
B     {ID:2}
C     {ID:3}
How do I achieve this with PySpark? I need an efficient solution, since it is a large DataFrame distributed across the cluster. Thanks in advance.
CodePudding user response:
You can first construct a struct from the ID column, and then use the to_json function to convert it to the desired format:
from pyspark.sql import functions as F
df = df.select('Name', F.to_json(F.struct(F.col('ID'))).alias('TRACEID'))
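For reference, here is a sketch of what df.show() should print for the sample data above; note that the TRACEID column holds JSON strings rather than a map:
df.show()
# +----+--------+
# |Name| TRACEID|
# +----+--------+
# |   A|{"ID":1}|
# |   B|{"ID":2}|
# |   C|{"ID":3}|
# +----+--------+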
CodePudding user response:
You can use the create_map function:
from pyspark.sql.functions import col, lit, create_map
sparkDF.withColumn("ID_dict", create_map(lit("id"),col("ID"))).show()
# +----+---+---------+
# |Name| ID|  ID_dict|
# +----+---+---------+
# |   A|  1|{id -> 1}|
# |   B|  2|{id -> 2}|
# |   C|  3|{id -> 3}|
# +----+---+---------+
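Note that the question asks for the column name ID as the key, while the snippet above uses the lowercase literal "id". If you want the key to match the column name exactly, swap in lit("ID") (a minor tweak, not part of the original answer):
sparkDF.withColumn("ID_dict", create_map(lit("ID"), col("ID"))).show()
# gives {ID -> 1}, {ID -> 2}, {ID -> 3} instead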
Rename/drop columns:
df = sparkDF.withColumn("ID_dict", create_map(lit("id"), col("ID"))).drop(col("ID")).withColumnRenamed("ID_dict", "ID")
df.show()
# +----+---------+
# |Name|       ID|
# +----+---------+
# |   A|{id -> 1}|
# |   B|{id -> 2}|
# |   C|{id -> 3}|
# +----+---------+
df.printSchema()
# root
# |-- Name: string (nullable = true)
# |-- ID: map (nullable = false)
# | |-- key: string
# | |-- value: long (valueContainsNull = true)
You get a column with the map data type, which is well suited to representing a dictionary.
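As a quick sanity check (a sketch, assuming the DataFrame built above), a MapType column comes back as a plain Python dict when rows are collected to the driver:
rows = df.collect()
print(rows[0]["ID"])
# e.g. {'id': 1} for the first sample row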