I have a PySpark DataFrame, distributed across the cluster, as follows:
Name  ID
A     1
B     2
C     3
I want to modify the 'ID' column so that each value becomes a Python dictionary, with the column name as the key and the existing value as the value, like this:
Name  TRACEID
A     {ID:1}
B     {ID:2}
C     {ID:3}
How do I achieve this with PySpark? I need an efficient solution, since it is a large DataFrame distributed across the cluster. Thanks in advance.
CodePudding user response:
You can first construct a struct from the ID column, and then use the to_json function to convert it to the desired format:
from pyspark.sql import functions as F
df = df.select('Name', F.to_json(F.struct(F.col('ID'))).alias('TRACEID'))
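For reference, here is a sketch of what df.show() should print for the sample data above; note that the TRACEID column holds JSON strings rather than a map:
df.show()
# +----+--------+
# |Name| TRACEID|
# +----+--------+
# |   A|{"ID":1}|
# |   B|{"ID":2}|
# |   C|{"ID":3}|
# +----+--------+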
CodePudding user response:
You can use the create_map function:
from pyspark.sql.functions import col, lit, create_map
sparkDF.withColumn("ID_dict", create_map(lit("id"),col("ID"))).show()
# +----+---+---------+
# |Name| ID|  ID_dict|
# +----+---+---------+
# |   A|  1|{id -> 1}|
# |   B|  2|{id -> 2}|
# |   C|  3|{id -> 3}|
# +----+---+---------+
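Note that the question asks for the column name ID as the key, while the snippet above uses the lowercase literal "id". If you want the key to match the column name exactly, swap in lit("ID") (a minor tweak, not part of the original answer):
sparkDF.withColumn("ID_dict", create_map(lit("ID"), col("ID"))).show()
# gives {ID -> 1}, {ID -> 2}, {ID -> 3} instead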
Rename/drop columns:
df = sparkDF.withColumn("ID_dict", create_map(lit("id"), col("ID"))).drop(col("ID")).withColumnRenamed("ID_dict", "ID")
df.show()
# +----+---------+
# |Name|       ID|
# +----+---------+
# |   A|{id -> 1}|
# |   B|{id -> 2}|
# |   C|{id -> 3}|
# +----+---------+
df.printSchema()
# root
# |-- Name: string (nullable = true)
# |-- ID: map (nullable = false)
# | |-- key: string
# | |-- value: long (valueContainsNull = true)
You get a column with the map data type, which is well suited to representing a dictionary.
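As a quick sanity check (a sketch, assuming the DataFrame built above), a MapType column comes back as a plain Python dict when rows are collected to the driver:
rows = df.collect()
print(rows[0]["ID"])
# e.g. {'id': 1} for the first sample row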