Home > database >  Update column value of type JSON string with another JSON string value from different column
Update column value of type JSON string with another JSON string value from different column

Time:10-05

I have a PySpark dataframe where columns have JSON string values like this:

col1                          col2
{"d1":"2343","v1":"3434"} {"id1":"123"}
{"d1":"2344","v1":"3435"} {"id1":"124"}

I want to update "col1" JSON string values with "col2" JSON string values to get this:

col1                                     col2
{"d1":"2343","v1":"3434","id1":"123"}    {"id1":"123"}
{"d1":"2344","v1":"3435","id1":"124"}    {"id1":"124"}

How to do this in PySpark?

CodePudding user response:

Since you're dealing with string type columns, you can remove the last } from "col1", remove the first { from "col2" and join the strings together with comma , as delimiter.

Input:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('{"d1":"2343","v1":"3434"}', '{"id1":"123"}'),
     ('{"d1":"2344","v1":"3435"}', '{"id1":"124"}')],
    ["col1", "col2"])

Script:

df = df.withColumn(
    "col1",
    F.concat_ws(
        ",",
        F.regexp_replace("col1", r"}$", ""),
        F.regexp_replace("col2", r"^\{", "")
    )
)

df.show(truncate=0)
#  ------------------------------------- ------------- 
# |col1                                 |col2         |
#  ------------------------------------- ------------- 
# |{"d1":"2343","v1":"3434","id1":"123"}|{"id1":"123"}|
# |{"d1":"2344","v1":"3435","id1":"124"}|{"id1":"124"}|
#  ------------------------------------- ------------- 
  • Related