Update column value of type JSON string with another JSON string value from different column-CodePudding

I have a PySpark dataframe where columns have JSON string values like this:

col1                          col2
{"d1":"2343","v1":"3434"} {"id1":"123"}
{"d1":"2344","v1":"3435"} {"id1":"124"}

I want to update "col1" JSON string values with "col2" JSON string values to get this:

col1                                     col2
{"d1":"2343","v1":"3434","id1":"123"}    {"id1":"123"}
{"d1":"2344","v1":"3435","id1":"124"}    {"id1":"124"}

How to do this in PySpark?

CodePudding user response：

Since you're dealing with string type columns, you can remove the last } from "col1", remove the first { from "col2" and join the strings together with comma , as delimiter.

Input:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('{"d1":"2343","v1":"3434"}', '{"id1":"123"}'),
     ('{"d1":"2344","v1":"3435"}', '{"id1":"124"}')],
    ["col1", "col2"])

Script:

df = df.withColumn(
    "col1",
    F.concat_ws(
        ",",
        F.regexp_replace("col1", r"}$", ""),
        F.regexp_replace("col2", r"^\{", "")
    )
)

df.show(truncate=0)
#  ------------------------------------- ------------- 
# |col1                                 |col2         |
#  ------------------------------------- ------------- 
# |{"d1":"2343","v1":"3434","id1":"123"}|{"id1":"123"}|
# |{"d1":"2344","v1":"3435","id1":"124"}|{"id1":"124"}|
#  ------------------------------------- -------------