I have a PySpark dataframe where columns have JSON string values like this:
col1 col2
{"d1":"2343","v1":"3434"} {"id1":"123"}
{"d1":"2344","v1":"3435"} {"id1":"124"}
I want to update "col1" JSON string values with "col2" JSON string values to get this:
col1 col2
{"d1":"2343","v1":"3434","id1":"123"} {"id1":"123"}
{"d1":"2344","v1":"3435","id1":"124"} {"id1":"124"}
How to do this in PySpark?
CodePudding user response:
Since you're dealing with string type columns, you can remove the last }
from "col1", remove the first {
from "col2" and join the strings together with comma ,
as delimiter.
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('{"d1":"2343","v1":"3434"}', '{"id1":"123"}'),
('{"d1":"2344","v1":"3435"}', '{"id1":"124"}')],
["col1", "col2"])
Script:
df = df.withColumn(
"col1",
F.concat_ws(
",",
F.regexp_replace("col1", r"}$", ""),
F.regexp_replace("col2", r"^\{", "")
)
)
df.show(truncate=0)
# ------------------------------------- -------------
# |col1 |col2 |
# ------------------------------------- -------------
# |{"d1":"2343","v1":"3434","id1":"123"}|{"id1":"123"}|
# |{"d1":"2344","v1":"3435","id1":"124"}|{"id1":"124"}|
# ------------------------------------- -------------