PySpark dataframe transformation - to get value part from JSON


I have a dataframe like the one below, which holds both keys (column names) and values.

| Name | Age | Location |
|------|-----|----------|
| abc  | 12  | loc1     |
| def  | 13  | loc2     |

I want to create a JSON string out of it. When I call .toJSON() and collect, I get the following JSON from that dataframe:

[{"name":"abc","age":12,"location":"loc1"},{"name":"def","age":13,"location":"loc2"}]

Is it possible to get just the value part of the JSON, without the keys, like below?

Expected output:

[{"abc",12,"loc1"},{"def",13,"loc2"}]

Code I am using to convert the dataframe to JSON:

df.toJSON().collect()
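
For reference, a minimal way to reproduce the dataframe above (column names assumed from the JSON output) is:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table above; column names are assumed
df = spark.createDataFrame(
    [("abc", 12, "loc1"), ("def", 13, "loc2")],
    ["name", "age", "location"],
)

print(df.toJSON().collect())
# ['{"name":"abc","age":12,"location":"loc1"}', '{"name":"def","age":13,"location":"loc2"}']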

It would be great to get a solution in PySpark.

CodePudding user response:

The result you expect is not valid JSON, so you can't produce it with toJSON() on the dataframe or with the to_json function.

You can try something like this using the format_string function to get the structure you want:

from pyspark.sql import functions as F

# Format each row as {"Name", Age, "Location"}, then gather all
# formatted rows into a single list with collect_list
result = df.groupBy().agg(
    F.collect_list(
        F.format_string('{"%s", %s, "%s"}', *df.columns)
    ).alias("records")
).first()["records"]

print(result)
# ['{"abc", 12, "loc1"}', '{"def", 13, "loc2"}']