I have a dataframe like below in which it has both key and value pairs.
| Name | Age | Location
| abc | 12 | loc1
| def | 13 | loc2
I wanted to create a json string out of it. When I do a .toJSON()
and collect I am getting the below json string from that dataframe
[{"name":"abc","age":12,"location":"loc1"},{"name":"def","age":13,"location":"loc2"}]
Is it possible to just get the value part of json instead of getting keys like below ?
Expected output:
[{"abc",12,"loc1"},{"def",13,"loc2"}]
Code I am using to convert dataframe to json
df.toJSON().collect()
It would be great if we can get some solutions in PYSPARK.
CodePudding user response:
The result you expect is not a valid JSON, so you can't use toJSON()
on dataframe or to_json
function.
You can try something like this using format_string
function to get the structure you want:
from pyspark.sql import functions as F
result = df.groupBy().agg(
F.collect_list(
F.format_string('{"%s", %s, "%s"}', *df.columns)
).alias("records")
).first()["records"]
print(result)
# ['{"abc", 12, "loc1"}', '{"def", 13, "loc2"}']