I have a DataFrame:
import pandas as pd
data = {
"ID": [123123, 222222, 333333],
"Main Authors": ["[Jim Allen, Tim H]", "[Rob Garder, Harry S, Tim H]", "[Wo Shu, Tee Ru, Fuu Wan, Gee Han]"],
"Abstract": ["This is paper about hehe", "This paper is very nice", "Hello there paper from kellogs"],
"paper IDs": ["[123768, 123123]", "[123432, 34345, 353545, 454545]", "[123123, 3433434, 55656655, 988899]"],
}
df = pd.DataFrame(data)
and I am trying to export it to JSON in a specific layout. I do so via
df.to_json(orient='records')
'[{"ID":123123,"Main Authors":"[Jim Allen, Tim H]","Abstract":"This is paper about hehe","paper IDs":"[123768, 123123]"},
{"ID":222222,"Main Authors":"[Rob Garder, Harry S, Tim H]","Abstract":"This paper is very nice","paper IDs":"[123432, 34345, 353545, 454545]"},
{"ID":333333,"Main Authors":"[Wo Shu, Tee Ru, Fuu Wan, Gee Han]","Abstract":"Hello there paper from kellogs","paper IDs":"[123123, 3433434, 55656655, 988899]"}]'
but this is not the format I need. How can I get my output to look like this:
{"ID": "123123", "Main Authors": ["Jim Allen", "Tim H"], "Abstract": "This is paper about hehe", "paper IDs": ["123768", "123123"]}
{and so on for paper 2...}
I can't find an easy way to achieve this schema with the basic functions.
CodePudding user response:
to_json returns a proper JSON document. What you want is not a JSON document. Add lines=True to the call:
df.to_json(orient='records', lines=True)
The output you want is not a single valid JSON document; it is a sequence of separate JSON objects. That is a very common way to stream JSON objects though: write one unindented JSON object per line.
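As a minimal sketch of what lines=True produces, the snippet below uses a cut-down frame (two columns and two rows, chosen only for illustration) and parses each output line back with the standard json module. The variable names json_lines and records are assumptions of this sketch, not part of the question.

```python
import json

import pandas as pd

# Cut-down illustrative frame; not the full data from the question.
df = pd.DataFrame({
    "ID": [123123, 222222],
    "Abstract": ["This is paper about hehe", "This paper is very nice"],
})

# One JSON object per line instead of a single JSON array.
json_lines = df.to_json(orient="records", lines=True)

# Each non-empty line is a complete JSON document on its own.
records = [json.loads(line) for line in json_lines.splitlines() if line.strip()]
print(records[0])
```

Because every line parses independently, a consumer can handle one record at a time without loading the whole output.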
Streaming JSON is an old technique, used to write JSON records to logs, send them over the network, etc. There's no single agreed specification for this, but a lot of people have tried to hijack it, even creating sites that mirror Douglas Crockford's original JSON site or mimic the language of RFCs.
Streaming JSON formats are used a lot in IoT and event processing applications, where events will arrive over a long period of time.
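To illustrate why one object per line suits long-running event streams, here is a small sketch using only the standard library. The helper names append_event and read_events and the sensor fields are hypothetical, and io.StringIO merely stands in for a log file or socket.

```python
import io
import json


def append_event(stream, event):
    """Write one event as a single compact JSON line."""
    stream.write(json.dumps(event, separators=(",", ":")) + "\n")


def read_events(stream):
    """Yield events one at a time, skipping blank lines."""
    for line in stream:
        if line.strip():
            yield json.loads(line)


# io.StringIO stands in for a log file or socket in this sketch.
log = io.StringIO()
append_event(log, {"sensor": "t1", "value": 21.5})
append_event(log, {"sensor": "t1", "value": 21.7})

log.seek(0)
events = list(read_events(log))
```

New events can be appended at any time without rewriting what is already there, which a single enclosing JSON array would not allow.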
PS: I remembered that a few months ago I saw a question about json-seq. It seems there was an attempt to standardize streaming JSON in RFC 7464 as JSON Text Sequences, using the MIME type application/json-seq.
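For comparison, RFC 7464 frames each JSON text with a leading ASCII record separator (0x1E) and a trailing line feed, rather than relying on newlines alone. The following is only a rough sketch of that framing: the function names are hypothetical, and it ignores the RFC's guidance on recovering from truncated or malformed entries.

```python
import json

RS = "\x1e"  # ASCII record separator that precedes every JSON text in RFC 7464


def encode_json_seq(records):
    """Frame each record as RS + JSON text + LF."""
    return "".join(f"{RS}{json.dumps(record)}\n" for record in records)


def decode_json_seq(text):
    """Split on RS and parse each non-empty chunk (no error recovery)."""
    return [json.loads(chunk) for chunk in text.split(RS) if chunk.strip()]


papers = [{"ID": "123123"}, {"ID": "222222"}]
encoded = encode_json_seq(papers)
decoded = decode_json_seq(encoded)
```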
CodePudding user response:
You can convert the DataFrame to a list of dictionaries first. Note that the list columns below hold real Python lists rather than strings.
import pandas as pd
data = {
"ID": [123123, 222222, 333333],
"Main Authors": [["Jim Allen", "Tim H"], ["Rob Garder", "Harry S", "Tim H"], ["Wo Shu", "Tee Ru", "Fuu Wan", "Gee Han"]],
"Abstract": ["This is paper about hehe", "This paper is very nice", "Hello there paper from kellogs"],
"paper IDs": [[123768, 123123], [123432, 34345, 353545, 454545], [123123, 3433434, 55656655, 988899]],
}
df = pd.DataFrame(data)
df.to_dict('records')
The result:
[{'ID': 123123,
'Main Authors': ['Jim Allen', 'Tim H'],
'Abstract': 'This is paper about hehe',
'paper IDs': [123768, 123123]},
{'ID': 222222,
'Main Authors': ['Rob Garder', 'Harry S', 'Tim H'],
'Abstract': 'This paper is very nice',
'paper IDs': [123432, 34345, 353545, 454545]},
{'ID': 333333,
'Main Authors': ['Wo Shu', 'Tee Ru', 'Fuu Wan', 'Gee Han'],
'Abstract': 'Hello there paper from kellogs',
'paper IDs': [123123, 3433434, 55656655, 988899]}]
Is that what you are looking for?
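If the columns hold strings such as "[Jim Allen, Tim H]", as in the question, they have to be split into real lists before either approach produces JSON arrays. Here is one possible sketch, assuming that items never contain commas or brackets themselves; split_bracketed is a hypothetical helper, and the frame is cut down to two rows for brevity.

```python
import json

import pandas as pd

df = pd.DataFrame({
    "ID": [123123, 222222],
    "Main Authors": ["[Jim Allen, Tim H]", "[Rob Garder, Harry S, Tim H]"],
    "paper IDs": ["[123768, 123123]", "[123432, 34345, 353545, 454545]"],
})


def split_bracketed(text):
    """Hypothetical helper: "[a, b]" -> ["a", "b"]. Assumes no commas inside items."""
    inner = text.strip().strip("[]")
    return [item.strip() for item in inner.split(",") if item.strip()]


for column in ["Main Authors", "paper IDs"]:
    df[column] = df[column].apply(split_bracketed)

# The desired output shows IDs as strings, so cast them.
df["ID"] = df["ID"].astype(str)

# One JSON object per record, one record per line.
lines = [json.dumps(record) for record in df.to_dict("records")]
print("\n".join(lines))
```

After the conversion, df.to_json(orient='records', lines=True) should give equivalent records as well, since the cells now contain lists instead of strings.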