I am using pandas 1.4.3 and Python 3.8 on Ubuntu.
I am dumping a training dataframe for a machine learning model to JSON with the records orient (which I find the most readable format in a text editor), but I cannot read it back with pandas. Here is the example code:
filename='train_data.json'
train_data.to_json(path_or_buf=filename, orient='records', lines=True, indent=4)
pd.read_json(path_or_buf=filename, orient='records') # Throws error
How can I read it back with pandas as exactly the same dataframe (ignoring the index; I also do not want any extra unnamed column)?
One more constraint: the lines and indent arguments used while dumping are meant purely as a visual aid, not to alter how pandas parses the file. Is it possible to avoid making any assumptions about those arguments while reading back, i.e. to decouple the read method from the visual aspects of the file?
I am asking because in production the reading and writing will likely happen in different processes.
CodePudding user response:
The issue is that lines=True combined with indent=4 creates invalid JSON.
Adding lines=True to your read_json call makes the reader assume one JSON object per line, so that will not work as long as you keep indent=4, and read_json unfortunately has no indent argument yet.
You can't have it both ways here: you can have either lines=True or indent=4, but not both, if you still want to be able to read your JSON back.
If you open the file, you'll see it is invalid:
{
    "0":1,
    "1":2,
    "2":3
}
{
    "0":"a",
    "1":"b",
    "2":"c"
}
I suggest that, for legibility, you stick with just indent:
import pandas as pd
train_data=pd.DataFrame([[1,2,3], ['a','b','c']])
filename='train_data.json'
train_data.to_json(path_or_buf=filename, orient='records', indent=4)
read_in_train_data = pd.read_json(path_or_buf=filename, orient='records')
assert read_in_train_data.equals(train_data)
Reading the dataframe works and you can see your JSON file is also now well-formatted:
[
    {
        "0":1,
        "1":2,
        "2":3
    },
    {
        "0":"a",
        "1":"b",
        "2":"c"
    }
]
Finally, about the index: you would ordinarily add index=False to your df.to_json() call to exclude it (in pandas 1.4 this is only supported for the 'split' and 'table' orients), but there is no need for that here, because the records orient does not write the index at all.
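If you would rather keep the line-delimited layout (one record per line is handy when another process streams the file back in), the trade-off goes the other way: drop indent and pass lines=True on both the writing and the reading side. A minimal sketch, reusing the train_data from above:

import pandas as pd

train_data = pd.DataFrame([[1, 2, 3], ['a', 'b', 'c']])
filename = 'train_data.json'

# One JSON object per line; indent has to be omitted to keep the file valid.
train_data.to_json(path_or_buf=filename, orient='records', lines=True)

# The reader must also be told the file is line-delimited.
read_in_train_data = pd.read_json(path_or_buf=filename, orient='records', lines=True)
assert read_in_train_data.equals(train_data)

Either way, the writer and the reader have to agree on whether lines=True was used, so the reading code cannot be made completely independent of how the file was written.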
CodePudding user response:
Your problem comes down to how pandas writes and reads the JSON file.
According to the pandas documentation, you have to be careful about which JSON layout (the orient argument) is used by the write and read methods.
With orient='records' there are several constraints that produce exactly the problem you describe. They could be worked around with a custom reading function, but that effort is unnecessary, because there are simpler, ready-made options.
Rather than dealing with all the constraints of that orient, you can pick another orient for the write/read calls so that the DataFrame's index and columns are both preserved. I suggest orient='index', which does exactly what you need (even with the indentation you want for visual inspection). Note that 'index' is not the default for a DataFrame (DataFrame.to_json defaults to orient='columns'), so pass it explicitly on both sides.
Below is a snippet that illustrates this approach.
import pandas as pd
import os

JSONFILE = os.path.join(os.getcwd(), "temp.json")

def writeToJson(df: pd.DataFrame):
    """
    With orient='index' (and indent=4 for readability) the resulting file
    keeps both the row labels and the column names:
    '''
    {
        "row 1":{
            "col 1":"a",
            "col 2":"b"
        },
        "row 2":{
            "col 1":"c",
            "col 2":"d"
        }
    }
    '''
    Note: lines=True is only valid together with orient='records', so it
    cannot be combined with this orient.
    """
    df.to_json(JSONFILE, orient='index', indent=4)

def readJson():
    # The orient must match the one used when writing.
    return pd.read_json(JSONFILE, orient='index')

if __name__ == "__main__":
    df = pd.DataFrame(
        [["a", "b"], ["c", "d"]],
        index=["row 1", "row 2"],
        columns=["col 1", "col 2"],
    )
    writeToJson(df)
    df2 = readJson()
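To check that the round trip preserves both the index and the columns, you can add a quick comparison at the end of the __main__ block:

    # "row 1"/"row 2" and "col 1"/"col 2" survive the round trip
    assert df2.equals(df)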