Home > OS >  How to read back a dataframe dumped as json file using pandas?
How to read back a dataframe dumped as json file using pandas?

Time:10-05

Using pandas version 1.4.3 and python 3.8 on Ubuntu.

I am dumping some training dataframe for a machine learning model using json orient format (which I find most intuitive on a text editor), but cannot read it back using pandas. Here is the example code.

filename='train_data.json'
train_data.to_json(path_or_buf=filename, orient='records', lines=True, indent=4)
pd.read_json(path_or_buf=filename, orient='records') # Throws error 

How to read it back as the exact same dataframe (except the index, and I do not want any extra unnamed column) using pandas?

Also, one more constraint is, the lines and indent arguments while dumping are meant for visual aid, not to alter the parsing of the file by pandas. Is it possible to not make any assumption about those arguments while reading back, i.e. decouple the read method from visual aspects of the file?

Asking this because in production, likely the reading and writing will happen in different processes.

CodePudding user response:

The issue is that lines=True combined with indent=4 creates an invalid JSON.

Note that adding lines=True to your read_json call will have the reader assume one JSON object per line, so this will not work if you keep indent=4. read_json has no indent argument yet, unfortunately.

You can't have it both ways here, you can either have lines=True or indent=4, but not both to still be able to read your JSON.

If you open the file, you'll see it is invalid:


    {
        "0":1,
        "1":2,
        "2":3
    }

    {
        "0":"a",
        "1":"b",
        "2":"c"
    }

I suggest for legibility you stick with just indent:

import pandas as pd

train_data=pd.DataFrame([[1,2,3], ['a','b','c']])
filename='train_data.json'
train_data.to_json(path_or_buf=filename, orient='records', indent=4)
read_in_train_data = pd.read_json(path_or_buf=filename, orient='records')
assert read_in_train_data.equals(train_data)

Reading the dataframe works and you can see your JSON file is also now well-formatted:

[
    {
        "0":1,
        "1":2,
        "2":3
    },
    {
        "0":"a",
        "1":"b",
        "2":"c"
    }
]

Finally, to exclude the index, we would ordinarily add argument index=False to your df.to_json(), but there is no need for this here as you are using the records orient.

CodePudding user response:

Your problem involves how pandas writes and reads the json file.

According to their documentation (see here), one must be cautious about which json format (orientation type) will be applied in the write/read pandas methods.

If you use the orient='records', there will be several constraints that will cause the problem you presented in your question. They can surely be solved by a logical reading function, though the effort is meaningless since there are smarter and more ready-to-use ways for your problem.

Instead of addressing all the constraints caused by this orientation choice, you can choose another orientation for the read/write pandas functions so that the pandas.DataFrame index and columns are all preserved. For such, I suggest you use the default value (i.e, "index"), which does exactly what you need (even with the indentation that you required for visual inspection).

Below is a snippet that depicts each orientation approach.



import pandas as pd
import os

JSONFILE = os.path.join(os.getcwd(), "temp.json")

def writeToJson(df:pd.DataFrame, lines=False):
    
    """
    using the to_json with the option lines=True:
    the resultant file format will be:    
    '''        
        {
            "col 1":"a",
            "col 2":"b"
        }
    
        {
            "col 1":"c",
            "col 2":"d"
        }
    '''

        
    using the to_json with the option lines=False (i.e., the default) the resultant file format will be:    
    '''
        [
            {
                "col 1":"a",
                "col 2":"b"
            },
            {
                "col 1":"c",
                "col 2":"d"
            }
        ]
    '''    
    
    """
    
    df.to_json(JSONFILE, orient='index', indent=4, lines=lines)

def readJson():
    return pd.read_json(JSONFILE, orient='index')
    
if "__main__" == __name__:
    df = pd.DataFrame(
    [["a", "b"], ["c", "d"]],
    index=["row 1", "row 2"],
    columns=["col 1", "col 2"],
)
    writeToJson(df)
    
    df2 = readJson()

  • Related