I have a JSON file of size less than 1 GB. I am trying to read the file on a server that has 400 GB of RAM using the following simple command:
df = pd.read_json('filepath.json')
However, this code takes forever (several hours) to execute. I have tried several suggestions, such as
df = pd.read_json('filepath.json', low_memory=False)
or
df = pd.read_json('filepath.json', lines=True)
but none of them worked. How can reading a 1 GB file on a server with 400 GB of RAM be so slow?
CodePudding user response:
You can use chunking to shrink memory use. I also recommend the Dask library, which can load the data in parallel.
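A minimal sketch of the Dask approach, assuming the file is newline-delimited JSON (the filename is taken from the question; the blocksize value is only illustrative):

import dask.dataframe as dd

# Lazily split the file into partitions and parse them in parallel.
ddf = dd.read_json('filepath.json', lines=True, blocksize="64MB")

# Materialize the result as a regular pandas DataFrame.
df = ddf.compute()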
CodePudding user response:
Using the json module first, then pandas
import json
import pandas as pd

# Parse the file with the standard json module first.
with open("your_file.json", "r") as f:
    data = json.load(f)

# Build the DataFrame; adjust orient to match your JSON structure.
df = pd.DataFrame.from_dict(data, orient="index")
Directly using Pandas
df = pd.read_json("test.json", orient="records", lines=True, chunksize=5)
You said this option gives you a memory error, but there is an option that should help with it. Passing lines=True
and then specify how many lines to read in one chunk by using the chunk size argument. The following will return an object that you can iterate over, and each iteration will read only 5
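A short sketch of iterating over that reader and reassembling a single DataFrame (the file name and the 5-line chunk size are just the example values from above; in practice you would use a much larger chunksize):

import pandas as pd

# With chunksize, read_json returns an iterator of small DataFrames
# instead of loading the whole file at once.
reader = pd.read_json("test.json", orient="records", lines=True, chunksize=5)

chunks = []
for chunk in reader:       # each chunk is a DataFrame with 5 rows
    chunks.append(chunk)   # filter or aggregate here to keep memory low

df = pd.concat(chunks, ignore_index=True)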