I have a JSON file of size less than 1 GB. I am trying to read the file on a server that has 400 GB of RAM using the following simple command:
df = pd.read_json('filepath.json')
However, this code takes forever (several hours) to execute. I have tried several suggestions, such as
df = pd.read_json('filepath.json', low_memory=False)
or
df = pd.read_json('filepath.json', lines=True)
but none of them worked. How can reading a 1 GB file on a server with 400 GB of RAM be so slow?
CodePudding user response:
You can use chunking to shrink memory use. I also recommend the Dask library, which can load the data in parallel.
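A minimal sketch of the Dask approach, assuming the file is newline-delimited JSON (the filename is taken from the question; the blocksize value is only illustrative):

import dask.dataframe as dd

# Lazily split the file into partitions and parse them in parallel.
ddf = dd.read_json('filepath.json', lines=True, blocksize="64MB")

# Materialize the result as a regular pandas DataFrame.
df = ddf.compute()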
CodePudding user response:
Using the json module first, then pandas
import json
import pandas as pd

# Parse the file with the standard json module first.
with open("your_file.json", "r") as f:
    data = json.load(f)

# Build the DataFrame; adjust orient to match your JSON structure.
df = pd.DataFrame.from_dict(data, orient="index")
Directly using Pandas
df = pd.read_json("test.json", orient="records", lines=True, chunksize=5)
You said this option gives you a memory error, but there is an option that should help with it. Passing lines=True
and then specify how many lines to read in one chunk by using the chunk size argument. The following will return an object that you can iterate over, and each iteration will read only 5
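A short sketch of iterating over that reader and reassembling a single DataFrame (the file name and the 5-line chunk size are just the example values from above; in practice you would use a much larger chunksize):

import pandas as pd

# With chunksize, read_json returns an iterator of small DataFrames
# instead of loading the whole file at once.
reader = pd.read_json("test.json", orient="records", lines=True, chunksize=5)

chunks = []
for chunk in reader:       # each chunk is a DataFrame with 5 rows
    chunks.append(chunk)   # filter or aggregate here to keep memory low

df = pd.concat(chunks, ignore_index=True)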