I have a very large JSON file made up of multiple objects, one per line. For a small dataset, this works:
data=pd.read_json(file,lines=True)
but on the same but larger dataset it crashes on a computer with 8 GB of RAM, so I tried to convert it to a list first with the code below:
data = []
with open(file) as file:
    for i in file:
        d = json.loads(i)
        data.append(d)
then convert the list into a dataframe with
df = pd.DataFrame(data)
This converts it into a list fine even with the large dataset file, but it crashes when I try to convert the list into a dataframe, I presume because it uses too much memory.
I have also tried doing
data = []
with open(file) as file:
    for i in file:
        d = json.loads(i)
        df = pd.DataFrame([d])
I thought it would append the rows one by one, but I think it still creates one large copy in memory at once, so it still crashes.
How would I convert the large JSON file into a dataframe in chunks so that it limits the memory usage?
CodePudding user response:
There are several possible solutions, depending on your specific case. Given that we don't have a data example or information on the data structure, I can offer the following:
- If the data in the JSON file is numeric, consider breaking it into chunks, reading each one and converting it to the smallest type that fits (e.g. float32 or a small integer type), since pandas will otherwise use float64, which is more memory intensive (see the pandas sketch below).
- Use Dask for bigger-than-memory datasets like yours (see the Dask sketch below).
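A minimal sketch of the chunked pandas approach, assuming the file is newline-delimited JSON as in your question; the chunk size of 100,000 lines and the choice of float32 are assumptions you should adjust to your data:

import pandas as pd

pieces = []
# chunksize together with lines=True makes read_json return an iterator
# of DataFrames instead of parsing the whole file at once
for chunk in pd.read_json(file, lines=True, chunksize=100_000):
    # shrink numeric columns: float64 -> float32, int64 -> smallest int that fits
    for col in chunk.select_dtypes(include="float").columns:
        chunk[col] = chunk[col].astype("float32")
    for col in chunk.select_dtypes(include="integer").columns:
        chunk[col] = pd.to_numeric(chunk[col], downcast="integer")
    pieces.append(chunk)

# the concatenated frame still has to fit in memory, but at roughly half
# the size of the float64 version
df = pd.concat(pieces, ignore_index=True)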
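For the Dask route, a sketch assuming Dask is installed; blocksize (in bytes, only valid for line-delimited JSON) controls the partition size, and the value here is a guess:

import dask.dataframe as dd

# reads the file lazily in ~64 MB partitions instead of all at once
ddf = dd.read_json(file, lines=True, blocksize=64 * 1024 * 1024)

# ddf behaves much like a pandas DataFrame, but work is done per partition
print(len(ddf))     # row count
print(ddf.head())   # only the first partition is loaded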
CodePudding user response:
To avoid the intermediate data structures you can use a generator.
import json
import pandas as pd

def load_jsonl(filename):
    # yield one parsed record per line, so the whole file is never held as a list
    with open(filename) as fd:
        for line in fd:
            yield json.loads(line)

df = pd.DataFrame(load_jsonl(filename))
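If the resulting DataFrame itself is still too large for memory, the same idea can be combined with the chunking suggested above: a sketch of a hypothetical load_jsonl_chunks helper (the batch_size of 100,000 lines is a placeholder) that yields one DataFrame per batch, so each chunk can be processed and then discarded:

import json
import pandas as pd

def load_jsonl_chunks(filename, batch_size=100_000):
    # yield a DataFrame for every batch_size lines instead of one huge frame
    batch = []
    with open(filename) as fd:
        for line in fd:
            batch.append(json.loads(line))
            if len(batch) == batch_size:
                yield pd.DataFrame(batch)
                batch = []
    if batch:
        yield pd.DataFrame(batch)

for chunk in load_jsonl_chunks(filename):
    ...  # process each chunk here, then let it be garbage collected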