I have a very large JSON file made up of multiple objects, one per line. For a small dataset, this works:
data=pd.read_json(file,lines=True)
but on the same but larger dataset it crashes on a computer with 8 GB of RAM, so I tried to convert it to a list first with the code below:
data = []
with open(file) as file:
    for i in file:
        d = json.loads(i)
        data.append(d)
then convert the list into a dataframe with
df = pd.DataFrame(data)
This converts it into a list fine even with the large dataset file, but it crashes when I try to convert the list into a dataframe, I presume because it uses too much memory.
I have also tried doing
data = []
with open(file) as file:
    for i in file:
        d = json.loads(i)
        df = pd.DataFrame([d])
I thought it would append the rows one by one, but I think it still creates one large copy in memory at once, so it still crashes.
How would I convert the large JSON file into a dataframe in chunks so that it limits the memory usage?
CodePudding user response:
There are several possible solutions, depending on your specific case. Given that we don't have a data example or information on the data structure, I can offer the following:
- If the data in the JSON file is numeric, consider breaking it into chunks, reading each one and converting it to the smallest type that fits (e.g. float32 or a small integer type), since pandas will otherwise use float64, which is more memory intensive (see the pandas sketch below).
- Use Dask for bigger-than-memory datasets like yours (see the Dask sketch below).
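A minimal sketch of the chunked pandas approach, assuming the file is newline-delimited JSON as in your question; the chunk size of 100,000 lines and the choice of float32 are assumptions you should adjust to your data:

import pandas as pd

pieces = []
# chunksize together with lines=True makes read_json return an iterator
# of DataFrames instead of parsing the whole file at once
for chunk in pd.read_json(file, lines=True, chunksize=100_000):
    # shrink numeric columns: float64 -> float32, int64 -> smallest int that fits
    for col in chunk.select_dtypes(include="float").columns:
        chunk[col] = chunk[col].astype("float32")
    for col in chunk.select_dtypes(include="integer").columns:
        chunk[col] = pd.to_numeric(chunk[col], downcast="integer")
    pieces.append(chunk)

# the concatenated frame still has to fit in memory, but at roughly half
# the size of the float64 version
df = pd.concat(pieces, ignore_index=True)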
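For the Dask route, a sketch assuming Dask is installed; blocksize (in bytes, only valid for line-delimited JSON) controls the partition size, and the value here is a guess:

import dask.dataframe as dd

# reads the file lazily in ~64 MB partitions instead of all at once
ddf = dd.read_json(file, lines=True, blocksize=64 * 1024 * 1024)

# ddf behaves much like a pandas DataFrame, but work is done per partition
print(len(ddf))     # row count
print(ddf.head())   # only the first partition is loaded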
CodePudding user response:
To avoid the intermediate data structures you can use a generator.
import json
import pandas as pd

def load_jsonl(filename):
    # yield one parsed record per line, so the whole file is never held as a list
    with open(filename) as fd:
        for line in fd:
            yield json.loads(line)

df = pd.DataFrame(load_jsonl(filename))
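If the resulting DataFrame itself is still too large for memory, the same idea can be combined with the chunking suggested above: a sketch of a hypothetical load_jsonl_chunks helper (the batch_size of 100,000 lines is a placeholder) that yields one DataFrame per batch, so each chunk can be processed and then discarded:

import json
import pandas as pd

def load_jsonl_chunks(filename, batch_size=100_000):
    # yield a DataFrame for every batch_size lines instead of one huge frame
    batch = []
    with open(filename) as fd:
        for line in fd:
            batch.append(json.loads(line))
            if len(batch) == batch_size:
                yield pd.DataFrame(batch)
                batch = []
    if batch:
        yield pd.DataFrame(batch)

for chunk in load_jsonl_chunks(filename):
    ...  # process each chunk here, then let it be garbage collected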