Slow opening of files in Python


Currently, I'm writing a program which needs to load over 13K "*.json" files of different sizes, from a few lines to 100K lines. The reading looks like this:

[read_one_JSON(p) for p in filenames]

def read_one_JSON(path: str):
    with open(path, encoding='utf-8') as fh:
        data = json.load(fh)
        return File(data["_File__name"], data['_File__statements'], data['_File__matches'])

I load a file, pass its contents into the File class, and read the next file. Currently it takes about 2 minutes 20 seconds. I found out that when I remove the processing of the data into the class and just do:

[read_one_JSON(p) for p in filenames]

def read_one_JSON(path: str):
    with open(path, encoding='utf-8') as fh:
        data = json.load(fh)

it reduces the time by only 10 seconds, to 2 minutes 10 seconds. So I then also removed json.load to see where the reading time actually goes. When leaving just:

[read_one_JSON(p) for p in filenames]

def read_one_JSON(path: str):
    with open(path, encoding='utf-8') as fh:
        pass  # only open (and then close) the file; nothing is read

and not reading the data at all, it still takes 1 minute 45 seconds. That means opening the files themselves is the slow part. Is there any way to speed up the opening part of the process, without putting everything into one file or parallelizing? Both are options, but I would like to know if there is something else I can do about it.
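For reference, the same breakdown can be measured directly with per-phase timers instead of deleting code; a minimal sketch, assuming the same filenames list and File class as above:

import json
import time

def timed_read(paths):
    t_open = t_load = t_build = 0.0
    results = []
    for path in paths:
        t0 = time.perf_counter()
        fh = open(path, encoding='utf-8')   # opening phase
        t1 = time.perf_counter()
        data = json.load(fh)                # parsing phase
        fh.close()
        t2 = time.perf_counter()
        results.append(File(data["_File__name"], data['_File__statements'], data['_File__matches']))  # processing phase
        t3 = time.perf_counter()
        t_open += t1 - t0
        t_load += t2 - t1
        t_build += t3 - t2
    print(f"open: {t_open:.1f}s, json.load: {t_load:.1f}s, File(): {t_build:.1f}s")
    return results

This just makes the split described above explicit in a single run.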

Before I realised where this bottleneck was, I tried libraries like ujson, orjson and msgspec, but since the opening phase is the slow part, they made only a small difference.
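For example, with orjson the per-file read looked roughly like this (a sketch; orjson.loads accepts raw bytes, so the file is opened in binary mode):

import orjson  # pip install orjson

def read_one_JSON_orjson(path: str):
    with open(path, 'rb') as fh:
        data = orjson.loads(fh.read())  # parse the whole file contents at once
    return File(data["_File__name"], data['_File__statements'], data['_File__matches'])

That speeds up the parsing step, but since most of the time is spent before parsing even starts, the overall gain was small.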

CodePudding user response:

Creating 13,000 files in the current directory:

import json

from tqdm import tqdm  # pip install tqdm

for i in tqdm(range(13_000)):
    filename = f"data_{i}.json"
    data = {"filename": filename}
    with open(filename, "w") as file:
        json.dump(data, file)
100%|██████████| 13000/13000 [00:01<00:00, 8183.74it/s]

Which means it ran in less than 2 seconds on my computer. tqdm is just a very simple way to see the throughput. The script produced files like this:

{"filename": "data_0.json"}

Then reading them:

import json

from tqdm import tqdm  # pip install tqdm

for i in tqdm(range(13_000)):
    filename = f"data_{i}.json"
    with open(filename, "rt") as file:
        data = json.load(file)
print(data)
100%|██████████| 13000/13000 [00:00<00:00, 16472.00it/s]
{'filename': 'data_12999.json'}

Which means that they were all read in less than one second.

Maybe it comes from the size of the files you read: if many of them are large, reading will indeed take more time. But your disk does not seem to be the only cause of the slowness.
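One quick way to check that is to measure how much data is actually being read; a minimal sketch, assuming the filenames list from the question:

import os

# Total size of all the JSON files that get opened
total_bytes = sum(os.path.getsize(p) for p in filenames)
print(f"{len(filenames)} files, {total_bytes / 1_000_000:.1f} MB in total")

If the total is small but the run still takes minutes, per-file overhead is a more likely suspect than raw read throughput.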
