I am processing around 2,800 Excel files using a Python file object, and reading them is slow: because of this my tool takes 5 hours to run. Is there any way to make reading the Excel files faster?
Code for reading the Excel files:
import os

path = os.getcwd()
folder = path + "\\input"
files = os.listdir(folder)
for file in files:
    _input = folder + "\\" + file
    f = open(_input)
    data = f.read()
CodePudding user response:
Try executing the processing of each Excel file in parallel with the others; have a look at:
CodePudding user response:
Fundamentally, there are two things you can do: speed up the processing of each file, or process multiple files simultaneously. Which is the better solution depends on why it is taking so long. A good first step is to check whether the per-file processing is already as fast as it can be.
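As a quick way to find out where the time goes, you can time each stage of the work separately. This is a minimal sketch: the timed functions here are stand-ins for your real reading and processing code, not part of any particular library.

```python
import time

def time_stage(func, *args):
    # Run one call and return (result, elapsed seconds),
    # so each stage of the pipeline can be measured separately.
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

# Example: time a CPU-bound stand-in. In your tool you would wrap
# the Excel-reading call and the processing call separately and
# compare the two elapsed times.
result, elapsed = time_stage(sum, range(1_000_000))
print(f"stage took {elapsed:.4f}s, result={result}")
```

If the read stage dominates, parallel I/O (threads) may help; if the processing stage dominates, you will likely need multiple processes.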
As for processing in parallel:
If a Python program is taking a long time to run because it's waiting for files to be read and written, it can help to use threading. This allows one thread to process one file while another thread waits for its data to be read or written. Whether this helps depends on many factors. If the processing itself accounts for most of the time, it won't help. If file IO accounts for most of the time, it might. Reading multiple files in parallel won't be faster than reading them sequentially if the hard drive is already serving them as fast as it can. Essentially, threading (in Python) only helps if the computer switches back and forth between waiting for the CPU to finish processing, waiting for the hard drive to write, waiting for the hard drive to read, etcetera. This is because of the Global Interpreter Lock in Python.
To work around the GIL, we need to use multiprocessing, where Python actually launches multiple separate processes. This allows it to use more CPU resources, which can dramatically speed things up. It doesn't come for free, however: each process takes much longer to start up than a thread, and processes can't share much in the way of resources, so they use more memory. Whether it's worth it depends on the task at hand.
The easiest (in my opinion) way to run multiple threads or processes in parallel is to use the concurrent.futures module. Assuming we have some function that we want to run on each file:
def process_file(file_path):
    pass  # do stuff
Then we can run this sequentially:
for file_name in some_list_of_files:
    process_file(file_name)
... or in parallel either via threads:
import concurrent.futures

number_of_threads = 4
with concurrent.futures.ThreadPoolExecutor(number_of_threads) as executor:
    for file_name in some_list_of_files:
        executor.submit(process_file, file_name)
# leaving the with-block waits for all submitted tasks to finish
print("all done!")
Or with multiprocessing:
if __name__ == "__main__":
    number_of_processes = 4
    with concurrent.futures.ProcessPoolExecutor(number_of_processes) as executor:
        for file_name in some_list_of_files:
            executor.submit(process_file, file_name)
    print("All done!")
We need the if __name__ == "__main__" bit because the processes that we spin up will actually import the Python file (but their name won't be "__main__"), so we need to stop them from recursively redoing the same work.
Which is faster will depend entirely on the actual work that needs doing. Sometimes it's faster to just do it sequentially in the main thread like in "normal" code.
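If you also need the result that each file's processing produces, submit makes you collect the futures yourself; executor.map is a convenient alternative that returns results in input order. A minimal sketch, where process_file is a placeholder for your real per-file function and the file list is made up:

```python
import concurrent.futures

def process_file(file_name):
    # Placeholder: real code would open and parse the Excel file here.
    return len(file_name)

some_list_of_files = ["a.xlsx", "bb.xlsx", "ccc.xlsx"]

with concurrent.futures.ThreadPoolExecutor(4) as executor:
    # map runs process_file on each item and yields results
    # in the same order as the input list.
    results = list(executor.map(process_file, some_list_of_files))

print(results)  # → [6, 7, 8]
```

The same pattern works with ProcessPoolExecutor (behind the usual if __name__ == "__main__" guard) when the work is CPU-bound.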