I am looping through a bunch of pickle files, doing some calculations and sorting, and then saving the pickle to the same file. It takes about 15ms per iteration, and 180 iterations total. If I tried threading this instead of looping through it, would that mean the entire thing is done in 15ms?
Here is the code:
import pandas as pd
import os

files = os.listdir('folder')
for f in files:
    df = pd.read_pickle('folder/' + f)
    df = df.sort_values(by='time')
    df = df.iloc[-100:, :]
    df.to_pickle('folder/' + f)
Now, before you just say "try it and test the speed" - I don't know how to do threading and it will take me a bit to learn, so I thought I would just ask first. I am working on a desktop PC with an Intel i3-8109U, which I think has 4 logical processors? Not sure if that matters.
CodePudding user response:
Here's how you could do this using multiprocessing. If you want to try multithreading instead, just import ThreadPoolExecutor and use it in place of ProcessPoolExecutor; no other code changes are needed. To answer the original question directly: parallelism can at best divide the total time by the number of workers, so with 4 logical cores you should expect roughly a 4x speedup at best (minus overhead), not the whole job finishing in 15ms.
import pandas as pd
from concurrent.futures import ProcessPoolExecutor
from glob import glob
import time

def do_work(file):
    # Per-file work: sort by time, keep the last 100 rows, write back.
    df = pd.read_pickle(file)
    df = df.sort_values(by='time')
    df = df.iloc[-100:, :]
    df.to_pickle(file)

def main():
    start_time = time.perf_counter()
    with ProcessPoolExecutor() as executor:
        executor.map(do_work, glob('folder/*'))
    end_time = time.perf_counter()
    print(f'Duration={end_time - start_time:.2f} seconds')

if __name__ == '__main__':
    main()
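For completeness, here is what the thread-based variant mentioned above would look like. This is a sketch under the same assumptions as the original (a `folder/` directory of pickle files, each with a `time` column); the `max_workers=4` value is illustrative, not tuned. Since the per-file work is only ~15ms and partly disk I/O, threads may be a better fit here than processes, whose startup cost can exceed the work itself.

```python
import pandas as pd
from concurrent.futures import ThreadPoolExecutor
from glob import glob

def do_work(file):
    # Same per-file work as the multiprocessing version:
    # sort by time, keep the last 100 rows, write back.
    df = pd.read_pickle(file)
    df = df.sort_values(by='time')
    df = df.iloc[-100:, :]
    df.to_pickle(file)

def main():
    # Threads share one interpreter, so there is no process
    # startup or pickling overhead for the task arguments.
    with ThreadPoolExecutor(max_workers=4) as executor:
        executor.map(do_work, glob('folder/*'))

if __name__ == '__main__':
    main()
```

One caveat: file reads and writes release the GIL, but the `sort_values` call itself is CPU-bound Python-side work, so for much larger frames the process pool may still come out ahead. Timing both on your real data is the only reliable way to choose.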