I have read about the multiprocessing package and the threading module, but I am not quite sure how to use them in my case, even though I think my pipeline could benefit from parallelism.
I'm currently writing a pipeline that processes and scrapes a bunch of HTML files. My cleaning method iterates through all HTML files and processes them by calling another method that extracts the data and returns a pandas DataFrame. The cleaning method currently waits for each file to finish parsing before starting the next one, which is why I think parallelizing the work would help here.
I'm not quite sure whether threading or multiprocessing is the right choice, but since the task is CPU-bound, multiprocessing should be perfect: CPython's GIL prevents threads from running Python bytecode in parallel, so threading mainly helps with I/O-bound work.
This is what my code looks like right now:
def get_clean_df(self):
    # iterate through all existing html files and parse them
    result = pd.DataFrame()
    for filepath in glob.glob("../data/source/*/*.html"):
        # expand the existing dataframe with the newly parsed result
        result = pd.concat([result, self._extract_df_from_html(filepath)])
    return result
Thanks for the help, guys.
CodePudding user response:
According to my comments, you can create something like this:
import pandas as pd
import multiprocessing
import glob

def extract_df_from_html(filepath):
    # must live at module level so the worker processes can pickle it
    # Do stuff here
    df = pd.DataFrame()
    return df

class Foo():
    def process(self):
        files = glob.glob("../data/source/*/*.html")
        # a pool of 4 worker processes parses the files in parallel;
        # pool.map blocks until every file has been processed and
        # returns the per-file DataFrames in input order
        with multiprocessing.Pool(4) as pool:
            result = pool.map(extract_df_from_html, files)
        # concatenate once at the end, which is much cheaper than
        # growing a DataFrame inside a loop
        self.result = pd.concat(result, ignore_index=True)

if __name__ == '__main__':
    foo = Foo()
    foo.process()
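The body of extract_df_from_html is left as a stub above. If the data you need happens to live in HTML tables (an assumption about your files; it also requires lxml or html5lib to be installed), a minimal version of the worker could look like this:

import pandas as pd

def extract_df_from_html(filepath):
    # pd.read_html parses every <table> in the file and returns a list
    # of DataFrames; this assumes the relevant data sits in the first table
    return pd.read_html(filepath)[0]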
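If you prefer the higher-level concurrent.futures API, an equivalent sketch with ProcessPoolExecutor (reusing the same hypothetical extract_df_from_html) would be:

import glob
from concurrent.futures import ProcessPoolExecutor

import pandas as pd

def extract_df_from_html(filepath):
    # placeholder worker, same as above
    return pd.DataFrame()

if __name__ == '__main__':
    files = glob.glob("../data/source/*/*.html")
    # executor.map yields results in input order, just like pool.map
    with ProcessPoolExecutor(max_workers=4) as executor:
        result = pd.concat(executor.map(extract_df_from_html, files),
                           ignore_index=True)

Both versions behave the same here; concurrent.futures is just a slightly more modern interface over the same process-based parallelism.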