I have read about the multiprocessing package and the threading module, but I am not quite sure how to use them in my case, even though I think my pipeline could benefit from parallelism.
I'm currently writing a pipeline that processes and scrapes a bunch of HTML files. My cleaning method iterates through all HTML files and processes them by calling another method that extracts the data and returns a pandas DataFrame. The cleaning method currently waits for each file to finish parsing before starting the next one, which is why I think parallelizing the work would help here.
I'm not quite sure whether threading or multiprocessing is the right choice, but since the task is CPU-bound, multiprocessing should be perfect: CPython's GIL prevents threads from running Python bytecode in parallel, so threading mainly helps with I/O-bound work.
This is what my code looks like right now:
def get_clean_df(self):
    # iterate through all existing html files and parse them
    result = pd.DataFrame()
    for filepath in glob.glob("../data/source/*/*.html"):
        # expand the existing dataframe with the newly parsed result
        result = pd.concat([result, self._extract_df_from_html(filepath)])
    return result
Thanks for the help, guys.
CodePudding user response:
According to my comments, you can create something like this:
import pandas as pd
import multiprocessing
import glob

def extract_df_from_html(filepath):
    # must live at module level so the worker processes can pickle it
    # Do stuff here
    df = pd.DataFrame()
    return df

class Foo():
    def process(self):
        files = glob.glob("../data/source/*/*.html")
        # a pool of 4 worker processes parses the files in parallel;
        # pool.map blocks until every file has been processed and
        # returns the per-file DataFrames in input order
        with multiprocessing.Pool(4) as pool:
            result = pool.map(extract_df_from_html, files)
        # concatenate once at the end, which is much cheaper than
        # growing a DataFrame inside a loop
        self.result = pd.concat(result, ignore_index=True)

if __name__ == '__main__':
    foo = Foo()
    foo.process()
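The body of extract_df_from_html is left as a stub above. If the data you need happens to live in HTML tables (an assumption about your files; it also requires lxml or html5lib to be installed), a minimal version of the worker could look like this:

import pandas as pd

def extract_df_from_html(filepath):
    # pd.read_html parses every <table> in the file and returns a list
    # of DataFrames; this assumes the relevant data sits in the first table
    return pd.read_html(filepath)[0]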
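If you prefer the higher-level concurrent.futures API, an equivalent sketch with ProcessPoolExecutor (reusing the same hypothetical extract_df_from_html) would be:

import glob
from concurrent.futures import ProcessPoolExecutor

import pandas as pd

def extract_df_from_html(filepath):
    # placeholder worker, same as above
    return pd.DataFrame()

if __name__ == '__main__':
    files = glob.glob("../data/source/*/*.html")
    # executor.map yields results in input order, just like pool.map
    with ProcessPoolExecutor(max_workers=4) as executor:
        result = pd.concat(executor.map(extract_df_from_html, files),
                           ignore_index=True)

Both versions behave the same here; concurrent.futures is just a slightly more modern interface over the same process-based parallelism.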