While parsing with bs4 (lxml backend) and looping through my files with ThreadPoolExecutor, I am seeing really slow results. I have searched the whole internet for faster alternatives. Parsing about 2000 cached files (1.2 MB each) takes about 15 minutes with max_workers=500 on ThreadPoolExecutor. I even tried parsing on Amazon AWS with 64 vCPUs, but the speed stays about the same.
I want to parse about 100k files, which will take hours. Why isn't the parsing speeding up when I add more workers? One file takes about 2 seconds. Why doesn't parsing 10 files with max_workers=10 also take about 2 seconds, since the threads run concurrently? Even 3 seconds would be fine. Instead, the more files there are and the more workers I assign, the slower it gets: about ~25 seconds per file instead of the 2 seconds I get when running a single file/thread. Why?
What can I do to get the desired 2-3 seconds per file while multiprocessing?
If not possible, any faster solutions?
My approach for the parsing is the following:
with open('cache/' + filename, 'rb') as f:
    s = BeautifulSoup(f.read(), 'lxml')
    s.whatever()
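For context, the parser function that gets submitted to the executor below is roughly this shape (the extracted field here is a placeholder; only the open-and-parse part matters):

from bs4 import BeautifulSoup

def parser(filename, EAN, SKU):
    # read the cached HTML file and parse it with the lxml backend
    with open('cache/' + filename, 'rb') as f:
        s = BeautifulSoup(f.read(), 'lxml')
    # placeholder extraction -- the real code pulls whatever fields are needed
    title = s.title.text.strip() if s.title else ''
    return EAN, SKU, title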
Any faster way to scrape my cached files?
# the multiprocessor:
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

future_list = []
with ThreadPoolExecutor(max_workers=500) as executor:
    for filename in os.listdir("cache/"):
        if filename.endswith(".html"):
            # filenames look like EAN_SKU.html
            fNametoString = str(filename).replace('.html', '')
            x = fNametoString.split("_")
            EAN = x[0]
            SKU = x[1]
            future = executor.submit(parser, filename, EAN, SKU)
            future_list.append(future)
        else:
            pass
    for f in as_completed(future_list):
        pass
CodePudding user response:
Try:
from bs4 import BeautifulSoup
from multiprocessing import Pool

def worker(filename):
    with open(filename, "r") as f_in:
        soup = BeautifulSoup(f_in.read(), "html.parser")
        # do some processing here
        return soup.h1.text.strip()

if __name__ == "__main__":
    filenames = ["page1.html", "page2.html", ...]  # you can use the glob module or populate the filenames list another way
    with Pool(4) as pool:  # 4 is the number of processes
        for result in pool.imap_unordered(worker, filenames):
            print(result)
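With 100k files, the filenames list can be built with glob and the pool sized to the machine's cores; a minimal sketch, assuming the EAN_SKU.html naming scheme from the question and lxml installed:

import glob
import os
from multiprocessing import Pool
from bs4 import BeautifulSoup

def worker(path):
    # EAN and SKU are assumed to be encoded in the filename as EAN_SKU.html
    name = os.path.basename(path).replace('.html', '')
    EAN, SKU = name.split('_', 1)
    with open(path, 'rb') as f_in:
        soup = BeautifulSoup(f_in.read(), 'lxml')
    # placeholder extraction
    title = soup.h1.text.strip() if soup.h1 else ''
    return EAN, SKU, title

if __name__ == '__main__':
    filenames = glob.glob('cache/*.html')
    # one process per core; more workers than cores won't speed up CPU-bound parsing
    with Pool(os.cpu_count()) as pool:
        # chunksize batches tasks to cut inter-process overhead on large file counts
        for result in pool.imap_unordered(worker, filenames, chunksize=16):
            print(result)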