I have a pandas DataFrame (named `df`) that looks like this:
search_term | fname
---|---
banana | fldr1
kiwi | fldr2
coffee | fldr3
and I'm using the following Python code to scrape images from Bing for each `search_term` and save them in a folder named after the corresponding `fname`:
```python
import os

from bing_image_downloader import downloader

for index, row in df.iterrows():
    print(row['search_term'])
    downloader.download(row['search_term'], limit=200, output_dir="FOLDERX", adult_filter_off=True, force_replace=False, timeout=60)
    # rename the downloaded folder (FOLDERX/<search_term>) to FOLDERX/<fname>
    os.rename(os.path.join("FOLDERX", row['search_term']), os.path.join("FOLDERX", row['fname']))
```
But I'd like to run this in parallel, since I have a lot of search terms to go through. For example, if there are 10 search terms, I'd like 2 parallel jobs to split the search terms between them and scrape images simultaneously. I'm running this in Google Colab, and so far I have tried:
```python
import multiprocessing
from joblib import Parallel, delayed

def scrape_bing(df):
    for index, row in df.iterrows():
        print(row['search_term'])
        downloader.download(row['search_term'], limit=200, output_dir="FOLDERX", adult_filter_off=True, force_replace=False, timeout=60)
        os.rename(os.path.join("FOLDERX", row['search_term']), os.path.join("FOLDERX", row['fname']))

Parallel(n_jobs=2)(delayed(scrape_bing)(i, j) for i in range(5) for j in range(2))
```
But I don't know how to modify the arguments in `delayed` to make it work. Help please?
CodePudding user response:
There is no need for extra indices here. You can simplify `scrape_bing` to operate on a single row, then iterate over the DataFrame rows directly in the `Parallel` call:
```python
import os
from joblib import Parallel, delayed
from bing_image_downloader import downloader

def scrape_bing(search_term, fname):
    downloader.download(search_term, limit=200, output_dir="FOLDERX", adult_filter_off=True, force_replace=False, timeout=60)
    os.rename(os.path.join("FOLDERX", search_term), os.path.join("FOLDERX", fname))

# one delayed call per row; joblib spreads the calls over 2 workers
Parallel(n_jobs=2)(delayed(scrape_bing)(row['search_term'], row['fname']) for index, row in df.iterrows())
```
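If you specifically want each worker to handle an explicit half of the DataFrame, as described in the question, here is a minimal sketch using `numpy.array_split` (the `scrape_chunk` helper is just for illustration; it reuses the `scrape_bing` defined above):

```python
import numpy as np

def scrape_chunk(chunk):
    # process one chunk of rows sequentially inside a single worker
    for index, row in chunk.iterrows():
        scrape_bing(row['search_term'], row['fname'])

n_jobs = 2
chunks = np.array_split(df, n_jobs)  # e.g. 10 rows -> two chunks of 5
Parallel(n_jobs=n_jobs)(delayed(scrape_chunk)(chunk) for chunk in chunks)
```

Since the work is mostly network I/O, you can also pass `prefer="threads"` to `Parallel` to use threads instead of separate processes, which avoids pickling the DataFrame chunks.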